
[Python] Scraping a website's blog directory with Python

Step 1: install requests-html

  • Upgrade pip
pip install --upgrade pip
  • Upgrade urllib3
sudo python3 -m pip install --upgrade urllib3
  • Install requests-html
sudo python3 -m pip install requests-html

Step 1.1: install requests-html for the project

  • Edit setup.py,

adding

install_requires=[
    'requests-html',
],
  • Edit launch.json (the VS Code debug configuration),

adding

"pythonPath": "/usr/bin/python3"
  • Install from the command line
sudo python3 -m setup install
  • Use it in a Python file
from requests_html import HTMLSession

Step 2: build on youtube-dl

  • Create a new info extractor class
class XxxIE(InfoExtractor):
  • Define the regex that decides which URLs the extractor handles
_VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'
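To see which URLs this pattern accepts, it can be tried on its own with the standard re module. The following is only a sketch: the sample URLs are made up for illustration, is_blog_url is a hypothetical helper name, and xxx.com is the placeholder domain used throughout this post.

```python
import re

# Pattern copied from the extractor; xxx.com is the post's placeholder domain.
VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'


def is_blog_url(url):
    """Return True when the URL would be claimed by the extractor."""
    return re.match(VALID_URL, url) is not None


# Hypothetical sample URLs, for illustration only.
samples = [
    'https://www.xxx.com/people/someone/posts/',
    'http://m.xxx.com/someone/post/42',
    'https://www.xxx.com/',
    'https://example.com/someone/posts/',
]
for sample in samples:
    print(sample, is_blog_url(sample))
```

Only the first two samples match. Because both `.+` parts accept any characters, the pattern is fairly permissive, so a stricter expression may be worth considering if unrelated pages on the same domain contain "post" in their path.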

The corresponding source code

After startup,

  • execution first passes through this method in YoutubeDL.py
def extract_info(self, url, download=True, ie_key=None, extra_info={},
                 process=True, force_generic_extractor=False):
    # ...
    for ie in ies:
        if not ie.suitable(url):
            continue
        # ...
  • and then through this method in common.py, under the extractor folder
@classmethod
def suitable(cls, url):
    if '_VALID_URL_RE' not in cls.__dict__:
        cls._VALID_URL_RE = re.compile(cls._VALID_URL)
    # ...
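The two excerpts above fit together as a simple dispatch loop: each extractor compiles its own pattern once, and the first extractor whose pattern matches the URL takes the job. The sketch below is a simplified stand-in, not youtube-dl's actual code: BaseIE, FallbackIE, and pick_extractor are hypothetical names, and it works with classes directly for brevity.

```python
import re


class BaseIE:
    """Simplified stand-in for InfoExtractor, reduced to URL matching."""
    _VALID_URL = None

    @classmethod
    def suitable(cls, url):
        # Compile once per class. Checking cls.__dict__ (rather than hasattr)
        # keeps a subclass from reusing the regex cached on its parent.
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        return cls._VALID_URL_RE.match(url) is not None


class XxxIE(BaseIE):
    _VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'


class FallbackIE(BaseIE):
    # Hypothetical catch-all extractor, tried last.
    _VALID_URL = r'.*'


def pick_extractor(url, ies):
    """Return the first extractor class whose pattern matches the URL."""
    for ie in ies:
        if not ie.suitable(url):
            continue
        return ie
    return None


ies = [XxxIE, FallbackIE]
print(pick_extractor('https://www.xxx.com/someone/posts/', ies).__name__)  # XxxIE
print(pick_extractor('https://example.com/', ies).__name__)  # FallbackIE
```

Order matters in this arrangement: a specific extractor has to come before any broader one in the list, otherwise the broader pattern claims the URL first.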

2.1 The rest is handled by

class XxxIE(InfoExtractor):
  • First, import the class in the relevant file under the extractor folder

  • Then do the downloading and scraping inside XxxIE
import json

from requests_html import HTML

from .common import InfoExtractor


class XxxIE(InfoExtractor):
    _GEO_COUNTRIES = ['CN']
    IE_NAME = 'xxx: blog'
    IE_DESC = 'wo qu'
    _VALID_URL = r'https?://(?:www\.|m\.)?xxx\.com.+posts?.+'
    _TEMPLATE_URL = '%s://www.xxx.com/%s/posts/%s/'
    _LIST_VIDEO_RE = r'<a[^>]+?href="(?P<url>/%s/sound/(?P<id>\d+)/?)"[^>]+?title="(?P<title>[^>]+)">'

    def _real_extract(self, url):
        scheme = 'https' if url.startswith('https') else 'http'
        print("start ya yay  ya")
        print("\n\n\n")
        # Page 1 is the bare URL; pages 2 to 19 are requested with ?page=N.
        self.downloadX(url, 1)
        for index in range(2, 20):
            src = url + "?page=" + str(index)
            self.downloadX(src, index)
        print("\n\n\n")
        return {}

    def downloadX(self, src, index):
        # Placeholder id; it only appears in the download log messages.
        audio_id = 123456
        webpage = self._download_webpage(
            src, audio_id,
            note='Download sound page for %s' % audio_id,
            errnote='Unable to get sound page')
        html = HTML(html=webpage)
        # print(webpage)
        # The page embeds its initial state as JSON in the #js-initialData element.
        jsonElement = html.find('#js-initialData')
        jsonInfo = jsonElement[0].text
        jsonX = json.loads(jsonInfo)
        dic = jsonX['initialState']['entities']['articles']
        print("page:    " + str(index) + "  :  ")
        # Print the title of every article listed on this page.
        for k, v in dic.items():
            # pprint(v)
            t = v.get('title')
            print(t)
        print("\n")
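The parsing step in downloadX can be tried without youtube-dl or a network connection. The sketch below makes several assumptions: it swaps requests-html for the standard library's html.parser so that it runs with no extra installs, it assumes the js-initialData element is a script tag, and the sample page is invented to mirror the initialState → entities → articles structure the extractor reads. InitialDataParser, extract_titles, and page_urls are hypothetical helper names.

```python
import json
from html.parser import HTMLParser


class InitialDataParser(HTMLParser):
    """Collect the text inside the element whose id is js-initialData."""

    def __init__(self):
        super().__init__()
        self._open_tag = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get('id') == 'js-initialData':
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self.chunks.append(data)


def extract_titles(webpage):
    """Return the article titles embedded in the page, or [] if none are found."""
    parser = InitialDataParser()
    parser.feed(webpage)
    parser.close()
    if not parser.chunks:
        return []
    data = json.loads(''.join(parser.chunks))
    articles = data['initialState']['entities']['articles']
    return [article.get('title') for article in articles.values()]


def page_urls(url, last_page):
    """Page 1 is the bare URL; later pages append ?page=N, as in the extractor."""
    return [url] + ['%s?page=%d' % (url, n) for n in range(2, last_page + 1)]


# Hypothetical page body, shaped like the structure the extractor reads.
sample = '''<html><body>
<script id="js-initialData" type="text/json">
{"initialState": {"entities": {"articles": {
  "1": {"title": "First post"},
  "2": {"title": "Second post"}}}}}
</script>
</body></html>'''

print(extract_titles(sample))
print(page_urls('https://www.xxx.com/someone/posts', 3))
```

Returning an empty list when the element is missing avoids the IndexError that jsonElement[0] would raise on a page past the last one, which matters when the page range is fixed in advance.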

Code link
