
Scrapy downloader middleware (selenium + PhantomJS) cannot obtain cookies

I plan to scrape data from a forum using Scrapy + selenium + PhantomJS. The login is submitted with Scrapy's FormRequest.from_response. Without a custom middleware the login works fine, but after adding a custom downloader middleware, requests can no longer obtain cookies. The code is as follows:

#spider.py
import scrapy
from scrapy.http import Request, FormRequest
from scrapy.http.cookies import CookieJar
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class mTeamSpider(CrawlSpider):
    cookie_jar = CookieJar()
    name = '*'
    allowed_domains = ['*']
    start_urls = ['*']

    rules = (
        Rule(LinkExtractor(allow=(r'details.php\?id=\d+')), callback='parse_detial_item'),
    )

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4,zh-TW;q=0.2,ja;q=0.2",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        # Content-Length is computed by Scrapy automatically; do not hard-code it.
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    }

    def start_requests(self):
        return [Request("https://tp.m-team.cc/adult.php", meta={'cookiejar': self.cookie_jar}, callback=self.post_login)]

    def post_login(self, response):
        print('Preparing login')
        return [FormRequest.from_response(
            response,
            url='*/takelogin.php',
            meta={'cookiejar': response.meta['cookiejar']},
            # headers=self.headers,
            formdata={
                'username': settings['FROM_USERNAME'],
                'password': settings['FROM_PASSWORD'],
            },
            callback=self.after_login,
            dont_filter=True,
        )]

    def after_login(self, response):
        if '*' in str(response.body):
            print('Success')
        else:
            print('login fails')
        with open('filename.html', 'wb') as f:
            f.write(response.body)
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'cookiejar': response.meta['cookiejar']}, dont_filter=True)
#middleware.py
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)

class JSMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == "*":
            print("PhantomJS is starting...")
            driver = webdriver.PhantomJS(executable_path=r"./phantomjs/bin/phantomjs",
                                         desired_capabilities=dcap)
            url = str(request.url)
            driver.get(url)
            content = driver.page_source
            # quit() (not close()) is needed to actually terminate the
            # PhantomJS process.
            driver.quit()
            return HtmlResponse(request.url, body=content, encoding='utf-8', request=request)
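One likely cause: PhantomJS opens its own browser session, so the cookies Scrapy obtained at login never reach the driver. They could be forwarded by parsing the request's `Cookie` header and feeding it to `driver.add_cookie()`. Note that selenium requires the browser to already be on the cookie's domain before `add_cookie()` is called, so the page has to be loaded once first. A minimal sketch, where `cookie_header_to_selenium` is a hypothetical helper, not part of Scrapy or selenium:

```python
def cookie_header_to_selenium(header_value):
    """Parse a raw ``Cookie`` request-header string (e.g. b'a=1; b=2')
    into the list of dicts that selenium's driver.add_cookie() expects.

    Hypothetical helper for illustration only."""
    if isinstance(header_value, bytes):
        header_value = header_value.decode('utf-8')
    cookies = []
    for pair in header_value.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue
        name, _, value = pair.partition('=')
        cookies.append({'name': name.strip(), 'value': value.strip()})
    return cookies

# Inside process_request, after creating the driver, something like:
#   driver.get(url)                      # must be on the domain first,
#   for c in cookie_header_to_selenium(  # or add_cookie() raises an error
#           request.headers.get('Cookie', b'')):
#       driver.add_cookie(c)
#   driver.get(url)                      # reload with the session attached
```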

According to the COOKIES_DEBUG output, cookies can no longer be obtained once the middleware is enabled. I also tried extracting the cookies with extract_cookies, without success:

        cookie_jar = response.meta['cookiejar']
        cookie_jar.extract_cookies(response, response.request)
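extract_cookies presumably fails here because the HtmlResponse built in the middleware carries no `Set-Cookie` headers: PhantomJS kept the session to itself. One workaround would be converting `driver.get_cookies()` into `Set-Cookie` header values before building the response, so the CookieJar has something to extract. `selenium_cookies_to_set_cookie` below is a hypothetical helper, and only name/value/path/domain are carried over:

```python
def selenium_cookies_to_set_cookie(cookies):
    """Convert the list of dicts returned by driver.get_cookies() into
    Set-Cookie header values usable on a Scrapy Response.

    Hypothetical helper for illustration only."""
    header_values = []
    for c in cookies:
        parts = ['%s=%s' % (c['name'], c['value'])]
        if c.get('path'):
            parts.append('Path=%s' % c['path'])
        if c.get('domain'):
            parts.append('Domain=%s' % c['domain'])
        header_values.append('; '.join(parts))
    return header_values

# In the middleware, before driver.quit():
#   headers = {'Set-Cookie': selenium_cookies_to_set_cookie(driver.get_cookies())}
#   return HtmlResponse(request.url, headers=headers, body=content,
#                       encoding='utf-8', request=request)
```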

Overriding the process_request method of the cookies middleware did not work either. Is there some other way to handle this, or does the login have to be done through selenium?
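The login does not necessarily have to go through selenium. Since the middleware currently intercepts every request for the spider, an alternative is to let the login POST bypass PhantomJS entirely: return None from process_request for any request not explicitly marked for JS rendering, so Scrapy's default downloader (and its built-in CookiesMiddleware) handles it normally. A minimal sketch, using a hypothetical `render_js` meta flag:

```python
class SelectiveJSMiddleware(object):
    """Render only requests explicitly flagged with meta['render_js'];
    everything else (including the login FormRequest) falls through to
    Scrapy's default downloader, so the built-in CookiesMiddleware keeps
    tracking the session as usual."""

    def process_request(self, request, spider):
        if not request.meta.get('render_js'):
            # Returning None tells Scrapy to continue processing this
            # request through the normal download chain.
            return None
        # ...PhantomJS rendering, as in JSMiddleware above, would go here...
```

Requests that do need rendering would then be created with `meta={'render_js': True, ...}` alongside the existing `cookiejar` key.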
