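I'm new to Scrapy and am trying to crawl a website with CrawlSpider, following the "Next" button recursively. It doesn't work. I think the problem is in the regular expression, but I've checked it many times and can't find the error. The spider only crawls the landing page and never moves on to the next page.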
# -*- coding: utf-8 -*-
start_urls = ['https://shopping.yahoo.com/merchantrating/?mid=13652']
rules = (
    Rule(LinkExtractor(allow = "/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"), callback = 'parse_start_url', follow = True),
)
def parse_start_url(self, response):
    sel = Selector(response)
    contents = sel.xpath('//p')
    for content in contents:
        item = BedbugsItem()
        item['pageContent'] = content.xpath('text()').extract()
        self.items.append(item)
    return self.items
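One way to test the suspicion about the regular expression is to try it outside Scrapy. The sketch below uses only the standard `re` module and assumes (the post does not confirm this) that the `;_ylt=...` token in the "Next" link can differ between page loads. The second URL and its token value are made up for illustration, and `TOKEN_AGNOSTIC` is a hypothetical looser pattern, not something from the original post.

```python
import re

# Pattern from the question, written as a raw string so that \? and \d
# are passed to the regex engine unchanged.
HARDCODED = r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"

# Hypothetical looser pattern: accept any (or no) ;_ylt=... token.
TOKEN_AGNOSTIC = r"/merchantrating/(;_ylt=[^?]*)?\?mid=13652&sort=1&start=\d+"

BASE = "https://shopping.yahoo.com"

# URL carrying exactly the token that is hard-coded in the question.
same_token_url = (
    BASE + "/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F"
    "?mid=13652&sort=1&start=10"
)

# Made-up URL with a different token value (assumption for illustration).
other_token_url = (
    BASE + "/merchantrating/;_ylt=HYPOTHETICALTOKEN?mid=13652&sort=1&start=10"
)


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for label, pattern in (("hard-coded", HARDCODED), ("token-agnostic", TOKEN_AGNOSTIC)):
        for url in (same_token_url, other_token_url):
            print(label, matches(pattern, url), url)
```

If the token really does change, the hard-coded pattern stops matching as soon as the page is fetched again, which would be consistent with the spider never leaving the landing page.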
Use XPath instead, restricting link extraction to the "Next" link in the pagination block:
rules = (
    Rule(LinkExtractor(
        restrict_xpaths = [
            "//div[@class='pagination']//a[contains(., 'Next')]"
        ]),
        callback = 'parse_start_url',
        follow = True),
)