Web scraping with R

Posted by CKre in Dev

CKre

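I am having some trouble scraping data from a website. First, I do not have much experience with web scraping. My plan is to use R to scrape some data from the following site: http://spiderbook.com/company/17495/details?rel=300795

In particular, I want to extract the links to the articles on this site.

My idea so far: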

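library(XML)
xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
# sourcesChar is never defined in the snippet as posted; assumed here to be each node serialized to text
sourcesChar <- sapply(sources, saveXML)
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))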

But this does not bring up the intended information. Some help would be really appreciated. Thanks!

Best, Christoph

jlhoward

You picked a tough problem to learn on.

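This site uses JavaScript to load the article information. In other words, the link loads a set of scripts that run when the page loads, fetch the information (probably from a database), and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that. So the links you want are simply not present.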
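To make the point concrete, here is a minimal sketch that parses a made-up page whose only article link is created by a script. The HTML string and the example.com URL are hypothetical stand-ins, not the actual spiderbook.com source. Since nothing executes the script, the XPath query should come back empty even though the script element itself is in the document.

```r
library(XML)

# Hypothetical page source: the <a class="doclink"> element exists only
# after a browser has executed the script.
static_html <- '<html><body>
<div id="articles"></div>
<script>
  var a = document.createElement("a");
  a.className = "doclink";
  a.href = "http://example.com/article";
  document.getElementById("articles").appendChild(a);
</script>
</body></html>'

# htmlParse() builds a tree from the markup as delivered; it runs no scripts.
doc <- htmlParse(static_html, asText = TRUE)

# Look for the article links the same way one would on the rendered page.
links <- as.character(
  xpathSApply(doc, '//a[@class="doclink"]', xmlGetAttr, "href")
)
n_scripts <- length(getNodeSet(doc, "//script"))

print(length(links))  # number of doclink anchors found in the static tree
print(n_scripts)      # number of <script> elements found in the static tree
```

The same query only finds links once something has executed the script, which is the job a real browser (or a tool that drives one) does.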

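AFAIK the only way around this is to use the RSelenium package. It essentially lets you pass the base HTML through what looks like a browser simulator, one that does run the scripts. The catch with RSelenium is that you need to download not only the package but also a "Selenium Server". The linked introduction to RSelenium is a good starting point.

Once you have done that, inspecting the source in a browser shows that the article links are all in the href attribute of anchor tags that have class=doclink. These are straightforward to extract using XPath. Never use regex to parse XML.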

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
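# note: checkForServer() and startServer() are defunct in later RSelenium releases, where rsDriver() replaces them
checkForServer()        # download Selenium Server, if not already present
startServer()           # start Selenium Server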
remDr <- remoteDriver() # instantiates a new driver
remDr$open()            # open connection
remDr$navigate(url)     # grab and process the page (including scripts)
doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"                                                                                    
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
# [7] "http://www.calcharge.org/2014/07/"                                                                                    
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
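
As far as I know, later RSelenium releases made checkForServer() and startServer() defunct in favour of rsDriver(), so the same steps might look roughly like the sketch below. The helper names scrape_doclinks and scrape_live, the choice of Firefox, and the stub driver with its example.com links are all assumptions made for illustration. The live variant is defined but not called, because it needs a local browser and network access.

```r
library(XML)

# Navigate an already-open driver to `url`, then pull the href of every
# <a class="doclink"> out of the rendered page source.
scrape_doclinks <- function(remDr, url) {
  remDr$navigate(url)
  doc <- htmlParse(remDr$getPageSource()[[1]], asText = TRUE)
  as.character(xpathSApply(doc, '//a[@class="doclink"]', xmlGetAttr, "href"))
}

# Live variant (assumes an RSelenium release that provides rsDriver() and a
# local Firefox install). Not called here.
scrape_live <- function(url) {
  rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  on.exit({
    rD$client$close()  # close the browser session
    rD$server$stop()   # shut down the Selenium server
  })
  scrape_doclinks(rD$client, url)
}

# Offline check: a hypothetical stub stands in for the remote driver and
# returns a small, made-up "rendered" page.
stub <- list(
  navigate = function(url) invisible(NULL),
  getPageSource = function() list(paste0(
    '<html><body>',
    '<a class="doclink" href="http://example.com/a">A</a>',
    '<a class="other" href="http://example.com/skip">skip</a>',
    '<a class="doclink" href="http://example.com/b">B</a>',
    '</body></html>'
  ))
)

links <- scrape_doclinks(stub, "http://example.com/page")
print(links)
```

Keeping the extraction step in its own function means the XPath logic can be checked without starting a browser, and on.exit() makes sure the session and server are shut down even if navigation fails.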


Edited on 2021-02-14
