Scraping lineup data from Pro Football Reference using R

杰里米·洛萨克

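I always seem to run into trouble when scraping the Reference sites with either Python or R. Whether I use my usual xpath approach (Python) or the rvest approach in R, the table I want never seems to be picked up by the scraper.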

library(rvest)

url = 'https://www.pro-football-reference.com/years/2016/games.htm'

webpage = read_html(url)

# Collect every link in the first table, then keep only the "boxscore" links
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = table_links[html_text(table_links) == "boxscore"]
# Read the relative path from each link's href attribute
boxscore_paths = html_attr(boxscore_links, "href")

for (path in boxscore_paths) {
  url2 = paste0('https://www.pro-football-reference.com', path)
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
  # These two come back empty: the problem described below
  home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
  # code that will bind lineup tables with some master table -- to be written later
}

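I am trying to scrape the starting lineup tables. The first block of code pulls the URLs of all the 2016 boxscores, and the for loop then visits each boxscore page, hoping to extract the tables headed "Insert Team Here" Starters.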

Here is an example link: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'

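When I run the code above, the home_starters and home_starters2 objects contain zero elements (ideally they would contain the table, or the elements of the table, that I am trying to pull in).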

I appreciate your help!

亚历克斯·奇萨赞

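I spent three hours trying to figure this out. This is how it should be done: the table is tucked inside an HTML comment in the page source, so you select the comment nodes, pull out their text, and parse that text again as HTML. This is my example, but I am sure you can apply it to yours.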

library(rvest)

"https://www.pro-football-reference.com/years/2017/" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%    # select comments
  html_text() %>%    # extract comment text
  paste(collapse = '') %>%    # collapse to single string
  read_html() %>%    # reread as HTML
  html_node('table#returns') %>%    # select desired node
  html_table()
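
To show why the comment step matters, here is a minimal sketch that needs no network access. The page markup, the player rows, and the helper name `read_commented_table` are all made up for illustration. The table id `home_starters` is only an assumption modeled on the ids in the question, not something checked against the live site.

```r
library(rvest)  # also provides %>% and read_html()

# Hypothetical helper (not part of rvest): parse a table hidden in HTML comments
read_commented_table <- function(page, css) {
  page %>%
    html_nodes(xpath = "//comment()") %>%  # select every comment node
    html_text() %>%                        # comment bodies as strings
    paste(collapse = "") %>%               # join into one string of markup
    read_html() %>%                        # parse that markup as HTML
    html_node(css) %>%                     # pick the table we want
    html_table()
}

# Toy page, invented for illustration: the table exists only inside a comment
toy_html <- '<html><body>
<div id="all_home_starters">
<!--
<table id="home_starters">
<tr><th>Player</th><th>Pos</th></tr>
<tr><td>Player A</td><td>QB</td></tr>
<tr><td>Player B</td><td>RB</td></tr>
</table>
-->
</div>
</body></html>'

page <- read_html(toy_html)

# A direct selector finds nothing, because the parser sees only a comment node
direct_hits <- page %>% html_nodes("table#home_starters")
length(direct_hits)

# Re-parsing the comment text exposes the table
starters <- read_commented_table(page, "table#home_starters")
starters
```

Inside the loop from the question, the equivalent call would presumably be read_commented_table(webpage2, "table#home_starters"), again assuming that id matches what the page source actually uses. Searching the raw page source for the table is a quick way to confirm the real id before relying on it.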
