Using rvest to extract a list of names and their underlying hyperlinks from a web page

Abe

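I'm new to web scraping and am trying to understand how to use rvest to collect data from a web page. The page of interest is https://www.cabq.gov/office-of-neighborhood-coordination/neighborhood-homeowner-coalition-websites, which provides a list of neighborhood organizations, each with an underlying hyperlink to that organization's website. I'm trying to produce a data frame whose first column is the organization name and whose second column is the URL from the hyperlink.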

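I've followed several rvest tutorials and Stack Overflow questions, trying to parse out the appropriate nodes to extract the information I'm interested in, to no avail. The desired output looks like this (the dots simply mark the rows omitted between the start and the end of the target table):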

| organization                                   | URL                                 |
| ---------------------------------------------- | ----------------------------------- |
| 7 Bar North Homeowners Association             | https://www.7barnorthhoa.com/       |
| Academy Acres North Neighborhood Association   | http://www.aanna.org/               |
| ...                                            | ...                                 |
| Willow Wood Neighborhood Association           | http://www.hoamcoweb.com/willowwood |
| Winrock Villas Condominium Association         | http://winrockvillas.hoaspace.com/  |

My attempt at the code is below.

library(xml2)
library(rvest)
library(tidyverse)

URL <- "https://www.cabq.gov/office-of-neighborhood-coordination/neighborhood-homeowner-coalition-websites"

pg <- read_html(URL)

html_nodes(pg, "external-link") %>% 
  map_df(function(x) {
    data_frame(
      postal = html_node(x, "span") %>% html_text(trim=TRUE),
      city = html_nodes(x, "ul > li") %>% html_text(trim=TRUE)
    )
  })  
#> # A tibble: 0 x 0

Created on 2021-02-15 by the reprex package (v0.3.0)

Any help would be greatly appreciated.

Paris José R. Ferrar

First, I think you need to use an XPath expression to get the right kind of links. You are interested in elements with the class external-link, so you can use:

html_nodes(pg, xpath="//a[@class='external-link']")
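
As a side note, the original call returned nothing because a bare "external-link" in a CSS selector is read as a tag name, not a class. The sketch below illustrates the difference on a small inline HTML snippet; the snippet and the names `pg_demo`, `links_css` and `links_xpath` are made up for illustration and are not taken from the real page.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the example runs offline
pg_demo <- read_html('
  <ul>
    <li><a class="external-link" href="http://example.org/a">Org A</a></li>
    <li><a class="internal-link" href="/about">About</a></li>
  </ul>')

# "external-link" is interpreted as an <external-link> tag, so nothing matches
length(html_nodes(pg_demo, "external-link"))

# A leading dot selects by class; "a.external-link" also requires an <a> tag
links_css <- html_nodes(pg_demo, "a.external-link")

# The XPath form shown above, for comparison
links_xpath <- html_nodes(pg_demo, xpath = "//a[@class='external-link']")

html_text(links_css)
html_text(links_xpath)
```

One difference to keep in mind: `@class='external-link'` only matches when the class attribute is exactly that string, while the CSS form `.external-link` also matches elements that carry additional classes.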

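You can build more complex XPath expressions to fit your needs. Then you need to extract the text and one attribute of each element, which you can do with: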

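html_nodes(pg, xpath = "//a[@data-linktype='external' or @class='external-link']") %>%
  map_df(function(x) {
    # One row per link: the visible text and the href attribute
    # (tibble() replaces the deprecated data_frame())
    tibble(
      organization = x %>% html_text(trim = TRUE),
      URL = x %>% html_attr("href")
    )
  })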
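
A possible simplification, offered as a sketch and not as the only way to do it: `html_text()` and `html_attr()` both accept a whole node set and return one value per node, so the columns can be built directly without `map_df()`. The inline HTML below is a hypothetical stand-in for the real page. It reuses two rows from the desired output, but the markup around them is assumed, and `pg_demo`, `links` and `result` are illustrative names.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for the real page, so the example runs offline
pg_demo <- read_html('
  <p>
    <a class="external-link" href="https://www.7barnorthhoa.com/"> 7 Bar North Homeowners Association </a>
    <a data-linktype="external" href="http://www.aanna.org/">Academy Acres North Neighborhood Association</a>
    <a class="internal-link" href="/contact">Contact</a>
  </p>')

# Same XPath as above: keep only the links marked as external
links <- html_nodes(
  pg_demo,
  xpath = "//a[@data-linktype='external' or @class='external-link']"
)

# Each call returns a character vector with one element per matched link
result <- tibble(
  organization = html_text(links, trim = TRUE),
  URL = html_attr(links, "href")
)

result
```

On the real page you would pass `pg` in place of `pg_demo`; whether every organization link carries one of these two attributes is something to check against the page source.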
