Web scraping with R: full website data not loaded

Iced Coffee

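I am trying to scrape the following website with R: https://www.ebi.ac.uk/gxa/genes/ensg00000177455?bs=%7B%22homo%20sapiens%22%3A%5B%22ORGANISM_PART%22%5D%7D&ds=%7B%22kingdom%22%3A%5B%22animals%22%5D%7D#differential

I want the information in the table. It does not have to be in any particular format; I just need the table contents.

However, when I use: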

library(RCurl)
website = getURL("https://www.ebi.ac.uk/gxa/genes/ensg00000177455?bs=%7B%22homo%20sapiens%22%3A%5B%22ORGANISM_PART%22%5D%7D&ds=%7B%22kingdom%22%3A%5B%22animals%22%5D%7D#differential")

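the table information is not present in the website object.

I thought this might be because the site uses JavaScript, but when I tried scraping it with PhantomJS I did not get the table information either.

For reference, the .js script I used is: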

#!/usr/bin/env phantomjs

"use strict";

var system = require('system');
var fs = require('fs');

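// Assumption: the target URL is passed as the first command-line argument.
var url = system.args[1];

var page = new WebPage();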

page.open(url, function (status) {
  just_wait();
});

function just_wait() {
  setTimeout(function() {
    fs.write('temp.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}

Can anyone suggest how to get this data in R?

Low

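If you right-click the page, choose "Inspect Element" and go to the "Network" tab, you can see the requests the page makes. If you refresh the page, you will see a large XHR (data) request to https://www.ebi.ac.uk/gxa/json/search/differential_results?geneQuery=%255B%257B%2522value%2522%253A%2522ensg00000177455%2522%257D%255D&conditionQuery=&species=homo+sapiens, which contains the table you want.

This can be read easily in R with jsonlite, for example: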

url <- "https://www.ebi.ac.uk/gxa/json/search/differential_results?geneQuery=%255B%257B%2522value%2522%253A%2522ensg00000177455%2522%257D%255D&conditionQuery=&species=homo+sapiens"

res <- jsonlite::read_json(url)

# the first row
res[["results"]][[1]]
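
The geneQuery value in that URL looks like a small JSON array, [{"value":"ensg00000177455"}], that has been percent-encoded twice. If that reading is right, the same URL can be built for another gene ID along the lines of the sketch below. build_gene_query() is a hypothetical helper written for this example, and the assumption that the endpoint expects double encoding comes only from the URL observed in the browser.

```r
# Hypothetical helper: builds the doubly percent-encoded geneQuery value.
# Assumes the endpoint expects the JSON array to be encoded twice, as in
# the URL captured from the Network tab.
build_gene_query <- function(gene_id) {
  json <- sprintf('[{"value":"%s"}]', gene_id)
  once <- utils::URLencode(json, reserved = TRUE)
  # repeated = TRUE forces a second pass, so every "%" becomes "%25".
  utils::URLencode(once, reserved = TRUE, repeated = TRUE)
}

gene_query <- build_gene_query("ensg00000177455")

url <- paste0(
  "https://www.ebi.ac.uk/gxa/json/search/differential_results",
  "?geneQuery=", gene_query,
  "&conditionQuery=&species=homo+sapiens"
)
```

Calling utils::URLdecode() twice on gene_query recovers the original JSON string, which is a quick way to check what a captured request is actually asking for.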

To convert the nested list structure into a data.frame, I suggest looking at https://tidyr.tidyverse.org/reference/hoist.html.
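
As a rough illustration of how hoist() works, here is a minimal sketch on a made-up list. The field names (id, stats, foldChange, pValue) are invented for this example and are not taken from the real response, so inspect res[["results"]][[1]] first and substitute the names you actually see.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for res[["results"]]; the field names here are
# assumptions for illustration, not the real API fields.
results <- list(
  list(id = "gene-1", stats = list(foldChange = 2.5, pValue = 0.01)),
  list(id = "gene-2", stats = list(foldChange = -1.2, pValue = 0.03))
)

# One row per result, with each result stored in a list-column.
df <- tibble(result = results)

# Pull selected fields out of the nested lists into ordinary columns.
# A list() path walks down through nested levels.
flat <- hoist(
  df, result,
  id          = "id",
  fold_change = list("stats", "foldChange"),
  p_value     = list("stats", "pValue")
)
```

With the real data, tibble(result = res[["results"]]) would take the place of the made-up list, and any fields not hoisted stay behind in the result list-column.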
