Selenium - parsing a page takes too long

Milano

I work with Selenium in Python 2.7. I understand that loading a page takes far longer than with raw requests, since the browser simulates everything, including JavaScript execution.

What I don't understand is why parsing an already loaded page also takes so long.

Every time a page is loaded, I find all the tags meeting some condition (about 30 div tags) and then pass each tag as an argument to a parsing function. For parsing I use CSS selectors and similar methods, e.g.: on.find_element_by_css_selector("div.carrier p").text
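
Roughly, the loop looks like this (a sketch with a placeholder URL and class names, not my real ones):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/results")  # placeholder URL

# about 30 container divs per page
rows = driver.find_elements_by_css_selector("div.row")

for row in rows:
    # each of these lookups goes back into the browser
    carrier = row.find_element_by_css_selector("div.carrier p").text
    price = row.find_element_by_css_selector("div.price span").text  # hypothetical second field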

As far as I understand, once the page is loaded, its source code is held in RAM (or somewhere else locally), so parsing should take milliseconds.

EDIT: I bet that parsing the same source code with BeautifulSoup would be more than 10 times faster, but I don't understand why.

Do you have any explanation? Thanks

alecxe

These are different tools for different purposes. Selenium is a browser automation tool with a rich set of techniques for locating elements; BeautifulSoup is an HTML parser. When you find an element with Selenium, that is not HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.

When you ask Selenium to find an element, say with find_element_by_css_selector(), an HTTP request is sent to the /session/$sessionId/element endpoint per the JSON Wire Protocol. Your Selenium Python client then receives the response and, if nothing went wrong, returns you a WebElement instance. Think of it as a real-time, dynamic thing: you are getting a real web element that is "living" in the browser, and you can control and interact with it.
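
You can observe this yourself by timing the calls; every single lookup is a separate round trip between the Python client and the browser (a rough sketch, the selector and the count are just examples):

import time

start = time.time()
for _ in range(30):
    driver.find_element_by_css_selector("div.carrier p")
print(time.time() - start)  # 30 HTTP round trips over the wire protocol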

With BeautifulSoup, once you have downloaded the page source, there is no network component anymore and no real-time interaction with the page and its elements; only HTML parsing is involved.
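
For comparison, here is what the same lookup looks like once the source is handed to BeautifulSoup; after the single .page_source transfer, everything is local tree traversal (a sketch, assuming bs4 and lxml are installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "lxml")  # one transfer of the HTML

# no browser and no network from here on - just in-memory parsing
for p in soup.select("div.carrier p"):
    print(p.get_text())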


In practice, if you are doing web scraping, need a real browser to execute JavaScript and handle AJAX, and are doing complex HTML parsing afterwards, it makes sense to grab the desired .page_source and feed it to BeautifulSoup or, even better in terms of speed, lxml.html.

Note that in cases like this you usually don't need the complete HTML source of the page. To make the HTML parsing faster, you can feed the "inner" or "outer" HTML of just the page block containing the desired data to the HTML parser of your choice. For example:

from bs4 import BeautifulSoup

# fetch only the container's outer HTML, then the browser is no longer needed
container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()

soup = BeautifulSoup(container, "lxml")
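
The same idea with lxml.html directly, which skips a layer and is typically the fastest of the three (the selector is again just an example, and the cssselect package needs to be installed for .cssselect() to work):

import lxml.html

tree = lxml.html.fromstring(container)
texts = [p.text_content() for p in tree.cssselect("div.carrier p")]

Either way, the browser is involved exactly once, to produce the HTML, and the heavy per-element work happens in-process.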
