Selenium - parsing a page takes too long

Milano

I work with Selenium in Python 2.7. I understand that loading a page takes far longer than with raw requests, since the browser simulates everything, including JavaScript execution.

What I don't understand is why parsing an already loaded page also takes so long.

Every time a page is loaded, I find all the tags meeting some condition (about 30 div tags) and then pass each tag as an argument to a parsing function. For parsing I use CSS selectors and similar methods, e.g.: on.find_element_by_css_selector("div.carrier p").text
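
Roughly, the loop looks like this (a sketch with a placeholder URL and class names, not my real ones):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/results")  # placeholder URL

# about 30 container divs per page
rows = driver.find_elements_by_css_selector("div.row")

for row in rows:
    # each of these lookups goes back into the browser
    carrier = row.find_element_by_css_selector("div.carrier p").text
    price = row.find_element_by_css_selector("div.price span").text  # hypothetical second field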

As far as I understand, once the page is loaded, its source code is held in RAM (or somewhere else locally), so parsing should take milliseconds.

EDIT: I bet that parsing the same source code with BeautifulSoup would be more than 10 times faster, but I don't understand why.

Do you have any explanation? Thanks

alecxe

These are different tools for different purposes. Selenium is a browser automation tool with a rich set of techniques for locating elements; BeautifulSoup is an HTML parser. When you find an element with Selenium, that is not HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.

When you ask Selenium to find an element, say with find_element_by_css_selector(), an HTTP request is sent to the /session/$sessionId/element endpoint per the JSON Wire Protocol. Your Selenium Python client then receives the response and, if nothing went wrong, returns you a WebElement instance. Think of it as a real-time, dynamic thing: you are getting a real web element that is "living" in the browser, and you can control and interact with it.
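
You can observe this yourself by timing the calls; every single lookup is a separate round trip between the Python client and the browser (a rough sketch, the selector and the count are just examples):

import time

start = time.time()
for _ in range(30):
    driver.find_element_by_css_selector("div.carrier p")
print(time.time() - start)  # 30 HTTP round trips over the wire protocol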

With BeautifulSoup, once you have downloaded the page source, there is no network component anymore and no real-time interaction with the page and its elements; only HTML parsing is involved.
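
For comparison, here is what the same lookup looks like once the source is handed to BeautifulSoup; after the single .page_source transfer, everything is local tree traversal (a sketch, assuming bs4 and lxml are installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, "lxml")  # one transfer of the HTML

# no browser and no network from here on - just in-memory parsing
for p in soup.select("div.carrier p"):
    print(p.get_text())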


In practice, if you are doing web scraping, need a real browser to execute JavaScript and handle AJAX, and are doing complex HTML parsing afterwards, it makes sense to grab the desired .page_source and feed it to BeautifulSoup or, even better in terms of speed, lxml.html.

Note that in cases like this you usually don't need the complete HTML source of the page. To make the HTML parsing faster, you can feed the "inner" or "outer" HTML of just the page block containing the desired data to the HTML parser of your choice. For example:

from bs4 import BeautifulSoup

# fetch only the container's outer HTML, then the browser is no longer needed
container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()

soup = BeautifulSoup(container, "lxml")
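
The same idea with lxml.html directly, which skips a layer and is typically the fastest of the three (the selector is again just an example, and the cssselect package needs to be installed for .cssselect() to work):

import lxml.html

tree = lxml.html.fromstring(container)
texts = [p.text_content() for p in tree.cssselect("div.carrier p")]

Either way, the browser is involved exactly once, to produce the HTML, and the heavy per-element work happens in-process.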
