Selenium + Python: An alternate way of web scraping.

Selenium has long been a popular choice among web developers for testing their applications before launch, but it can also be used to collect data.

Some sites on the Internet require a lot of manual interaction that most traditional scraping methods fail to reproduce. What to do then? Use Selenium!


Is it easy to use? Pretty much.



Prerequisites:


Installing Selenium:

You can download the Python bindings for Selenium from the PyPI page for the selenium package. However, a better approach is to install the package with pip. The Python 3.6 installers ship with pip, so you can install selenium like this:



pip install selenium


For Windows (most Linux distributions already come with a working Python):



  1. Install Python using the installer available on the python.org download page.
  2. Start a command prompt using the cmd.exe program and run the pip command given below to install selenium.

C:\Python36\Scripts\pip.exe install selenium

If the above works, you can now run your test scripts using Python.

Note: You can also install the Selenium remote web driver, but in most cases you will not need it! If you wish to do it anyway, here are some instructions.
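
Before moving on, it can help to confirm that the package really landed in the Python you intend to use. Here is a minimal sketch that relies only on the standard library; the helper name selenium_version is made up for this example.

```python
import importlib.util


def selenium_version():
    """Return the installed selenium version string, or None if it is missing."""
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("selenium") is None:
        return None
    import selenium
    # The package normally exposes __version__; fall back if it does not.
    return getattr(selenium, "__version__", "unknown")


version = selenium_version()
if version is None:
    print("selenium is not installed for this interpreter")
else:
    print("selenium " + version + " is installed")
```

Run it with the same python.exe whose pip you used above, otherwise you may be checking a different installation.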

Installing webdriver:

But wait, what exactly is a web driver?
Good question. The Python code is the pilot and the web driver is the plane. Web drivers are named after web browsers (Chrome / Firefox / Edge), but they are not browsers themselves: each one is a small program that controls the real browser on behalf of your script. Comment below if you want a separate article on that.
So to scrape a site through human-like interaction, you will need a web driver in addition to the browser itself.

Visit Selenium HQ's download page and locate: Third Party Drivers, Bindings, and Plugins

Here you will find a list of currently available web drivers.

Let's assume you have decided to try out chromedriver. Download the file and place it in a folder that is on your PATH.

What are PATH folders and how do you locate them?
Run this in a command prompt:
  
echo %PATH:;=&echo.%

You will get a list like this:

C:\Windows\system32
C:\Windows
C:\Windows\System32\Wbem
C:\Windows\System32\WindowsPowerShell\v1.0\
C:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static


If your list looks different from this, do not worry; it is going to. Pick one folder that you think will never be deleted, place the driver there, and forget about it! (No! Do write it down somewhere. You will need to know where it is so that you can replace it with a new file when updating.)

Let's get down to business
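
The same check can also be done from Python. This is a sketch that uses only the standard library; the function names path_folders and find_driver are illustrative, and it assumes the driver executable is called chromedriver.

```python
import os
import shutil


def path_folders():
    """Return the folders listed in the PATH environment variable."""
    raw = os.environ.get("PATH", "")
    # os.pathsep is ';' on Windows and ':' on Linux and macOS.
    return [folder for folder in raw.split(os.pathsep) if folder]


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it can be found via PATH, else None."""
    return shutil.which(name)


for folder in path_folders():
    print(folder)

location = find_driver()
if location is None:
    print("chromedriver was not found via PATH")
else:
    print("chromedriver found at " + location)
```

If the last line reports that the driver was not found, the file is probably sitting in a folder that is not in the printed list.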

Your first selenium script! I am excited, are you?

from time import sleep
from selenium import webdriver

driver = webdriver.Chrome('chromedriver')  # 'chromedriver' is the driver executable, found via PATH
driver.get('http://www.google.com')
sleep(3)  # Pause execution to see in slow motion.
search_box = driver.find_element_by_name('q')  # Selenium 3 API; removed in Selenium 4.3
search_box.send_keys('I am loving selenium!')
# Submit the form that contains the search box.
search_box.submit()
sleep(5)  # Pause execution to see the search text entered.
driver.quit()

Wait a second! That's a lot of code, what just happened there?

A lot of things, to be honest.

The first line imports the sleep function from the time module so that we can pause execution for a moment and follow what is happening on the page. Otherwise the web driver would do everything at computer speed and we would not be able to see what is going on.

The second line imports the web driver.

On the 4th line, we instantiate a web driver, and on the next line we order it to go to Google's homepage.

On the 6th line, we ask the computer to take a break of 3 seconds while we gaze at Google's homepage.

Here we locate the search bar by its name (it's q). We assign that element to a variable called search_box and use the send_keys function to send keystrokes (type in) the text 'I am loving selenium!':

search_box = driver.find_element_by_name('q')
search_box.send_keys('I am loving selenium!')

We do want to be able to hit the search button, but how? The next line of code does exactly that.
It calls the submit function, which submits the form that contains the search box! This avoids the hassle of locating the search button's element id and clicking on it. (Don't worry, it's definitely possible if you want to, though.)

On the last line:

driver.quit()

We close the web driver session and release the resources allocated to it. That's it, you just searched on Google using a custom bot. Kudos, give yourself a pat on the back!
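
One caveat: the script above uses the Selenium 3 API. Selenium 4.3 removed the find_element_by_* helpers, and newer Selenium 4 releases no longer take the driver name as the first argument to webdriver.Chrome. Below is a hedged sketch of the same steps against Selenium 4. Wrapping them in a function called search_google and adding try/finally are choices made for this example, not part of the original script, and it assumes chromedriver can be found on your PATH.

```python
from time import sleep


def search_google(query, pause=3):
    """Open Chrome, search Google for `query`, then close the browser."""
    # Imported inside the function so this file can still be loaded on a
    # machine where selenium has not been installed yet.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is reachable via PATH
    try:
        driver.get('http://www.google.com')
        sleep(pause)  # Pause execution to see in slow motion.
        # Selenium 4 replacement for find_element_by_name('q').
        search_box = driver.find_element(By.NAME, 'q')
        search_box.send_keys(query)
        search_box.submit()
        sleep(pause)  # Pause again to look at the result.
    finally:
        # Runs even if a step above fails, so no browser window is left open.
        driver.quit()
```

Calling search_google('I am loving selenium!') should replay the walkthrough above. The name q for the search box is taken from the original script; if Google changes its page, that lookup is the line to adjust.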

Now What?

You have just opened up a new universe of opportunity! There are few limits to what you can do with web scraping. Many startups and established businesses do it, and it's not going away.

You can build apps around real customer data, find new acquisitions, validate your users, automate routine tasks; the list goes on.

You can further explore our blog for interesting reads, or you can contact us to learn a bit more over a FREE personal Skype coaching session. Just click on "Leave a message" and reach out to us. We get a lot of volume these days, so FREE sessions won't be here for long. Grab this opportunity while you can!

Disclaimer: We are not lawyers; we are simply programmers who happen to be interested in web scraping. You should seek out appropriate professional legal advice regarding your local, federal and state laws before starting on a scraping venture.
