[python] Analysing HTML code using BeautifulSoup

OK, so it's been a while since my last post, but I've been so busy with work and college that I literally had no time to write anything. Anyway, I've been looking to buy a new phone, and I've noticed that some great phones go on sale very cheaply on some classifieds websites. The only problem is that they get sold in less than 10 minutes, so I decided to write a Python script that plays a warning as soon as a phone matching what I want comes on sale.

The first thing I had to do was read the HTML code of the target page that contains the ads:

import urllib2
usock = urllib2.urlopen("http://www.donedeal.ie/find/phones/for-sale/Ireland/") 
source = usock.read()
usock.close()
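
A side note for anyone on Python 3, where urllib2 no longer exists: a rough equivalent can be written with urllib.request. This is only a minimal sketch. The helper name fetch_html, the timeout value and the UTF-8 fallback are my own assumptions, not part of the original script.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper name)."""
    # The with-block closes the connection even if reading fails.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares; assume UTF-8 if it declares none.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html with the URL used above would give a text version of the same page source.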

Now the source code is saved in the variable “source”. Next I need to parse the HTML so that I can search for the phones I want and make sure they are within budget. BeautifulSoup seems to work perfectly for this.

First, import it:

from BeautifulSoup import BeautifulSoup

To make things easier I separated the table in the middle from the rest of the code and then analysed each row separately, as each row represents a different ad.

To separate a certain HTML tag from the source we can use the findAll method in BeautifulSoup. First, let's parse the whole page with BeautifulSoup:

search_table = BeautifulSoup(source)

Then I look for the divs in the middle of the page that have the class “text”:

rows = search_table.body.findAll('div', attrs={'class':'text'})

Now the variable rows contains the HTML of all the ads, without the top and bottom of the page, just the ad rows. All I need to do now is go through each row and read the title, the price and how long it's been on sale (because I'm only interested in the new ads).

To read each row on its own I used a for loop as follows:

for line in rows:

In this loop I used BeautifulSoup again to read the title, price, date and URL from each row. In my example the developer is using <span>s for the price and date, so reading them is straightforward using findAll:

price = line.findAll(name = 'span' , attrs={'class':'price'})
dt = line.findAll(name = 'span' , attrs={'class':'publishDate'})

However, the title is a bit trickier as it is inside an <a> tag within the <span>, so here is how I read it:

header = line.findAll(name = 'span' , attrs={'class':'header'}) # the span which contains the <a> tag
title = header[0].find('a').text # '.text' reads what's between the <a> and </a> tags
link = header[0].find('a')['href'] # URL; you can replace 'href' with any attribute name inside the selected tag to read the value of that attribute
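
To see the row parsing work end to end without hitting the website, here is a rough, self-contained sketch. It assumes Python 3 and the newer bs4 package (the import above is the older BeautifulSoup 3), and the sample HTML is made up: it only mimics the class names mentioned in this post, so the real page markup may well differ. The function name extract_ads is hypothetical too.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the class names used in this post.
SAMPLE_HTML = """
<html><body>
<div class="text">
  <span class="header"><a href="/phones/1">Nokia N95</a></span>
  <span class="price">EUR 120</span>
  <span class="publishDate">3 mins</span>
</div>
<div class="text">
  <span class="header"><a href="/phones/2">Apple iPhone 4</a></span>
  <span class="price">EUR 300</span>
  <span class="publishDate">12 mins</span>
</div>
</body></html>
"""


def extract_ads(html):
    """Return one dict per ad row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    # find_all is the bs4 spelling of findAll.
    for row in soup.find_all("div", attrs={"class": "text"}):
        header = row.find("span", attrs={"class": "header"})
        anchor = header.find("a") if header else None
        price = row.find("span", attrs={"class": "price"})
        date = row.find("span", attrs={"class": "publishDate"})
        # Skip rows that lack any of the expected tags.
        if anchor is None or price is None or date is None:
            continue
        ads.append({
            "title": anchor.get_text(strip=True),
            "link": anchor["href"],
            "price": price.get_text(strip=True),
            "published": date.get_text(strip=True),
        })
    return ads


for ad in extract_ads(SAMPLE_HTML):
    print(ad["title"], ad["price"], ad["published"], ad["link"])
```

Using find instead of findAll here returns the first matching tag directly, which avoids indexing into a list with [0].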

And that's pretty much it. All I did after that was check the title for the types of phone I'm looking for and check the price; if it's within budget and the ad is less than 6 minutes old, the script plays a warning and prints the ad on screen.

Here is the full program (make sure you have an mp3 file called ‘alert.mp3’ for it to play when a match is found):
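
The check itself could look roughly like the sketch below. The keyword list, the budget and the text formats (a price such as "EUR 120", an age such as "3 mins") are all assumptions for illustration; the real ad text may be formatted differently, and playing the sound is left out.

```python
import re

WANTED_KEYWORDS = ("iphone", "galaxy")  # hypothetical phone names
BUDGET = 150                            # hypothetical maximum price
MAX_AGE_MINUTES = 6                     # the ad must be newer than this


def parse_price(text):
    """Return the first whole number in a price string, or None."""
    found = re.search(r"\d[\d,]*", text)
    return int(found.group(0).replace(",", "")) if found else None


def parse_age_minutes(text):
    """Return the age in minutes for text like '3 mins', or None."""
    found = re.search(r"(\d+)\s*min", text.lower())
    return int(found.group(1)) if found else None


def is_match(title, price_text, age_text):
    """True if the ad is a wanted phone, within budget and new enough."""
    price = parse_price(price_text)
    age = parse_age_minutes(age_text)
    # Anything that cannot be parsed is treated as not matching.
    if price is None or age is None:
        return False
    wanted = any(word in title.lower() for word in WANTED_KEYWORDS)
    return wanted and price <= BUDGET and age < MAX_AGE_MINUTES
```

Treating unparseable text as a non-match means an ad listed in hours rather than minutes is simply skipped, which fits the goal of only catching brand new ads.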
