How To Crawl Coupon Sites With Python

In this post, I will show you how to use Python and lxml to crawl coupons and deals from coupon sites. The purpose of this post is to help users write crawlers with Python.
To demonstrate this, I will crawl coupons from couponannie.com and couponmonk.us.

Example 1

Let us start with couponannie.com.
First, let us import the following two libraries.
import requests
import lxml.html
Most coupon sites have thousands of coupon pages, typically one per company or store. These pages are built from structured templates, so a crawler written for one coupon page should work for all of them. This is the case for couponannie as well.
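To make this concrete, here is a minimal sketch of the overall shape such a crawler takes: one fetch-and-parse function applied to any store page. The second store slug below is a made-up placeholder, and the extraction step is filled in over the rest of this section.
import requests
import lxml.html

def fetch_store_page(url):
    # Fetch a coupon store page and return the parsed lxml tree.
    obj = requests.get(url)
    return lxml.html.fromstring(obj.text)

# Because the pages share one template, the same function works for any store.
# 'another-store' is a hypothetical slug used only for illustration.
for slug in ['linkfool', 'another-store']:
    root = fetch_store_page('https://www.couponannie.com/stores/%s' % slug)
    # ... extract coupons from root (shown step by step below) ...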
Let us pick the following URL, couponannie.com/stores/linkfool, and extract the coupons and their related information.
url = 'https://www.couponannie.com/stores/linkfool'
We will use requests to get the content of the above page as shown below.
obj = requests.get(url)
Let us convert the data into a form that lxml can understand.
root = lxml.html.fromstring(obj.text)
If you look at the page, the coupons are presented in list form.
In the Chrome browser, right-click on the list of coupons and select "Inspect" from the pop-up menu. A developer tools panel will open at the bottom or right side of the screen.
In the developer console, you will see a ul element with id="rectStoreCard". You will notice that all the coupons are present as list elements under this ul tag.
We can get hold of these li elements as shown below.
len(root.xpath('//ul[@id="rectStoreCard"]/li'))
19
As we see above, there are 19 list items, i.e. coupons, under the ul tag. For this example, let us grab the first one.
elem = root.xpath('//ul[@id="rectStoreCard"]/li')[0]
Ok, now we have the first element. We can extract all of its sub-elements.
Let us first get the coupon description. The description is inside a p tag within the div element with class="desc".
elem.xpath('.//div[@class="desc"]/p')[0].text_content().strip()
'Enjoy Up to 25% Off on this Flash Sale'
Ok, let us see how we can extract the coupon code. Extracting the coupon code is tricky. If you notice, to get the coupon code you first need to click the "Get Code" button, and only then does the site show the code. This functionality is implemented with JavaScript, so performing the click would require Selenium. However, for this site there is an easier way too.
If you look carefully, each coupon item has a "see details" section with class="see-detail-con". We can get all the details of a coupon as shown below.
elem.xpath('.//div[@class="see-detail-con"]')[0].text_content().strip()
'Find Enjoy Up to 25% Off on this Flash Sale via coupon code “YHNWFL25”. Apply this promo code at checkout. Discount automatically applied in cart. Exclusions Apply.'
In the above details, the coupon code is also given. Of course, we need to parse the text to extract the code, as in the sketch below.
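Since the code appears between curly quotes in the details text, a simple regular expression can pull it out. This is a minimal sketch; it assumes the code is an uppercase alphanumeric token wrapped in quotes, which may not hold for every coupon on the site.
import re

details = elem.xpath('.//div[@class="see-detail-con"]')[0].text_content().strip()

# Look for a token wrapped in curly (or straight) quotes, e.g. “YHNWFL25”.
# Assumption: codes consist of uppercase letters and digits only.
match = re.search(r'[“"]([A-Z0-9]+)[”"]', details)
coupon_code = match.group(1) if match else None
print(coupon_code)
YHNWFL25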

Example 2

Ok, let us do one more example. In this example I will crawl a coupon page from the couponmonk.us site.
Let us crawl the page https://www.couponmonk.us/coupons-for/quizlet.com/
url1 = 'https://www.couponmonk.us/coupons-for/quizlet.com/'
obj1 = requests.get(url1)
root1 = lxml.html.fromstring(obj1.text)
Let us find the HTML element of a coupon listing on the above page. It looks like this site does not use a ul or list element; instead, each coupon item is a div element with class="card flex-row flex-wrap".
len(root1.xpath('.//div[@class="card flex-row flex-wrap"]'))
6
As we see above, there are 6 such elements on this page at the time of writing this code. Let us grab the first element.
elem1 = root1.xpath('.//div[@class="card flex-row flex-wrap"]')[0]
The text of the coupon is inside the p elements of the above div tag. Let us grab the text out of these p tags using the code below.
elem1.xpath('.//p')
[<Element p at 0x7f516e2ebfb0>, <Element p at 0x7f516e2eb170>]
Ok, so there are two p tags. Let us check the content of each p tag.
elem1.xpath('.//p')[0].text_content().strip()
'Practice Questions And More, Get 20% Off With Code! - Try code MOMETRIX30'

elem1.xpath('.//p')[1].text_content().strip()
'Added on 2020-June-30'
Ok, the first p tag contains the coupon description and the second p tag contains the date when the coupon was added. The date string can be parsed as shown below.
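Here is a minimal sketch that turns the date string into a proper date object, assuming the "Added on YYYY-Month-DD" format seen above is used consistently across the site.
from datetime import datetime

added_text = elem1.xpath('.//p')[1].text_content().strip()
# 'Added on 2020-June-30' -> strip the prefix, then parse the date.
date_str = added_text.replace('Added on ', '')
added_date = datetime.strptime(date_str, '%Y-%B-%d').date()
print(added_date)
2020-06-30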
Extracting the coupon code, however, involves clicking the "show coupon" button, which requires Selenium; the rest of the information can be extracted using the techniques shown above.
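For completeness, here is a rough Selenium sketch of how that click might look. The XPath below is an assumption based on the visible "show coupon" text, not a verified selector for this site, so check it in the inspector before relying on it.
import lxml.html
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.couponmonk.us/coupons-for/quizlet.com/')

# Assumption: the reveal button contains the text "show coupon";
# the exact text and casing may differ on the live page.
button = driver.find_element(By.XPATH, '//*[contains(text(), "show coupon")]')
button.click()

# After the click, re-parse the updated page with lxml as before
# and extract the revealed code from it.
root = lxml.html.fromstring(driver.page_source)
driver.quit()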

Wrap Up!

Now it is time to wrap up this post. I hope this post has given you enough starting material for writing scrapers using Python and lxml.
