Writing a web crawler with Scrapy and Scrapinghub

A web crawler is an interesting way to obtain information from the vastness of the internet. A large amount of the world’s data is unstructured. Websites are a rich source of unstructured text that can be mined and turned into useful insights. The process of extracting such information from websites is referred to as web scraping. The extracted information can then be used in several useful ways.
There are several good open source web-scraping frameworks, and we can write a web crawler using any of them. Two such frameworks are Scrapy and Heritrix.

Scrapy stands out from the rest for several reasons:

  • Easy to set up and use
  • Great documentation
  • Built-in support for proxies, redirection, authentication, cookies, user-agents and others
  • Built-in support for exporting to CSV, JSON and XML

This article will walk you through installing Scrapy, writing a web crawler to extract data from a site and analyzing it. The steps were written on Ubuntu, but they should work on other Linux distributions too.


The Scrapy framework is developed in Python, which comes preinstalled on Ubuntu and almost all Linux distributions. Still, make sure Python is installed with the help of the command below:



To get the latest Scrapy version, we will install it with pip, the Python package manager. To install pip on Ubuntu along with the needed dependencies, use the following command:


Installing Scrapy

After installing pip, install Scrapy with the command below:


To make sure that Scrapy is installed correctly, use the following command:


The result should be something like the following:
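
Independently of the command line, the installation can also be checked from Python itself. The sketch below is an assumption of this guide rather than one of the original steps: it uses only the standard library to look for the scrapy package and report its version, and the function name scrapy_status is hypothetical.

```python
from importlib import metadata, util


def scrapy_status():
    """Return the installed Scrapy version string, or None if Scrapy is not importable."""
    # find_spec looks the package up without actually importing it.
    if util.find_spec("scrapy") is None:
        return None
    try:
        return metadata.version("scrapy")
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found.
        return "unknown"


if __name__ == "__main__":
    version = scrapy_status()
    if version is None:
        print("Scrapy is not installed")
    else:
        print(f"Scrapy is installed (version: {version})")
```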


Using Scrapy:

This article will walk you through these tasks:

  1. Scraping walk through
  2. Creating a new Scrapy project
  3. Creating a new spider
  4. Defining the crawl fields for your web crawler
  5. Extracting the data
  6. Launch the spider

Scraping walk through:

For this guide, we will use an example in which we extract product information from one of the largest e-commerce sites in India. We will start with the product list, follow the links to each product and scrape some data from each page.

Use a plugin such as Firebug or Firepath, or your browser's built-in developer tools, to determine the selectors for the product title and other necessary information. Alternatively, have a look at the code in the next section to view the selector values.

Creating a new Scrapy project:

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:


This will create the following directory and files:

[Image: Web crawler example 2]

There are several files and one folder named “spiders”. To keep things simple, just concentrate on two things: the file and the spiders folder.

Creating a new spider:

After creating the project, generate a spider with the following command:


This creates the file inside the spiders folder.

Your initial spider looks like this:


This is where we will tell Scrapy how to find the exact data we’re looking for. As you can imagine, this is specific to each individual web page that you wish to scrape.
The first few variables are self-explanatory (see the Scrapy docs):

  • name defines the name of the Spider.
  • allowed_domains lists the domains the spider is allowed to crawl.
  • start_urls is a list of URLs for the spider to start crawling from.
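
To make the role of allowed_domains concrete, here is a small standard-library sketch of the kind of check it implies: a URL is in scope if its host is one of the allowed domains or a subdomain of one. This only illustrates the idea and is not Scrapy's own implementation; the function name is_allowed and the example.com URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# A subdomain of an allowed domain is in scope; an unrelated domain is not.
print(is_allowed("https://shop.example.com/products", ["example.com"]))  # True
print(is_allowed("https://example.org/", ["example.com"]))  # False
```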

Defining your crawl fields:
When working with Scrapy, you must specify what you want to get from the crawl. This is called an item (model). To do this, open the file and add the fields you need:
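
A minimal sketch of such an item, assuming the two fields used later in this guide (title and price). Scrapy projects commonly declare items as scrapy.Item subclasses; this sketch uses a standard-library dataclass instead, which Scrapy 2.2 and later also accept as an item type, so it runs without Scrapy installed. The class name ProductItem and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProductItem:
    """One scraped product; both fields stay None until the spider fills them."""
    title: Optional[str] = None
    price: Optional[str] = None  # kept as text, exactly as it appears on the page


# Example: build an item and view it as a plain dictionary.
item = ProductItem(title="Sample product", price="499")
print(asdict(item))
```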


Extracting the data:

We need to parse the response and scrape the data we want, which sits inside specific DOM elements. Again, update the spider as shown below:


Here, we iterate through the product list and assign the product title and price values from the scraped data.
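
The iteration described above can be sketched as a small generator function. This is an illustration under assumptions, not the code from the original listing: the CSS selectors (div.product, a.title, span.price) and the function name parse_products are hypothetical and must be replaced with the selectors found on the real page. The function only relies on the response offering .css() and on each selected node offering .css(...).get(), as Scrapy responses and selectors do.

```python
def parse_products(response):
    """Yield one dict per product found on a listing page."""
    # Hypothetical selectors; replace them with the ones from the target site.
    for product in response.css("div.product"):
        title = product.css("a.title::text").get()
        price = product.css("span.price::text").get()
        if title is None:
            continue  # skip nodes that have no product title
        yield {
            "title": title.strip(),
            "price": price.strip() if price else None,
        }
```

Inside a spider, the parse method could delegate to it with `yield from parse_products(response)`.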

Launch the spider:

To launch your spider, simply run the following command:


You can also export the output as a JSON file with the command below:
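
Once such a JSON file exists, it can be loaded for the analysis step with the standard library. The sketch below assumes the export is a JSON array of objects with title and price keys; the function name summarize_items and the file name used in the note after it are hypothetical.

```python
import json
from collections import Counter


def summarize_items(path):
    """Load exported items and report a few basic data-quality counts."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # expected: a list of dicts, one per scraped item

    # Items whose price is absent, None or empty.
    missing_price = sum(1 for item in items if not item.get("price"))

    # Titles that occur more than once, which may indicate duplicate pages.
    title_counts = Counter(item.get("title") for item in items)
    duplicates = sorted(t for t, n in title_counts.items() if t and n > 1)

    return {
        "total": len(items),
        "missing_price": missing_price,
        "duplicate_titles": duplicates,
    }
```

For example, calling summarize_items("items.json") on an export saved under that name returns a dictionary with the total item count, the number of items without a price and the list of repeated titles.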


Deploying the project:

Writing a web crawler is just the beginning: you still need to deploy and run your crawler periodically, manage servers, monitor performance, review scraped data and get notified when spiders break. This is where Scrapy Cloud comes in. Scrapy Cloud is a service of the Scrapinghub Platform. The scraped data can then be used in several ways, including loading it into private data servers.
With Scrapy, deploying is as easy as running:


For more details about the Scrapinghub platform, see the Scrapinghub Platform site.
