Implementing Web Scraping In Python Using Scrapy

 

What Is Scrapy

Scrapy is an application framework for crawling websites and extracting structured data from them. In this post, we explore Scrapy by implementing web scraping in Python with it.

This post covers the following topics:

  1. How To Install Scrapy
  2. Create A Scrapy Project
  3. Export Scraped Data As CSV

Scrapy runs on Python 2.7 and on Python 3.4 or above. If you’re using Anaconda, you can install the package from the conda-forge channel, which provides packages for Linux, Windows and OS X.
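
As a minimal sketch, the version requirement stated above can be checked from Python itself before installing. The helper name python_supported is hypothetical, and the thresholds simply encode the 2.7 / 3.4+ rule from this post. Newer Scrapy releases have raised the minimum Python version, so check the release notes for the version you install.

```python
import sys


def python_supported(version=None):
    """Return True if `version` satisfies the rule stated in this post.

    `version` is a (major, minor) tuple; it defaults to the running interpreter.
    The thresholds come from the text above, not from Scrapy itself.
    """
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) == (2, 7) or (major, minor) >= (3, 4)


# Report whether the interpreter running this script qualifies.
print(python_supported())
```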

How To Install Scrapy

You can install Scrapy either with conda or, if you’re already familiar with installing Python packages, from PyPI along with its dependencies.

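Install Scrapy Using Anaconda

With Anaconda or Miniconda, install the package from the conda-forge channel by running: conda install -c conda-forge scrapy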

Install Scrapy Using PyPI

With pip, run: pip install Scrapy

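Install Scrapy On Ubuntu 14.04 Or Above

On Ubuntu 14.04 and above, first install the development packages that Scrapy’s dependencies build against: libxml2-dev, libxslt1-dev, zlib1g-dev, libffi-dev and libssl-dev, along with the Python development headers and pip. All of them can be installed with apt-get.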

 

Install Scrapy On Python 3

If you want to install Scrapy on Python 3, you’ll also need the Python 3 development headers, which Ubuntu packages as python3-dev.

 

Inside a virtualenv, you can install Scrapy with pip by running pip install scrapy.
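
The same step can also be scripted with the standard library. This is only a sketch: the environment directory scrapy-env and the helper names are assumptions made for illustration, and the install itself needs network access.

```python
import os
import subprocess
import venv


def env_python(env_dir):
    """Path to the interpreter inside a virtual environment directory."""
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    return os.path.join(env_dir, bin_dir, "python")


def pip_install_command(env_dir, package="scrapy"):
    """Build the command that runs pip with the environment's own interpreter."""
    return [env_python(env_dir), "-m", "pip", "install", package]


def create_env_with_scrapy(env_dir="scrapy-env"):
    """Create the environment (with pip available) and install Scrapy into it."""
    venv.create(env_dir, with_pip=True)
    subprocess.check_call(pip_install_command(env_dir))
```

Calling create_env_with_scrapy() creates the environment and installs Scrapy into it. Running pip through the environment's interpreter keeps the package out of the system-wide Python.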

 

Create A Scrapy Project

Before you start scraping, you need to create a Scrapy project. Switch to the directory where the project should live and run scrapy startproject followed by a project name of your choice.

 

This will create the following directory structure:
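
For reference, here is a sketch of the layout that scrapy startproject typically generates, written as a small Python helper. The project name quotes_project is a placeholder chosen for this example, and older Scrapy versions may not create middlewares.py.

```python
import os


def expected_layout(project):
    """Relative paths that `scrapy startproject <project>` typically creates."""
    package = os.path.join(project, project)
    return [
        os.path.join(project, "scrapy.cfg"),              # deploy configuration
        os.path.join(package, "__init__.py"),
        os.path.join(package, "items.py"),                # item definitions
        os.path.join(package, "middlewares.py"),          # project middlewares
        os.path.join(package, "pipelines.py"),            # item pipelines
        os.path.join(package, "settings.py"),             # project settings
        os.path.join(package, "spiders", "__init__.py"),  # spiders live here
    ]


# Print the layout for the placeholder project name.
for path in expected_layout("quotes_project"):
    print(path)
```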

 

The two most important entries to know are:

settings.py – This file holds the settings for your project.
spiders/ – This folder stores the custom spiders used in the project.
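
As a hedged illustration of what settings.py can contain: the setting names below are real Scrapy settings, but the values and the quotes_project name are assumptions chosen for this example.

```python
# settings.py (sketch; "quotes_project" is a placeholder project name)

BOT_NAME = "quotes_project"

# Where Scrapy looks for spiders, and where `scrapy genspider` puts new ones.
SPIDER_MODULES = ["quotes_project.spiders"]
NEWSPIDER_MODULE = "quotes_project.spiders"

# Respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Seconds to wait between consecutive requests to the same site (example value).
DOWNLOAD_DELAY = 1
```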


 

Create A Scrapy Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).

Here’s the code for a spider that scrapes famous quotes from website http://quotes.toscrape.com, following the pagination:

 

The spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the spider. It must be unique within a project, so you can’t give the same name to two different spiders.

start_requests(): must return an iterable of requests (a list or a generator) from which the spider begins to crawl. Subsequent requests are generated successively from these initial requests.

parse(): the method called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content.

The parse() method usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests (Request) from them.
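
To make the "finds new URLs to follow" step concrete, here is a small standard-library sketch of how a relative pagination link is resolved against the current page URL before a new request is made. The helper name next_page_url is hypothetical. Inside a spider, Scrapy's response.urljoin() and response.follow() handle this resolution.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a (possibly relative) link against the page it was found on."""
    if not href:
        return None  # no "Next" link, so there is nothing more to follow
    return urljoin(current_url, href)


print(next_page_url("http://quotes.toscrape.com/page/1/", "/page/2/"))
```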

How To Run Spider From Scrapy

To put the spider to work, go to the project’s top-level directory and run the scrapy crawl command followed by the spider’s name.

 

This command runs the spider and prints Scrapy’s crawl log, including the scraped items, to the terminal.

 


 

Export Scraped Data As CSV

We could read all the scraped data straight from the command line, but it is usually better to export it in a format such as CSV, JSON or XML. This saves a lot of time, and the exported file can be imported into whatever other program needs it. To make this easier, Scrapy provides a built-in feature called Feed Exports, which lets you export the scraped items in various formats.

To do that, set the feed export options in the settings.py file, naming the output file and choosing csv as the format.
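
A sketch of one way to configure this, under stated assumptions: FEEDS is the feed export setting in Scrapy 2.1 and later, older versions used FEED_FORMAT and FEED_URI instead, and the file name quotes.csv is a placeholder.

```python
# settings.py (sketch; "quotes.csv" is a placeholder output file name)

# Scrapy 2.1+: map each output file to its export options.
FEEDS = {
    "quotes.csv": {"format": "csv"},
}

# Older Scrapy versions used two separate settings instead:
# FEED_FORMAT = "csv"
# FEED_URI = "quotes.csv"
```

With this in place, each dict yielded by the spider becomes one row in the CSV file the next time the spider runs.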

 

That’s all! We have exported the scraped data as CSV, and we now know how to implement web scraping using Scrapy.


Allan Watts

An active software developer with a “Can Do” attitude. With around 3.5 years of experience, Allan has developed PHP skills that allow him to craft web applications in Symfony, Laravel, CodeIgniter, JavaScript and WordPress. He is also a frequent collaborator on applications built from a designer’s perspective.
