Basic web scraping using Goutte and Symfony DomCrawler

 

Here, I am going to explain how to perform basic web scraping using Goutte and Symfony DomCrawler, and how to get machine-readable information from web pages this way. Today, most API documentation is not written by hand; it is generated from source code by tools built for this purpose. Several such tools are available, with phpDocumentor and Sami among the more popular and reliable ones.

 

Now, interestingly, we will reverse this process: instead of creating documentation from code, we will generate code from documents!

Required Installation

 

Before using DomCrawler, you need to install Goutte: https://github.com/FriendsOfPHP/Goutte

 


composer require fabpot/goutte

 

Installing Goutte also installs Symfony DomCrawler, because Goutte is built on top of Symfony components such as DomCrawler and BrowserKit. Once the installation succeeds, both are ready to use.

 

Now, let us start with a simple crawler that finds the links available on a web page.

 

Add the lines below above the class declaration in the file src/AppBundle/Controller/DefaultController.php:

 


   use Goutte\Client;
   use Symfony\Component\DomCrawler\Crawler;

 

Add the method below after the last method in the same file, src/AppBundle/Controller/DefaultController.php:

 


   /**
    * @Route("/links", name="crawler")
    */
   public function crawlerAction()
   {
       $url = "http://www.agiratech.com";
       $client = new Client();
       // Fetch the page; the returned Crawler object wraps its DOM.
       $crawler = $client->request('GET', $url);
       // Count the <a> tags on the page.
       $links_count = $crawler->filter('a')->count();
       $all_links = [];
       if ($links_count > 0) {
           // Turn every <a> tag into a Link object and collect its URI.
           $links = $crawler->filter('a')->links();
           foreach ($links as $link) {
               $all_links[] = $link->getURI();
           }
           // Drop duplicate URIs.
           $all_links = array_unique($all_links);
           echo "All available links from $url<pre>"; print_r($all_links); echo "</pre>";
       } else {
           echo "No links found";
       }
       // Stop here, because this action prints directly instead of returning a Response.
       die;
   }

 

Here, I have created a new route, http://localhost/links, for my application (http://localhost is my local domain name), and created an object of the Client class named "$client". Using this object, I call the request method to fetch the page and gather its content, as in the following line:

 


$crawler = $client->request('GET', $url);

 

The line "$crawler->filter('a')->count()" gives the number of HTML <a> tags on the page (http://www.agiratech.com).

 

 

Similarly, the line "$crawler->filter('a')->links()" returns all the links from the page as Link objects.

 

Finally, "$link->getURI()" returns the absolute URI of each individual link.
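
To try these three calls without hitting a live site, here is a minimal sketch that runs them against an inline HTML string. The sample markup, the https://example.com/ base URI and the vendor/autoload.php path are my own assumptions for illustration; they are not part of the controller above.

```php
<?php
// Assumption: Goutte was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
  <a href="/about">About</a>
  <a href="https://example.com/blog">Blog</a>
  <a href="/about">About again</a>
</body></html>
HTML;

// The second argument is the page URI; links() needs it to resolve relative hrefs.
$crawler = new Crawler($html, 'https://example.com/');

// Same calls as in the controller: count the <a> tags, then collect their URIs.
$count = $crawler->filter('a')->count();

$uris = [];
foreach ($crawler->filter('a')->links() as $link) {
    $uris[] = $link->getUri();
}

// Remove duplicates and reindex the array from 0.
$uris = array_values(array_unique($uris));

printf("%d anchors, %d unique URIs\n", $count, count($uris));
print_r($uris);
```

Building the Crawler directly from a string is handy for experimenting with selectors before pointing the code at a real page.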

Conclusion

 

The above example shows how to extract all the links from an HTML document and save them in the array "$all_links". Likewise, we can extract many other kinds of data from a web page.
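
As one illustration of extracting other data, the sketch below reads the page title and, for each link, its visible text and raw href attribute. The sample HTML and the vendor/autoload.php path are assumptions made for this example only.

```php
<?php
// Assumption: Goutte was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup standing in for a fetched page.
$html = <<<'HTML'
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/about">About</a>
    <a href="/contact">Contact us</a>
  </body>
</html>
HTML;

$crawler = new Crawler($html);

// text() returns the text content of the first matched node.
$title = trim($crawler->filter('title')->text());

// each() runs the closure once per matched node and returns the results as an array.
$rows = $crawler->filter('a')->each(function (Crawler $node) {
    return [
        'text' => trim($node->text()),
        'href' => $node->attr('href'), // raw attribute value, not resolved to an absolute URI
    ];
});

echo "Title: $title\n";
print_r($rows);
```

The same pattern should work for other selectors, such as headings or table cells, by changing the string passed to filter().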

 

In fact, much more powerful extraction is possible. For instance, in the above example, we could follow each of the links found and gather more information from those pages as required. I will cover more such extraction techniques with different examples in future blog posts. Try it out for yourself…
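
For anyone who wants a head start on following links, here is a rough sketch of one way it could look. The helper names filterSameHost and crawlTitles, the same-host restriction and the five-page cap are all my own assumptions, not anything Goutte prescribes.

```php
<?php
// Assumes Composer's autoloader is already loaded, as it is inside a Symfony app.
use Goutte\Client;

/**
 * Hypothetical helper: keep only the unique URIs whose host equals $host.
 */
function filterSameHost(array $uris, string $host): array
{
    return array_values(array_filter(
        array_unique($uris),
        function ($uri) use ($host) {
            return parse_url($uri, PHP_URL_HOST) === $host;
        }
    ));
}

/**
 * Hypothetical helper: visit up to $maxPages same-host links found on
 * $startUrl and return an array of URI => page title.
 */
function crawlTitles(Client $client, string $startUrl, int $maxPages = 5): array
{
    $crawler = $client->request('GET', $startUrl);

    // Collect the absolute URI of every link on the start page.
    $uris = [];
    foreach ($crawler->filter('a')->links() as $link) {
        $uris[] = $link->getUri();
    }

    $host = (string) parse_url($startUrl, PHP_URL_HOST);
    $titles = [];

    // Fetch each same-host page, capped at $maxPages requests.
    foreach (array_slice(filterSameHost($uris, $host), 0, $maxPages) as $uri) {
        $page = $client->request('GET', $uri);
        $title = $page->filter('title');
        $titles[$uri] = $title->count() > 0 ? trim($title->text()) : '(no title)';
    }

    return $titles;
}
```

Inside the controller, a call such as print_r(crawlTitles(new Client(), "http://www.agiratech.com")); would print the title of each visited page. The cap keeps the number of requests small, which is polite to the site being scraped.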