Generating a sitemap with Ruby on Rails and uploading it to Amazon S3

Sitemap generators allow webmasters to easily generate sitemaps for their websites instead of preparing them manually in a spreadsheet or writing a script from scratch. There are many ways to generate a sitemap for a website. For example, if you have a WordPress site, many sitemap-generating plugins are available.

 

I was working on a client project based on Ruby on Rails and had to generate a sitemap for it. Generating a sitemap is beneficial, and generating one with Ruby on Rails is a breeze for developers like us. In this article I go through the step-by-step procedure of generating a sitemap and uploading it to Amazon S3. I hope it helps you when you come across a similar situation.

Before we dive into the process of generating a sitemap, let’s understand what a sitemap actually does:

What is a sitemap?

A sitemap is a file, written according to the Sitemaps protocol, that lists your site’s URLs so that search engine crawlers can discover and index them properly. It shows how the website is organized, how each page relates to the rest of the content, and how pages are reached from one level of the hierarchy to the next. Using sitemaps, webmasters can include information about each URL, such as when it was last updated, how often it changes, and how important it is relative to other URLs on the site. This makes the crawling process more insightful.

Normally it looks like the example below. If you need more details, please check sitemaps.org.


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.agiratech.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
  </url>
  <!-- More URL definitions -->
</urlset>

The file starts with the sitemap schema definitions (shortened here), followed by all the URLs to be mapped and indexed.

We can automate this process with the help of the sitemap_generator gem. You can also build the file manually using an XML builder, or hand-craft the XML.
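To make the hand-crafted option concrete, here is a minimal sketch that builds the XML with plain Ruby string handling. The build_sitemap helper and the shape of its input hashes are our own assumptions for illustration; they are not part of any gem.

```ruby
require 'time'

# Hypothetical helper: builds a sitemap XML string from an array of hashes.
# Each entry needs :loc; :lastmod (a Time), :changefreq and :priority are optional.
def build_sitemap(entries)
  urls = entries.map do |entry|
    # String#encode(xml: :text) escapes &, < and > for use inside an element
    tags = ["    <loc>#{entry[:loc].encode(xml: :text)}</loc>"]
    tags << "    <lastmod>#{entry[:lastmod].utc.iso8601}</lastmod>" if entry[:lastmod]
    tags << "    <changefreq>#{entry[:changefreq]}</changefreq>" if entry[:changefreq]
    tags << "    <priority>#{entry[:priority]}</priority>" if entry[:priority]
    "  <url>\n#{tags.join("\n")}\n  </url>"
  end

  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #{urls.join("\n")}
    </urlset>
  XML
end

puts build_sitemap([
  { loc: 'http://www.agiratech.com/', changefreq: 'weekly', priority: 0.9 },
  { loc: 'http://www.agiratech.com/blogs/sitemap-generation', lastmod: Time.utc(2005, 1, 1) }
])
```

This is fine for a handful of static pages, but it leaves compression, file splitting and search engine pings to you, which is where a gem starts to pay off.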

Using the gem

This gem is beneficial since it follows the Sitemap 0.9 protocol. Apart from regular links, it supports image, video and geo sitemaps too.

First, start by adding this to the Gemfile:


gem 'sitemap_generator'

Once you have run bundle install, run the rake task below to generate a default config/sitemap.rb file that you can edit:


rake sitemap:install

Simple Example

Here is a simple config/sitemap.rb:


# Set the host name for URL creation
SitemapGenerator::Sitemap.default_host = "http://www.agiratech.com"

# Pick a safe place to write the files
SitemapGenerator::Sitemap.public_path = 'tmp/sitemaps/'

SitemapGenerator::Sitemap.create do
  add clients_path, priority: 0.9
  add team_path, priority: 0.8
  add about_path, priority: 1.0
  add contact_path
  add blogs_path, changefreq: 'weekly'

  # One entry per blog post, addressed by its slug
  Blog.find_each do |blog|
    add blog_path(blog.slug), lastmod: blog.updated_at, priority: 0.7, changefreq: 'never'
  end
end

There are a few things you need to note here:

  1. Set default_host to your root website URL. The search engines reading your sitemap need to know what website they are dealing with.
  2. Set public_path to tmp/sitemaps so that the sitemap files are written there before uploading.
  3. Add the URLs inside the create block. See below for more details.

Adding URLs

Call add in the block passed to create to add a path to your sitemap.

The blogs_path has changefreq set to weekly, as we want to tell site crawlers and indexers how often that index page is likely to change. If we were to publish a new blog every day, we could set it to daily.

For about_path, we’ve used the priority parameter and set it to 1.0, as we want indexers and crawlers to treat it as the most important page on the site. Keep in mind that priority is relative to the other URLs on the same site: it is a hint about which pages matter most to us, not a guarantee that the page appears first in search results.

The last addition is more interesting, as it relates to indexing dynamic content. Our blog model uses a slug in the URL, so instead of http://www.agiratech.com/blogs/1 we have http://www.agiratech.com/blogs/sitemap-generation. To get the blogs indexed the correct way, we need to add the URL of each blog, built from its slug.

Additionally, we’ve set the changefreq to never, as once a blog is published, it’s unlikely to be changed.
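Since changefreq and priority only accept certain values in the Sitemap 0.9 protocol (seven fixed changefreq keywords, and a priority between 0.0 and 1.0), a small check can catch typos before they end up in the generated file. The sketch below is plain Ruby; the sitemap_option_errors helper is a hypothetical name of ours, not something the gem provides.

```ruby
# Values allowed by the Sitemap 0.9 protocol for <changefreq>.
VALID_CHANGEFREQS = %w[always hourly daily weekly monthly yearly never].freeze

# Hypothetical helper: returns a list of problems found in the options for one URL.
def sitemap_option_errors(options)
  errors = []

  freq = options[:changefreq]
  errors << "invalid changefreq: #{freq}" if freq && !VALID_CHANGEFREQS.include?(freq.to_s)

  priority = options[:priority]
  if priority && !(0.0..1.0).cover?(priority)
    errors << "priority must be between 0.0 and 1.0, got #{priority}"
  end

  errors
end

p sitemap_option_errors(priority: 0.7, changefreq: 'never') # no problems reported
p sitemap_option_errors(priority: 1.5, changefreq: 'often') # both options are rejected
```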

Generating the sitemaps:

The gem provides a series of rake tasks to create your sitemap:


rake sitemap:create


The above task generates the compressed XML file in the folder specified by public_path.
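Because the output is gzip-compressed, it is not readable at a glance. As a quick sanity check, a sketch like the following lists the URLs it contains using only Ruby's standard Zlib library. The sitemap_locs helper is a hypothetical name, and the path in the usage comment assumes the public_path configured earlier.

```ruby
require 'zlib'

# Hypothetical helper: returns every <loc> value found in a gzip-compressed sitemap.
def sitemap_locs(path)
  xml = Zlib::GzipReader.open(path) { |gz| gz.read }
  xml.scan(%r{<loc>(.*?)</loc>}m).flatten.map(&:strip)
end

# Usage, assuming the file was written to tmp/sitemaps:
# puts sitemap_locs('tmp/sitemaps/sitemap.xml.gz')
```

A regular expression is enough here because we only want to eyeball the URLs; a real XML parser would be the safer choice for anything more involved.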


rake sitemap:refresh

The above task does the same as the previous one, but it also pings the Google and Bing search engines so they know to fetch your newly created sitemap and update their indexed information about the site. You can ping other search engines as well, as stated in the docs.

Finally, you should set up a cron job on your server to call rake sitemap:refresh as often as needed.

Uploading the sitemaps to S3

Normally, with the default configuration on a VPS, search engines should have no difficulty fetching the sitemap from your public folder. Following our example, the file would be reachable at http://www.agiratech.com/sitemap.xml.gz.

However, if our application is hosted on Heroku, we face two problems due to its ephemeral filesystem:

  1. We can’t write to the public folder. That’s why we used the tmp folder in our sitemap configuration file above.
  2. We can’t guarantee how long the files we save in the tmp folder will stay there.

To get around this, we need to host the generated sitemap somewhere else and then allow the search engines to access it. The Sitemap Generator gem offers ways to save the generated file on S3 using fog or carrierwave, so if you already use either of those in your application, you can have a look at the gem’s wiki page. However, installing Fog or Carrierwave just for this can be overkill, so here’s a way to do it that depends only on the aws-sdk gem.

Once we have the aws-sdk gem installed, we also need an Amazon S3 bucket and the proper credentials, set in the corresponding Heroku configuration panel and/or in your local environment for testing:

  • An S3 Access Key Id: ENV['S3_ACCESS_KEY_ID']
  • An S3 Secret Access Key: ENV['S3_SECRET_ACCESS_KEY']
  • The name of the bucket to use: ENV['S3_BUCKET']
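A missing variable usually only shows up as a confusing error in the middle of an upload, so it can help to check for all three up front. The following is a minimal sketch; missing_s3_settings is a hypothetical helper, and the variable names are simply the ones listed above.

```ruby
REQUIRED_S3_SETTINGS = %w[S3_ACCESS_KEY_ID S3_SECRET_ACCESS_KEY S3_BUCKET].freeze

# Hypothetical helper: returns the names of required settings that are missing or blank.
# Accepts any hash-like object so it can be exercised without touching the real ENV.
def missing_s3_settings(env = ENV)
  REQUIRED_S3_SETTINGS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_s3_settings
warn "Missing S3 settings: #{missing.join(', ')}" unless missing.empty?
```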

Once these values are available in settings.yml (the task below reads them through Settings.sitemaps.aws), we will need a rake task like the following:


namespace 'sitemap' do
  desc 'Upload the sitemap files to S3'
  task :upload_to_s3 => :environment do
    aws = Settings.sitemaps.aws
    Aws.config.update(
      region: aws.region,
      credentials: Aws::Credentials.new(aws.access_key_id, aws.access_key_secret)
    )
    s3 = Aws::S3::Client.new
    sitemap_dir = File.join(Rails.root, 'tmp/sitemaps')

    Dir.entries(sitemap_dir).each do |file_name|
      next unless file_name.include?('sitemap.xml.gz')

      # Read in binary mode, since the file is gzip-compressed
      body = File.binread(File.join(sitemap_dir, file_name))
      # Store the file under sitemaps/ so the key matches the URL we ping later
      key = "sitemaps/#{file_name}"
      s3.put_object(bucket: aws.bucket, key: key, body: body, acl: 'public-read')
      puts "Saved to S3: #{aws.bucket}/#{key}"
    end
  end
end

Using the above task, we write the file to our remote bucket under a sitemaps folder. The bucket should be configured as writable in your AWS panel.
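If you want to see which files would be uploaded, and under which keys, before touching S3, the selection step can be pulled out into a small function. This is only a sketch: sitemap_upload_plan is a hypothetical helper, the sitemaps prefix mirrors the remote folder described above, and the sitemap*.xml.gz pattern is an assumption that also picks up numbered files in case the generator splits a large site across several of them.

```ruby
# Hypothetical helper: maps each S3 key to the local sitemap file that would be stored there.
def sitemap_upload_plan(dir, prefix: 'sitemaps')
  Dir.glob(File.join(dir, 'sitemap*.xml.gz')).sort.to_h do |path|
    ["#{prefix}/#{File.basename(path)}", path]
  end
end

# Usage, assuming the files were written to tmp/sitemaps:
# sitemap_upload_plan('tmp/sitemaps').each { |key, path| puts "#{path} -> #{key}" }
```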

Finally, we need a rake task that we can schedule in cron and that takes care of everything: creating the sitemap, uploading it to S3 and pinging the search engines:


Rake::Task["sitemap:create"].enhance do
  if Rails.env.production? && Settings.sitemaps.ping_enabled?
    Rake::Task["sitemap:upload_to_s3"].invoke
    SitemapGenerator::Sitemap.ping_search_engines(:sitemap_index_url => "https://#{Settings.sitemaps.aws.bucket}.s3.amazonaws.com/sitemaps/sitemap.xml.gz")
  end
end

We are extending the default rake task using enhance. Note that in the last call we send the search engines the URL where they can find our sitemap, because the file lives in the S3 bucket and not on our server.
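If enhance is new to you, the idea is that a block passed to it is appended to the task's existing actions, so it runs after the original work. The sketch below shows this with plain Rake outside of Rails; the task name demo_create and the helper method are made up for illustration.

```ruby
require 'rake'

# Defines a throwaway task, enhances it, runs it once, and returns the order
# in which the original action and the enhancement were executed.
def run_enhanced_task_demo
  log = []
  Rake::Task.define_task(:demo_create) { log << 'create sitemap' }
  Rake::Task[:demo_create].enhance { log << 'upload to S3' }
  Rake::Task[:demo_create].invoke
  log
end

# Usage: p run_enhanced_task_demo
```

Calling the helper once returns ["create sitemap", "upload to S3"], which is the ordering we rely on: the upload only happens after the sitemap has been created.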

Configure sitemap in robots.txt

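Robots.txt is a standard used by websites to communicate with web crawlers and other web robots. In your public/robots.txt, set Sitemap to the URL of your remote sitemap endpoint. Since robots.txt is a static file, replace the #{...} placeholder below with your actual bucket name: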


Sitemap: https://#{Settings.sitemaps.aws.bucket}.s3.amazonaws.com/sitemaps/sitemap.xml.gz
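If you would rather generate that line than type the bucket name by hand, a sketch like the following builds the URL and makes sure robots.txt carries exactly one Sitemap directive. Both helpers are hypothetical names of ours, and my-example-bucket is a placeholder.

```ruby
# Builds the public URL of a sitemap stored under the sitemaps/ prefix of a bucket.
def s3_sitemap_url(bucket, file_name = 'sitemap.xml.gz')
  "https://#{bucket}.s3.amazonaws.com/sitemaps/#{file_name}"
end

# Returns robots.txt content with any existing Sitemap directive replaced
# by a single one pointing at the bucket.
def robots_with_sitemap(robots_txt, bucket)
  kept = robots_txt.lines.map(&:chomp).reject { |line| line.match?(/\ASitemap:/i) }
  (kept + ["Sitemap: #{s3_sitemap_url(bucket)}"]).join("\n") + "\n"
end

puts robots_with_sitemap("User-agent: *\nDisallow: /admin\n", 'my-example-bucket')
```

The returned string can then be written to public/robots.txt as part of your deploy, whichever way fits your setup.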

Schedule sitemap in cron

With the help of a scheduler or cron, we can automate the above rake task using the command below:


rake sitemap:refresh

Conclusion:

Sitemaps are particularly beneficial on websites in the following cases:

  • If an area of the website is not reachable through the browsable interface.
  • Search engines normally don’t process Ajax, Flash or Silverlight content. If a webmaster uses this kind of content, having a sitemap can be beneficial.
  • If the site is huge, web crawlers may sometimes look only for new content. Also, if your website has many pages, there is a chance that they are not well linked. A sitemap is beneficial in these cases.

I hope this post is informative and helpful to you. As a Ruby on Rails developer, generating this sitemap took me just a few minutes. Our team at Agira Technologies has worked on many different projects using Ruby on Rails. Follow us to learn more about our Ruby on Rails work.