Hreflang XML Sitemap Generator Using Python & ChatGPT

The journey of SEO is full of challenges and surprises: traffic drops📉, pages fall out of the index, and sometimes even penalties hit. But those challenges are what make our lives worth it, aren’t they?

We SEOs always try to find a way to overcome problems, and there lies the path to innovation. This post is about one such challenge we faced and how we stepped out of our comfort zone to solve it.

Let’s begin.

The story 

For the past few months, we have been working with an enterprise client to migrate their domains targeting multiple geographies and languages to their primary domain as subfolders. 

Combined, the domains had thousands and thousands of pages, and at that scale we had to give our 110%. 

Like all other SEO processes, we prepared a checklist with all the checkpoints to ensure a smooth migration.

So we followed the checklist religiously and marked milestones to have a clear way ahead. And at one step, we had to implement hreflang tags for both geos. 

We decided to add hreflang via HTML tags + sitemaps to ensure Google picks up the right signals. 

But what’s the hurdle there – you might ask. 

Before moving on to that, let’s see a quick intro about the hreflang tag. 

The mighty hreflangs!

Hreflang is an HTML attribute, set on a link tag, that specifies the language and, optionally, the region that a page is targeting.

<link rel="alternate" hreflang="en-US" href="https://example.com/en-us/" />

If you have a page that targets India, the United States, and Australia, you must use hreflang tags to inform Google that the page is available in three different geographies.

And this will help Google serve the correct page for the users in those regions. 

💭Imagine an e-commerce page that contains a list of products targeting India and references ‘Rupee’ as the currency. What if this page is shown instead of the US page to people in the US? That won’t be a great UX, right? 

With hreflang tags added, you can help Google pick up the signals and show the US version of the page to the users in that region. 
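To make the idea concrete, here is a minimal sketch that builds the set of hreflang link tags for a page with India, US, and Australia versions. The function name and the example.com URLs are assumptions made up for illustration; they are not from our client project.

```python
from html import escape


def hreflang_links(alternates):
    """Return one <link rel="alternate"> tag per (hreflang, URL) pair."""
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in alternates.items()
    ]


# Hypothetical URLs for the three regions mentioned above
alternates = {
    "en-IN": "https://example.com/en-in/",
    "en-US": "https://example.com/en-us/",
    "en-AU": "https://example.com/en-au/",
}

for tag in hreflang_links(alternates):
    print(tag)
```

The same full set of tags, including the one that points back to the page itself, would go in the head of every version of the page.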

There’s more to international SEO than hreflang tags, so read this post for an in-depth understanding. 

The challenge

Now we know the importance of hreflang tags on an international website. But what’s the challenge with implementing them?

We had more than 25,000 URLs targeting multiple geographies, and we had to create a sitemap for each geography with hreflang alternates for every URL. 

Here’s an example sitemap for one URL: 
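As a rough illustration of what one such entry looks like, the sketch below prints a single sitemap entry with its hreflang alternates. The helper name and the example.com URLs are hypothetical; the element names follow the sitemap and xhtml namespaces used later in this post.

```python
from xml.sax.saxutils import escape, quoteattr


def sitemap_entry(loc, alternates):
    """Build one <url> entry listing every language/region alternate."""
    lines = ["    <url>", f"        <loc>{escape(loc)}</loc>"]
    for lang, href in alternates.items():
        # quoteattr() adds the surrounding quotes and escapes special characters
        lines.append(
            f"        <xhtml:link rel=\"alternate\" "
            f"hreflang={quoteattr(lang)} href={quoteattr(href)} />"
        )
    lines.append("    </url>")
    return "\n".join(lines)


# Hypothetical alternates for one page
alternates = {
    "en-IN": "https://example.com/en-in/",
    "en-US": "https://example.com/en-us/",
    "en-AU": "https://example.com/en-au/",
}

print(sitemap_entry("https://example.com/en-in/", alternates))
```

Every URL in the sitemap gets an entry like this, which is why doing it by hand for 25,000+ URLs was not an option.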

We surfed the internet for tools & ideas to get this done within our deadline. But guess what? We couldn’t find a tool that would generate an international sitemap for us. 

That’s a bummer! 

The Saviour

So without any solution from Google, we sought help from the ‘Saviour’ of recent times. (You must have guessed it already.)

And your guess is right. We opened a new tab in Chrome, logged into ChatGPT, and started typing prompts to create a tool that could generate the international sitemap by crawling a list of pages and reading their hreflang tags. 

After countless prompts and plenty of torture, ChatGPT gave us a Python script to generate the international sitemap. 

But. (😁When the story goes so cheerful & positive, there should be a U-turn with ‘but’ to end it with a bit spicier climax – the Indian film recipe) 

The code ChatGPT produced crawled the pages slower than the opening credits of a James Bond movie. Unfortunately, we didn’t have much time. 

We consulted a developer, who suggested using the Scrapy library for Python instead of BeautifulSoup. 

We then went back to ChatGPT and asked it to replace BeautifulSoup with Scrapy, and guess what? It worked like a charm. 

Voila! We ran the code on Replit to generate the sitemap for our list of pages, and it saved us tons of manual work and frustration. 

Here’s the code to generate an international sitemap. It reads the URLs to crawl from urls.txt (one per line) and writes the result to sitemap.xml.

from xml.sax.saxutils import escape, quoteattr

import scrapy


class HreflangSpider(scrapy.Spider):
    name = "hreflang"
    output_path = 'sitemap.xml'

    def start_requests(self):
        # Start a fresh sitemap so entries from an earlier run are not mixed in
        with open(self.output_path, 'w', encoding='utf-8') as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"\n')
            f.write('        xmlns:xhtml="http://www.w3.org/1999/xhtml">\n')

        # Read the URLs to crawl, one per line, skipping blank lines
        with open('urls.txt', 'r', encoding='utf-8') as f:
            urls = [line.strip() for line in f if line.strip()]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Find all the hreflang tags on the page
        hreflang_tags = response.css('link[rel="alternate"][hreflang]')

        # Map each hreflang value to its absolute URL
        hreflang_dict = {}
        for tag in hreflang_tags:
            lang = tag.attrib['hreflang']
            href = tag.attrib.get('href')
            if href:
                hreflang_dict[lang] = response.urljoin(href)

        # Move the x-default URL (if present) to the end of the list
        default_href = hreflang_dict.pop('x-default', None)
        if default_href:
            hreflang_dict['x-default'] = default_href

        # Append this page's <url> entry, escaping XML special characters
        with open(self.output_path, 'a', encoding='utf-8') as f:
            f.write('    <url>\n')
            f.write(f'        <loc>{escape(response.url)}</loc>\n')
            for lang, href in hreflang_dict.items():
                f.write(f'        <xhtml:link rel="alternate" hreflang={quoteattr(lang)} href={quoteattr(href)} />\n')
            f.write('    </url>\n')

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; close the root element
        with open(self.output_path, 'a', encoding='utf-8') as f:
            f.write('</urlset>\n')

And here’s a repl.it version of the code that you can fork to create a sitemap for your international site. 
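Once the sitemap is generated, a quick sanity check can catch pages where no hreflang tags were found. The sketch below is one possible way to do that with Python's standard library only; the function names and the sample XML are our own assumptions for illustration, not part of the spider.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the sitemap and its xhtml:link elements
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def read_alternates(xml_data):
    """Map each <loc> to a {hreflang: href} dict of its alternates."""
    root = ET.fromstring(xml_data)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
        }
    return result


def urls_without_alternates(xml_data):
    """List the URLs whose entry carries no hreflang alternates."""
    return [loc for loc, alts in read_alternates(xml_data).items() if not alts]


# Hypothetical two-URL sitemap; the second entry has no alternates
sample = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>https://example.com/en-us/</loc>
        <xhtml:link rel="alternate" hreflang="en-US" href="https://example.com/en-us/" />
        <xhtml:link rel="alternate" hreflang="en-IN" href="https://example.com/en-in/" />
    </url>
    <url>
        <loc>https://example.com/en-us/about/</loc>
    </url>
</urlset>
""".strip()

print(urls_without_alternates(sample))
```

To check a real file, you could read it in binary mode, for example with open('sitemap.xml', 'rb'), and pass the contents to the same function.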

Learnings

With this endeavor, we learned that persistent effort, combined with putting new technology to good use, can help us create the beautiful things we imagine. 

There are a lot of AI tools like ChatGPT that can help us automate and simplify our processes so that we can focus on the important things that only a human can do. 

So, stop fearing AI taking over and work together with it to create a better future. #happyautomating

Admin