Python SEO Script to Find Internal Links Missing from Google Rendered HTML

ChatGPT has made generating Python scripts for specific SEO tasks much easier. Here at Botpresso, we have built plenty of Python scripts for SEO this way.

Last week we shared a Python script, generated via ChatGPT, that builds an international sitemap with hreflang tags.

Today, we are sharing a Python script that helps you find internal links that are missing from Google's rendered HTML.

But let’s first delve into why we need this script in the first place.

Problem Statement

Let’s say you have a project with excellent internal linking throughout the site, but it still hasn’t generated the expected growth.

The problem could be that Google isn’t rendering those internal links: they exist in the page source, but they are absent from the HTML Google sees after rendering the page.

Solution 

Using this script, you can quickly find out which links on a webpage are missing from the rendered HTML.

Step 1: Create a raw HTML txt file (raw_html_file.txt) and paste in the page's view-source: code.

Step 2: Create a rendered HTML txt file (google_rendered_html_file.txt) and paste in the rendered HTML, which you can get from Google's Rich Results Test or from Google Search Console.

Step 3: Run the script.

And there you have it: the output lists every link that appears in the raw HTML but not in the rendered HTML. You can then crawl these links with your favorite crawler, such as Screaming Frog, to see which of them are URLs that should be indexable.

And that’s the opportunity you’re missing out on.
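Most crawlers expect full URLs rather than relative hrefs, so the crawl step above usually needs a small conversion first. The following is a minimal sketch of that conversion, not part of the original script: the function names `build_crawl_list` and `write_crawl_list`, the output file name, and the `example.com` URLs are all assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit


def build_crawl_list(missing_links, page_url):
    """Turn relative hrefs into a sorted list of unique absolute URLs.

    `missing_links` is any iterable of href strings, and `page_url` is the
    URL of the page the HTML was copied from (both hypothetical inputs).
    """
    urls = set()
    for href in missing_links:
        # urljoin resolves "/path", "path" and "../path" against the page URL
        absolute = urljoin(page_url, href)
        # Keep only web pages; this drops mailto:, tel: and similar hrefs
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        # Drop the fragment so "/a#top" and "/a" collapse into one URL
        urls.add(absolute.split("#", 1)[0])
    return sorted(urls)


def write_crawl_list(urls, out_path):
    """Write one URL per line, ready to paste or upload into a crawler."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls) + "\n")
```

For example, `write_crawl_list(build_crawl_list(missing_links, "https://www.example.com/"), "missing_links_to_crawl.txt")` would produce a text file you could load into a crawler's list mode.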

Here is the Code

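from bs4 import BeautifulSoup

# href prefixes that are not crawlable page links
NON_PAGE_PREFIXES = ('#', 'mailto:', 'tel:', 'javascript:')

def extract_internal_links(html_file):
    """Return the set of relative hrefs found in the <a> tags of html_file."""
    with open(html_file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    internal_links = set()
    for link in soup.find_all('a'):
        href = link.get('href')
        # Relative hrefs are treated as internal. Absolute URLs (starting
        # with "http") are skipped, so internal links written with the full
        # domain are not compared by this script.
        if href and not href.startswith('http') and not href.startswith(NON_PAGE_PREFIXES):
            internal_links.add(href)
    return internal_links

raw_html_file = "./raw_html_file.txt"
google_html_file = "./google_rendered_html_file.txt"

raw_links = extract_internal_links(raw_html_file)
google_links = extract_internal_links(google_html_file)

# Links present in the page source but absent from Google's rendered HTML
missing_links = raw_links - google_links

print(f"Number of internal links found in both files: {len(raw_links & google_links)}")
if missing_links:
    print(f"Number of internal links missing from Google rendered HTML: {len(missing_links)}")
    print(f"Missing internal links: {sorted(missing_links)}")
else:
    print("No missing internal links in Google rendered HTML.")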
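One limitation of the script is that it only compares relative hrefs, so an internal link written as a full URL is ignored, and the same page linked as `/a` in one file and as a full URL in the other would look like a mismatch. A possible way around this, sketched below under the assumption that you know the URL of the page you copied the HTML from, is to resolve every href to an absolute URL and keep only those on the same host. The function name `normalize_internal_links` and the `example.com` URLs are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin, urlsplit


def normalize_internal_links(hrefs, page_url):
    """Return the set of absolute, same-host URLs found in `hrefs`.

    `hrefs` is any iterable of raw href values (None and empty values are
    skipped); `page_url` is the assumed URL of the page being checked.
    """
    site_host = urlsplit(page_url).netloc.lower()
    internal = set()
    for href in hrefs:
        if not href:
            continue
        href = href.strip()
        # A bare fragment only points to a spot on the current page
        if href.startswith("#"):
            continue
        # Resolve relative hrefs against the page URL
        parts = urlsplit(urljoin(page_url, href))
        # Skip mailto:, tel:, javascript: and other non-web schemes
        if parts.scheme not in ("http", "https"):
            continue
        # Keep only links that stay on the same host as the page
        if parts.netloc.lower() != site_host:
            continue
        # Remove the fragment so "/a#x" and "/a" count as the same link
        internal.add(parts._replace(fragment="").geturl())
    return internal
```

To use it, you would collect every `href` from each file with BeautifulSoup (without the `startswith('http')` filter), pass both lists through `normalize_internal_links` with the same `page_url`, and then take the set difference exactly as the script does.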

Here is the Replit File that you can fork & start using 

Replit File Link

Here is an example showing how many links from the Barnes & Noble website were missing from the rendered HTML.

Kunjal Chawhan

SEO Manager at Botpresso