ChatGPT has been helping the SEO industry a lot!
Generating Python scripts for specific SEO tasks, for instance, has become child’s play. Here at Botpresso, we have built plenty of Python scripts for SEO.
Last week we shared a Python script, generated via ChatGPT, that helps build an international sitemap with hreflang tags.
Today, we are sharing an amazing Python script that will help you find internal links that are missing from Google’s rendered HTML.
But let’s first delve into why we need this script in the first place.
Problem Statement
Let’s say you have a project with excellent internal linking throughout the site that still hasn’t delivered the expected growth.
The problem could be that Google isn’t rendering some of those internal links.
Solution
Using this script, you can immediately find out which links on a webpage are missing from the rendered HTML.
Step 1: Create a raw HTML txt file (the script expects raw_html_file.txt) and paste in the page’s view-source: code.
Step 2: Create a rendered HTML txt file (the script expects google_rendered_html_file.txt) and paste in the rendered HTML code. You can get the rendered HTML from Google’s Rich Results tool or from your Google Search Console.
Step 3: Run the script
And there you have it: the output lists all the links that aren’t found in the rendered HTML. Now you can crawl these links with your favorite crawler, such as Screaming Frog, to see which of them are URLs that should be indexable.
And that’s the opportunity you’re missing out on.
Here is the Code
from bs4 import BeautifulSoup


def extract_internal_links(html_file):
    """Return the set of <a> hrefs in an HTML file that do not start with 'http'."""
    with open(html_file, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    links = soup.find_all('a')
    internal_links = set()
    for link in links:
        href = link.get('href')
        # Treat any href that does not start with 'http' as internal
        if href and not href.startswith('http'):
            internal_links.add(href)
    return internal_links


# Paths to the two txt files created in Steps 1 and 2
raw_html_file = "./raw_html_file.txt"
google_html_file = "./google_rendered_html_file.txt"

raw_links = extract_internal_links(raw_html_file)
google_links = extract_internal_links(google_html_file)

# Links present in the raw source but absent from the rendered HTML
missing_links = raw_links - google_links

print(f"Number of internal links found in both files: {len(raw_links & google_links)}")
if len(missing_links) > 0:
    print(f"Number of internal links missing from Google rendered HTML: {len(missing_links)}")
    print(f"Missing internal links: {missing_links}")
else:
    print("No missing internal links in Google rendered HTML.")
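One thing to keep in mind: the `startswith('http')` check in the script counts only relative hrefs as internal. That means internal links written as full URLs are skipped, while values like `mailto:` or `#top` can slip in. If that matters for your site, here is a minimal sketch of one way to normalize the hrefs first. The function name `normalize_internal_links` and the example.com URLs are our own placeholders, not part of the original script, and it assumes you know the URL of the page you pasted.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_internal_links(hrefs, base_url):
    """Return absolute, fragment-free URLs from hrefs that share base_url's host.

    Hypothetical helper: base_url is the URL of the page whose HTML was pasted.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal = set()
    for href in hrefs:
        href = href.strip()
        # Skip empty values and pure in-page anchors such as "#top"
        if not href or href.startswith('#'):
            continue
        # Resolve relative paths against the page URL, then drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        # Keep only http(s) links on the same host (drops mailto:, tel:, javascript:)
        if parsed.scheme in ('http', 'https') and parsed.netloc.lower() == base_host:
            internal.add(absolute)
    return internal


# Example with placeholder values
sample_hrefs = ["/about", "https://www.example.com/contact#form", "mailto:hi@example.com"]
print(normalize_internal_links(sample_hrefs, "https://www.example.com/books/"))
```

To use it with the script, you would collect every href from both files (without the `startswith('http')` filter), pass each set through this function, and then take the difference as before. A side benefit is that the missing links come out as full URLs, ready to paste into a crawler.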
Here is the Replit File that you can fork & start using
Here is an example showing how many links from the Barnes & Noble website were missing from the rendered HTML.