Python Script: Analyzing Content Similarity via Cosine Method




Have you ever worked on Programmatic SEO and found yourself wondering whether duplicate content is simply bound to happen, even when your competitors have it too?

If yes, you are not alone in this boat; we have all been there. Long story short, it’s okay to have some duplicate content across your series of Programmatic SEO content pages.

But it comes with a caveat: how much duplicate content will Google tolerate? That’s the real question, because you can’t have 100% duplicate content on every page with only the H1 and meta title being unique.

Why can’t you have 100% duplicate content?

If you do, chances are that Google will not index many of your Programmatic SEO landing pages and will instead bucket those URLs under “Duplicate, Google chose different canonical than user” in Search Console.

For example, say you have two landing pages that attract two different segments:

  • Hair Doctors in New Jersey
  • Hair Doctors in California

Because the content is 100% duplicate, Google may select the New Jersey page as the canonical for the California page, i.e. the California page isn’t indexed and, as a result, you aren’t able to capture the audience from California.

Moreover, what tools are you relying on to calculate the duplicate content shared between the pages? Most of the tools out there that calculate internal duplicate content each use a different algorithm, which means you can’t fully rely on them.

I was faced with this dilemma recently, which is when I concluded that I needed to build something of my own where I control the algorithm that calculates the duplicate content %.

I have built a Python Script that leverages TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity to calculate the content duplication share. 

The way the script works is that TF-IDF vectorization converts the text content into numerical vectors, where each value represents the importance of a word in the documents.

The vectors variable here contains a NumPy array with two TF-IDF vectors.

vectors[0] – gets the 1st vector

vectors[1] – gets the 2nd vector

We pass these individually to cosine_similarity to compare the 1st vector vs the 2nd vector.

So [vectors[0]] and [vectors[1]] are just indexing into the array to get each vector out.
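To make the indexing concrete, here is a minimal sketch of that step, assuming two short placeholder strings instead of full page copy:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two placeholder documents, just for illustration
doc_a = "Hair doctors in New Jersey offering consultations"
doc_b = "Hair doctors in California offering consultations"

# Each row of this array is one document's TF-IDF vector
vectors = TfidfVectorizer().fit_transform([doc_a, doc_b]).toarray()

# cosine_similarity expects 2D inputs, hence the extra brackets around each vector
score = cosine_similarity([vectors[0]], [vectors[1]])[0][0]
print(score)  # a value between 0 (nothing shared) and 1 (identical)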

A vector here is just a list of numbers that represents the text document. It’s an abstract mathematical representation.

For example, let’s say we have a very simple document

  “The cat is playing”

We can represent this text as a vector like:

[0.2, 0.1, 0.0, 0.5, 0.05, 0.15]

What does this vector mean?

Each number in the vector represents the importance of a word from the text:

0.2 -> the (carries some weight, but it’s a common word)

0.1 -> cat (moderately important)

0.0 -> is (not important)

0.5 -> playing (very important, the most distinctive word)

etc.

So vectors are numeric fingerprints of the document text.

The TF-IDF vectorizer takes text and converts it into this vector format automatically.

Then we can compare vectors using cosine similarity to see how “aligned” the text documents are based on their vector representations.
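If you want to see one of these fingerprints for yourself, here is a rough sketch using scikit-learn’s TfidfVectorizer; the second toy sentence is only there so the vectorizer has more than one document to learn from, and the actual numbers will not match the illustrative values above:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(["The cat is playing", "The dog is sleeping"])

print(vectorizer.get_feature_names_out())  # the vocabulary across both sentences
print(matrix.toarray()[0])                 # the first sentence as a vector of numbers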

The determination of which words are important and what numbers to assign is the key job of the TF-IDF Vectorizer.

Let me explain at a high level how it decides that:

TF-IDF stands for Term Frequency – Inverse Document Frequency. It looks at two things:

Term Frequency: How frequently does this word appear in the document? More appearances mean it’s likely more relevant.

Inverse Document Frequency: How common or rare is this word across all documents? Rare words are more informative.

For example:

“the” has a high term frequency in this short text, but it is also very common in English, so its IDF is low.

“playing” appears only once, but it is comparatively rare versus the other words, so it gets more weight.

Based on these two values, TF-IDF assigns a final score to each word as its vector value.

So words like “the” and “is” get lower values even with a high term frequency, since they are not very discriminative. Distinctive words, typically nouns and verbs, get higher values.

In essence, TF-IDF detects which words characterize and distinguish this piece of text.

The actual vector values come from doing the underlying math on term frequencies and document frequencies, but hopefully this gives an intuition for how it decides importance!
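Here is a small sketch of that intuition in practice, again with toy sentences: it prints the TF-IDF weights for the first sentence, and the words that appear in every document (“the”, “is”) end up with lower weights than the distinctive ones.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat is playing",
    "The dog is sleeping",
    "The bird is singing",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs).toarray()[0]

# Print each word of the first sentence with its TF-IDF weight, highest first
for word, weight in sorted(zip(vectorizer.get_feature_names_out(), weights),
                           key=lambda pair: -pair[1]):
    if weight > 0:
        print(f"{word}: {weight:.2f}")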

Can Cosine Similarity be trusted for calculating content similarity?

(Image source: TowardsDataScience.com)

In a word: yes, absolutely. What makes Cosine Similarity more powerful in my script is the TF-IDF Vectorizer, because TF-IDF gives meaning and importance to the vectors in the context of the uploaded content, and on top of that Cosine Similarity does its job of calculating the similarity %.

Without further ado, here is the Python script that helps you analyze the duplicate content % between two text files.

Find the script below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def read_text_file(file_path):
    # Read the full contents of a text file
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()


def calculate_similarity(file1, file2):
    text1 = read_text_file(file1)
    text2 = read_text_file(file2)

    # Convert both documents into TF-IDF vectors
    vectors = TfidfVectorizer().fit_transform([text1, text2]).toarray()

    # Compare the two vectors and express the result as a percentage
    similarity = cosine_similarity([vectors[0]], [vectors[1]])[0][0]
    similarity = round(similarity * 100, 2)

    print(f"The content similarity between the documents is: {similarity}%")


if __name__ == '__main__':
    # Specify your txt file names below, the two files that you are comparing
    file1 = "text1.txt"
    file2 = "text2.txt"

    calculate_similarity(file1, file2)

The best use case is to check the content similarity % between Programmatic SEO pages, let’s say two location pages.

This is not to say that duplicate content across your Programmatic SEO series pages is automatically bad; the idea is to see how much your website has versus how much your competitors’ websites have.

Let’s say you did this analysis and found that your pages have 99% content similarity, but all your SERP competitors have <60% content similarity.

Then that is an indication that your website’s Programmatic SEO content template is in dire need of improvement.
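If you want to run that kind of check across more than two pages at once, here is a rough sketch that compares every pair of pages; the file names are hypothetical placeholders for exported page copy:

from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical text files, one per location page
pages = ["new-jersey.txt", "california.txt", "texas.txt"]
texts = [open(page, 'r', encoding='utf-8').read() for page in pages]

# One TF-IDF vector per page
vectors = TfidfVectorizer().fit_transform(texts).toarray()

# Compare every pair of pages and print the similarity %
for (i, page_a), (j, page_b) in combinations(enumerate(pages), 2):
    score = cosine_similarity([vectors[i]], [vectors[j]])[0][0]
    print(f"{page_a} vs {page_b}: {round(score * 100, 2)}%")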

Screenshot: content similarity checker output

Here is an example of how I checked the content similarity ratio between two delivery location pages of The Bouqs, a company that sought funding on Shark Tank in the USA.

In the above screenshot, you can see the content similarity ratio between the two compared texts is 51%.

Pro tip: It’s best to use this script in Replit rather than other IDEs, because in Replit you can create the two txt files, simply paste the page content into each, and run the similarity check. In other IDEs like Google Colab, for every distinct document you will need to create the files separately and upload them using the upload function.
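If you do end up in Google Colab, one way to get the two text files into the runtime is Colab’s files.upload() helper, sketched below:

from google.colab import files

# Opens a file picker; select text1.txt and text2.txt from your machine
uploaded = files.upload()

# Confirm which files landed in the runtime before running the similarity check
print(list(uploaded.keys()))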

If you want to use this on Replit, then feel free to fork my repl → https://replit.com/@KunjalChawhan/Content-Similarity-Checker 

I hope you found this content useful. Stay tuned for more Python SEO scripts heading your way.

Kunjal Chawhan


SEO Manager at Botpresso