Ever wished you could measure the content publishing velocity of any website within minutes? Well, now you can!
But before jumping into the details, let’s understand what content publishing velocity really means.
What is Content Publishing Velocity?
Content publishing velocity simply refers to the number of new content pages published on a website over a given period. This could be daily, weekly, biweekly, or monthly.
Why Does Publishing Velocity Matter?
The idea here is not to emphasize quantity.
SEO is undoubtedly a quality-over-quantity game, but it isn’t entirely black and white either.
For instance, if you’re helping a client whose SEO success depends heavily on content-led growth, publishing velocity becomes a critical factor. This is particularly true for niches like publisher SEO and SaaS SEO.
Consider this:
- What if your competitor is aggressively publishing new content while you’re not?
- What if that’s what’s giving them an edge over you?
To address this, we’ve built a simple Python script that calculates publishing velocity by analyzing XML sitemaps.
What You Need Before Running the Script
- XML Sitemaps containing date information.
- Advertools Python Library for sitemap scraping.
- Google Colab (or any other Python environment) to run the script and visualize the trends.
Step-by-Step Guide to Measure Publishing Velocity
Step 1: Install Required Libraries
Start by installing the necessary packages to scrape sitemap data and visualize the results. Run the following command:
# Install the necessary packages
!pip install advertools plotly pandas
This installs Advertools for sitemap scraping and Plotly for creating data visualizations.
Step 2: Import Required Libraries
Now, import the libraries you’ll need for data handling and visualization:
# Importing required libraries
import advertools as adv
import pandas as pd
import plotly.express as px
Step 3: Specify the Sitemap URLs
List the URLs of the sitemaps you want to analyze. You can add as many as you need:
sitemap_urls = [
    "https://domain.com/sitemap_2024_week47.xml",
    "https://domain.com/news-sitemap.xml"
]
You can include multiple sitemaps, even if you have hundreds of them.
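If the site exposes a sitemap index, you don’t have to list every child sitemap by hand: Advertools’ sitemap_to_df can also take the index URL (or even a robots.txt URL) and fetch the sitemaps it references. A minimal sketch, assuming a hypothetical index at https://domain.com/sitemap_index.xml:
# A sitemap index also works; advertools follows the child sitemaps it lists
# (the index URL below is a hypothetical example)
index_df = adv.sitemap_to_df("https://domain.com/sitemap_index.xml")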
Step 4: Download URLs & Create Dataframe
Create a function that downloads and parses each sitemap and combines the results into a single DataFrame for easy handling:
# Function to download and parse the sitemap and create a DataFrame
def sitemap_to_df(sitemap_urls):
    sitemap_df = pd.DataFrame()
    for sitemap in sitemap_urls:
        temp_df = adv.sitemap_to_df(sitemap)
        sitemap_df = pd.concat([sitemap_df, temp_df], ignore_index=True)
    return sitemap_df
Step 5: Fetch URLs from the Sitemap
Use the function created in Step 4 to fetch and merge data from all the sitemaps:
# Extracting URLs from the sitemaps
sitemap_df = sitemap_to_df(sitemap_urls)
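Before moving on, it’s worth a quick sanity check that the sitemaps were actually fetched. These are plain pandas calls, not part of the original script:
# Quick sanity check: how many URLs were fetched and what the data looks like
print(len(sitemap_df))
print(sitemap_df.head())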
Step 6: Remove Duplicate URLs
It’s important to remove any duplicate URLs to avoid skewed results. The following line handles that:
# Removing duplicate URLs
sitemap_df = sitemap_df.drop_duplicates(subset=['loc'])
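If you’d also like to log how many duplicates were dropped, a small variant of the same step (a sketch; use it in place of the line above) looks like this:
# Variant of the deduplication step that also reports how many rows were removed
rows_before = len(sitemap_df)
sitemap_df = sitemap_df.drop_duplicates(subset=['loc'])
print(f"Duplicate URLs removed: {rows_before - len(sitemap_df)}")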
Step 7: Extract and Handle Publication Date
Next, extract the publication date from the sitemaps. If both lastmod and news_publication_date are present, lastmod is preferred and news_publication_date fills in any gaps. If no date is available, the entry is marked as NaT (Not a Time).
if 'lastmod' in sitemap_df.columns and 'news_publication_date' in sitemap_df.columns:
    sitemap_df['published_date'] = sitemap_df['lastmod'].fillna(sitemap_df['news_publication_date']).fillna(pd.NaT)
elif 'lastmod' in sitemap_df.columns:
    sitemap_df['published_date'] = sitemap_df['lastmod'].fillna(pd.NaT)
elif 'news_publication_date' in sitemap_df.columns:
    sitemap_df['published_date'] = sitemap_df['news_publication_date'].fillna(pd.NaT)
else:
    sitemap_df['published_date'] = pd.NaT
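If you’re not sure which date fields your sitemaps actually expose, a quick optional check (not part of the original script) lists the candidate columns so you can see which branch of the logic above applies:
# Optional: list the date-related columns present in the scraped sitemaps
date_columns = [col for col in sitemap_df.columns if 'lastmod' in col or 'date' in col]
print(date_columns)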
Step 8: Filter Valid Published Dates
Now, we’ll filter out any rows where the published date is missing or invalid, ensuring the dataset only includes valid entries.
# Filtering rows that have valid published dates
sitemap_df['published_date'] = pd.to_datetime(sitemap_df['published_date'], errors='coerce')
sitemap_df = sitemap_df.dropna(subset=['published_date'])
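Sitemaps with no date information will drop out entirely at this point, so it helps to confirm how many URLs survived the filter:
# How many URLs kept a valid published date
print(f"URLs with a valid published date: {len(sitemap_df)}")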
Step 9: Adjust Date Format
Next, we’ll simplify the date format by extracting only the date (ignoring the time):
sitemap_df['published_date'] = sitemap_df['published_date'].dt.date
Step 10: Count Pages Published per Day
Group the data by publication date to calculate how many pages were published each day:
publishing_velocity = sitemap_df.groupby('published_date').size().reset_index(name='no_of_pages')
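As noted earlier, velocity can also be measured weekly or monthly rather than daily. A minimal sketch of a weekly roll-up using pandas’ Grouper (the column is converted back to datetime first, since the previous step reduced it to a plain date):
# Optional: aggregate the same data by week instead of by day
weekly = sitemap_df.copy()
weekly['published_date'] = pd.to_datetime(weekly['published_date'])
weekly_velocity = (
    weekly.groupby(pd.Grouper(key='published_date', freq='W'))
    .size()
    .reset_index(name='no_of_pages')
)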
Step 11: Visualize Publishing Trends
Finally, use Plotly to create a line graph that shows publishing velocity over time. This will help you identify trends and activity peaks.
# Plotting the publishing velocity
fig = px.line(
    publishing_velocity,
    x='published_date',
    y='no_of_pages',
    title='Publishing Velocity Over Time',
    labels={
        'published_date': 'Date',
        'no_of_pages': 'Number of Pages Published'
    }
)
# Displaying the Plotly visualization
fig.show()
What You’ll See
After running the script, you’ll get a clear graph showing the number of pages published over time. This will help you identify trends, compare competitor activity, and make better decisions for content-led SEO.
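If you scraped sitemaps from more than one domain (say, yours and a competitor’s), one way to compare them on a single chart is to tag each URL with its domain and let Plotly colour the lines accordingly. This is a sketch built on the same DataFrame, not part of the original script:
from urllib.parse import urlsplit

# Tag each URL with its domain so competing sites can be compared on one chart
sitemap_df['domain'] = sitemap_df['loc'].apply(lambda url: urlsplit(url).netloc)
velocity_by_domain = (
    sitemap_df.groupby(['domain', 'published_date'])
    .size()
    .reset_index(name='no_of_pages')
)
fig = px.line(
    velocity_by_domain,
    x='published_date',
    y='no_of_pages',
    color='domain',
    title='Publishing Velocity by Domain'
)
fig.show()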
Final Thoughts
This script is an effective way to measure and visualize website publishing velocity. Whether you’re benchmarking competitors or analyzing your own publishing frequency, it provides valuable insights with minimal effort.