Automatically generate site-wide meta descriptions with Python + BART for PyTorch. by Reres Finton | July, 2022


A Guide on How to Efficiently Create Quality Summarized Content for Search Engine Optimization (SEO)

Photo by Mohammad Rahmani on Unsplash

In my previous articles, I explored the many applications for Natural Language Processing (NLP) implementation in the realm of digital marketing and e-commerce. This article is oriented towards technical search engine optimization (SEO) professionals who are comfortable with Python, the basics of NLP, and simple data mining.

The purpose of this article is to provide readers with a guide on how to quickly generate SEO meta descriptions for an entire site using a simple Python script. The script will be divided into four main sections:

  1. Domain URL extraction and cleanup
  2. Data mining of each domain URL
  3. NLP pipeline applied to the mined data
  4. Meta description standardization

This program requires several libraries to function. We start with a brief description of each library with its respective installation instructions.

Data organization dependencies

#Data organization dependencies
from os import remove
import pandas as pd

pandas is a software library written for the Python programming language for data manipulation and analysis.

Web scraping dependencies

#Web scraping dependencies
#!pip install beautifulsoup4
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
#!pip install requests
import requests
#!pip install justext
import justext
import re
  • beautifulsoup4 is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
  • urllib.request is a module that defines functions and classes which help with URL opening, basic and digest authentication, redirection, cookies, and more.
  • requests is a popular open source HTTP library that makes working with HTTP requests easy.
  • jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is primarily designed to preserve text containing complete sentences and is therefore suitable for creating linguistic resources such as web corpora.
  • re is a module that provides regular expression matching operations.

NLP dependencies

#NLP dependencies
#!pip install transformers
from transformers import pipeline

#Summarize textual content using BART in PyTorch
bart_summarizer = pipeline("summarization")
  • Hugging Face transformers is a state-of-the-art machine learning library for JAX, PyTorch, and TensorFlow. It provides thousands of pre-trained models to perform tasks on various modalities such as text, vision, and audio.
  • transformers will be used in the program to generate a pipeline(), which allows simple import of any model from the model hub.
  • To generate the meta descriptions, the program needs to import a model capable of making inferences on a textual data task. There are many models capable of accomplishing this task, some written in TensorFlow (Google's deep learning framework) and some in PyTorch (Meta's deep learning framework).
  • In this program, a PyTorch model will be used, namely BART. BART is a sequence-to-sequence pre-training model for natural language generation, translation, and comprehension. BART will be used to form a summary of the textual data, i.e., the summarization pipeline will be loaded.
  • Start the BART summarization pipeline with the following command:

#Summarize textual content using BART in PyTorch
bart_summarizer = pipeline("summarization")

Note: Python IDEs do not support PyTorch or TensorFlow in their default states. It is possible to install either framework on the local machine. That said, Google Colab has both frameworks installed by default and is the recommended platform for deep learning operations without a local installation.

This completes the list of all required dependencies.

For the purposes of this article, the program will parse the website of a shawarma restaurant from London, Ontario, Canada.

To generate meta descriptions for all pages related to a website, a complete list of active URLs for the respective domain is required. For this, the Request class from the urllib.request module will construct a request for a URL, and the urlopen function will open the specified URL.

#Obtain list of links on a domain
domain = ""
req = Request(domain, headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)

Note: 403 errors can occur when scraping the content of a site. It is recommended to bypass HTML response errors by specifying a user agent in the headers parameter of the request. User agents allow servers and network peers to identify the application, operating system, vendor, and/or version of the requesting user agent.

Then the BeautifulSoup module will be implemented to parse the HTML page of the corresponding open URL. To parse an HTML document, the BeautifulSoup constructor would be configured to use lxml’s HTML parser with the following command:

soup = BeautifulSoup(html_page, "lxml")

Next, instantiate an empty list to store the domain's active links. For each a tag (hyperlink) in BeautifulSoup's parse tree, append to the empty list the href attribute indicating the destination of the link.

#urljoin (from urllib.parse) resolves relative hrefs against the domain
from urllib.parse import urljoin

links = []
for link in soup.findAll('a'):
    links.append(urljoin(domain, link.get('href')))
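As a quick aside, urljoin resolves relative href values against the domain, while leaving absolute URLs untouched. A minimal illustration, using an invented domain:

```python
from urllib.parse import urljoin

#Hypothetical domain used purely for illustration
domain = "https://www.example.com/"

#Relative hrefs are resolved against the domain;
#absolute hrefs are returned unchanged
menu_url = urljoin(domain, "/menu")
contact_url = urljoin(domain, "contact.html")
external_url = urljoin(domain, "https://other.example.org/page")

print(menu_url)      # https://www.example.com/menu
print(contact_url)   # https://www.example.com/contact.html
print(external_url)  # https://other.example.org/page
```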

Some appended URLs can be None, may have a non-conforming URL structure, or may be duplicates of other URLs belonging to the domain.

Remove None-type objects from the links list using a list comprehension:

#Links cleanup for None types
links = [link for link in links if link is not None]

Remove items from the list that do not conform to the standard URL structure:

#Links cleanup for non-URL objects
links = [link for link in links if link.startswith("http")]

Remove duplicate objects in the list to create a unique set:

#Links cleanup for duplicate URLs
links = [link for n, link in enumerate(links) if link not in links[:n]]

Lastly, remove any potential social links that lead off-site. These will likely not be parsed correctly and will only serve as additional noisy data for the summarization module to process. Feel free to add any additional socials that may be present on the domain under inspection.

#Links cleanup for social URLs
socials = ["instagram", "facebook", "twitter", "linkedin", "tiktok", "google", "maps", "mealsy"]
clean_links = []
for link in links:
    if not any(social in link for social in socials):
        clean_links.append(link)
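Taken together, the four cleanup steps can be sanity-checked on a small hand-made list. The sample URLs below are invented for illustration, and distinct variable names are used so the demo does not interfere with the script's own lists:

```python
#Invented sample links: a None, a fragment, a duplicate, and a social URL
sample_links = [
    "https://www.example.com/",
    None,
    "#menu",
    "https://www.example.com/",
    "https://www.instagram.com/example",
    "https://www.example.com/contact",
]

socials = ["instagram", "facebook", "twitter", "linkedin", "tiktok", "google", "maps", "mealsy"]

#Remove None types
sample_links = [link for link in sample_links if link is not None]
#Remove non-URL objects
sample_links = [link for link in sample_links if link.startswith("http")]
#Remove duplicates while preserving order
sample_links = [link for n, link in enumerate(sample_links) if link not in sample_links[:n]]
#Remove social links
sample_clean_links = [link for link in sample_links
                      if not any(social in link for social in socials)]

print(sample_clean_links)  # ['https://www.example.com/', 'https://www.example.com/contact']
```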

Given a previously created list of URLs that belong to a domain, the next step is to parse each individual HTML page for text content that can then be summarized in meta descriptions.

Start by instantiating a new list to store all the text content.

#Extract and clean text from each link
content = []

The following is a nested conditional for-loop, which will be explained in terms of a single code block. The purpose of the code block is as follows:

  • For each link in the previously created list of links, request the URL using the requests module, invoking the requests.get function (the same note about bypassing HTML response errors with a user agent applies here as well).
  • For each link, use the justext module to extract text into a variable called paragraphs.

Note: The recommended implementation of the jusText module for textual data extraction is typically of the form seen below:

paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)

Note (continued): In some instances, jusText can be too aggressive in removing incorrectly classified boilerplate content. As such, the following option is used, where more textual information is obtained at the cost of boilerplate accuracy:

paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
  • For each link, a variable named for_processing is initialized to store the processed paragraph data.
  • For each link, a set of block words (block_words) is created to exclude href destinations that are not of interest. These can be modified to suit the needs of your respective application.
  • For each link, for each extracted paragraph, convert the extracted paragraph object to text.
  • For each link, for each extracted paragraph, if the paragraph length is less than 50 characters, remove it from processing. This condition exists to provide the summarization model with paragraphs of sufficient length, since very short paragraphs are difficult to interpret and compact further.
  • For each link, for each paragraph that passes the first condition, omit any paragraphs that contain any of the above block words.
  • Finally, for each link, combine up to 5 of the retained paragraphs into one entry, and append that entry to the content list.

Note: Why are only 5 paragraphs joined together per entry? This is done to improve the speed of model summarization and reduce the computational time per summary task. It also helps ensure that no more than 1024 tokens exist per entry, given the summarization model's limit of 1024 tokens. Feel free to raise this parameter if the returned paragraph text is insufficiently rich in data.

  • Optional: arrange the links and content data in a pandas DataFrame for quick validation of the results so far.

The final code block, outlining the data mining of each domain URL.
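The code block for this step was published as an image in the original article, so what follows is a sketch reconstructed from the bullet points above. The helper name select_paragraphs and the sample block words are my own; the 50-character minimum and 5-paragraph maximum follow the text:

```python
def select_paragraphs(paragraphs, block_words, min_chars=50, max_paragraphs=5):
    """Filter extracted paragraph texts and combine them into one entry.

    Drops paragraphs shorter than min_chars, drops paragraphs containing
    any block word, and joins up to max_paragraphs of the remainder.
    """
    for_processing = []
    for text in paragraphs:
        if len(text) < min_chars:
            continue  #too short for the summarization model to interpret
        if any(word in text.lower() for word in block_words):
            continue  #excluded destination / unwanted content
        for_processing.append(text)
    return " ".join(for_processing[:max_paragraphs])

#In the full script the paragraphs would come from requests + justext, roughly:
#for link in clean_links:
#    response = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
#    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
#    content.append(select_paragraphs([p.text for p in paragraphs], block_words))
```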

The next step is to apply the NLP pipeline to the text content mined from each page. Recall that the NLP pipeline was instantiated with the following command:

#Summarize textual content using BART in PyTorch
bart_summarizer = pipeline("summarization")

Proceed by creating a new empty list to store all the summarized content.

summarized_descriptions = []

Remember that all text content was stored in a variable called content. We will use the same variable and iterate through each item in the content list, such that each item is processed by the NLP pipeline.

It is optional to print each item's processed version for real-time viewing in the IDE.

#Summarize content for meta descriptions
summarized_descriptions = []
for item in content:
    summary = bart_summarizer(item, min_length = 20, max_length = 50)
    print(summary)
    summarized_descriptions.append(summary)

Note: Three parameters are passed to the bart_summarizer function: the item to be processed, min_length, and max_length.

Please note that both the minimum and maximum length of the sequences to be generated are measured in tokens, not characters. Since the number of characters per token is variable, it is difficult to state a correct value for every use case.

Meta description best practices indicate a range of 120 to 160 characters; adjust the min_length and max_length parameters accordingly to meet these best practices.
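One rough way to translate that character range into token counts is to assume an average of about 4 characters per English token. This is a common rule of thumb, not an exact property of BART's tokenizer, so treat the result as a starting point to tune from:

```python
#Rough heuristic: ~4 characters per token for English text (an assumption)
AVG_CHARS_PER_TOKEN = 4

def chars_to_tokens(char_count, avg=AVG_CHARS_PER_TOKEN):
    #Convert a target character count to an approximate token count
    return char_count // avg

min_length = chars_to_tokens(120)  # ≈ 30 tokens
max_length = chars_to_tokens(160)  # ≈ 40 tokens
print(min_length, max_length)
```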

Currently, the data summarized by the deep learning model is stored in the variable summarized_descriptions. The next step is to standardize the data for easier reading and exporting.

Note that the final summaries are stored as a list of dictionaries in summarized_descriptions.


Start by unpacking the variable to extract only the values of the list of dictionaries. Store the values in a variable called meta_descriptions.

#Retrieve values from list of dictionaries in summarized_descriptions
meta_descriptions = [summary[0]["summary_text"] for summary in summarized_descriptions]
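The unpacking step can be illustrated with a hand-made stand-in mimicking the pipeline's output shape (each item is a list holding one dictionary with a summary_text key; the sample text is invented):

```python
#Hand-made stand-in for the pipeline output shape
sample_summaries = [
    [{"summary_text": "A shawarma restaurant in London, Ontario."}],
    [{"summary_text": "Fresh pitas, salads, and catering options."}],
]

#Extract only the summary_text values
sample_meta = [summary[0]["summary_text"] for summary in sample_summaries]
print(sample_meta)
```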

The process of cleaning the data will be iterative. For simplicity, the steps required to complete each task are assigned to iterations of the same variable.

Initiate general data cleaning by instantiating a variable named meta_descriptions_clean1 as an empty list. Then, for each description in meta_descriptions: clean up leading and trailing spaces, clean up excessive spaces, clean up extra spaces before punctuation marks, drop any trailing incomplete sentence, and append the cleaned description to meta_descriptions_clean1.

#General data cleaning
meta_descriptions_clean1 = []
for description in meta_descriptions:
    #Clean up leading and trailing spaces
    description = description.strip()
    #Clean up excessive spaces
    description = re.sub(' +', ' ', description)
    #Clean up punctuation spaces
    description = description.replace(' .', '.')
    #Clean up incomplete sentences
    if "." in description and not description.endswith("."):
        description = ".".join(description.split(".")[:-1]) + "."
    meta_descriptions_clean1.append(description)
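Applied to a deliberately messy example string (the sample text is invented), the cleaning steps behave as follows:

```python
import re

#Invented messy input: stray spaces and a truncated trailing sentence
description = "  Fresh  shawarma made daily .  Visit us downtown. Open sev"

description = description.strip()             #leading/trailing spaces
description = re.sub(' +', ' ', description)  #excessive spaces
description = description.replace(' .', '.')  #space before punctuation
#Drop the trailing incomplete sentence
if "." in description and not description.endswith("."):
    description = ".".join(description.split(".")[:-1]) + "."

print(description)  # Fresh shawarma made daily. Visit us downtown.
```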

Manual and Auto Truncation

This script should follow the meta description SEO best practices discussed earlier. All descriptions should be at most approximately 160 characters, only exceeding that limit if no other summary can be generated.

The purpose of this program is to output optimal truncation points for manual review, as well as to auto-truncate meta descriptions wherever possible.

To enable this feature, full sentences must be present in all meta descriptions. Ensure a period is found at the end of each description with the following command, storing the resulting values in a list called meta_descriptions_clean2.

#Add a period to all sentences (if missing)
meta_descriptions_clean2 = []
for description in meta_descriptions_clean1:
    if not description.endswith("."):
        description = description + "."
    meta_descriptions_clean2.append(description)

Now that the meta descriptions are formatted correctly, define a function that identifies every index of a desired punctuation character. In this case, the desired punctuation mark is a period, indicating the end of a sentence.

#Find the indexes of the desired punctuation character
def find_all(string, character):
    index = string.find(character)
    while index != -1:
        yield index
        index = string.find(character, index + 1)
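A quick check of the generator on a toy string (find_all is redefined here so the snippet is self-contained):

```python
def find_all(string, character):
    #Yield every index at which character occurs in string
    index = string.find(character)
    while index != -1:
        yield index
        index = string.find(character, index + 1)

print(list(find_all("a.b.c.", ".")))  # [1, 3, 5]
```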

To display viable truncation points for the user's manual review of each description, store the truncation point values in a list called truncation_points. This list will store the character indexes at which a description can be split, such that the final 160-character limit is adhered to.

#Store truncation points
truncation_points = []
character = "."
for description in meta_descriptions_clean2:
    indexes = list(find_all(description, character))
    truncation_points.append(indexes)

To auto-truncate the meta descriptions, initialize a new list for the auto-truncated values, meta_descriptions_clean3. For each description in meta_descriptions_clean2: if the length of the description exceeds 160 characters and more than one period is present, truncate the description by removing its last sentence. If truncation results in a description that doesn't end in a period, reformat it by adding one. Otherwise, if the length of the description does not exceed 160 characters, simply add it to the meta_descriptions_clean3 list.

#Auto truncate
meta_descriptions_clean3 = []
character = "."
for description in meta_descriptions_clean2:
    if len(description) > 160 and description.count(character) > 1:
        split = description.split(character)[:-2]
        description = character.join(split)
        if not description.endswith("."):
            description = description + "."
    meta_descriptions_clean3.append(description)
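A worked example of the auto-truncation logic on an over-length description (the text is invented and padded to exceed 160 characters):

```python
#Invented description: 3 sentences, longer than 160 characters
description = (
    "Our family restaurant has served authentic shawarma in London, Ontario "
    "for over twenty years. We use fresh ingredients and traditional recipes "
    "every single day. Catering is available."
)

character = "."
if len(description) > 160 and description.count(character) > 1:
    #Drop the last sentence (and the trailing empty split), then restore the period
    split = description.split(character)[:-2]
    description = character.join(split)
    if not description.endswith("."):
        description = description + "."

print(description)  #the final "Catering is available." sentence has been removed
```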

Next, verify the character count of both the non-truncated and auto-truncated meta descriptions.

#Verify length adherence for non-truncated descriptions
len_meta_descriptions = []
for description in meta_descriptions_clean2:
    len_meta_descriptions.append(len(description))

#Verify length adherence for auto-truncated descriptions
truncated_len_meta_descriptions = []
for description in meta_descriptions_clean3:
    truncated_len_meta_descriptions.append(len(description))

Finally, organize all the relevant variables into a pandas dataframe for easy viewing.

#Organize results into dataframe
df = pd.DataFrame({'link': clean_links, 'content': content, 'meta_descriptions': meta_descriptions_clean2, 'description_length': len_meta_descriptions, 'truncation_points': truncation_points, 'truncated_descriptions': meta_descriptions_clean3, 'truncated_length': truncated_len_meta_descriptions})
df
Viewing the final pandas DataFrame of all outputs.
