Python for SEOs

by Barbara Coelho | May 21, 2020 | 0 Comments

Programming can be thought of as a communication vehicle among ourselves, developers, and computers to solve difficult problems, but oftentimes we outsource our coding needs. 

Why would we want to code if we can hire someone to do it for us? In reality, coding is not just an extra skill to have under your belt; it is a powerful tool that can open the door to endless possibilities. Lacking programming skills can actually become a constraint, as can poor vision. 

For Hamlet, it wasn’t until his late 20s that he realized he needed glasses. When he finally got a pair, he was awestruck—Hamlet couldn’t believe he had been missing out on crisp, clear vision that entire time. 

Coding is the same way—you don’t know what you’re missing until you learn how to do it. Not only will you be able to see problems in a new light, but you’ll also spot multiple ways to solve them. 

Another advantage of learning this skill is the ability to bring new perspectives to developers, who write solid code but may sometimes hit a wall when they face a tough problem. 

In this context, you can bring fresh ideas to the table that benefit your whole team. 

Here we’ll be covering practical SEO applications of Python for:

  • data extraction
  • preparation
  • analysis and visualization
  • machine learning
  • deep learning

Watch Hamlet’s presentation: 

View Hamlet’s slide deck:

Python for SEO from Hamlet Batista

 

Hamlet’s Colab Notebook

In this exercise, Hamlet focuses on a recent experience he had where a client moved from Ecommerce V3 to Shopify. Below is the exercise’s output, where you can pinpoint the problem and which pages were affected by it. 

Solution Part 1

The first step in this phase is to pull the data from Google Analytics. One thing to take into consideration when building the request URL in Google Query Explorer is the access token, which authorizes the request and expires in about an hour. Another limitation of using the Query Explorer alone is that each request returns at most about 10,000 rows, so this code iterates and paginates through the result set to build out a complete data frame, making it much simpler on the user side. 
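The pagination described above can be sketched as follows. The endpoint, parameter names, and page size are assumptions based on the Core Reporting API (v3); obtaining and refreshing the short-lived token is left to the caller.

```python
API_URL = "https://www.googleapis.com/analytics/v3/data/ga"  # Core Reporting API v3
PAGE_SIZE = 10_000  # the API returns at most ~10k rows per request


def fetch_all_rows(fetch_page, page_size=PAGE_SIZE):
    """Collect every row by paginating with start-index until a short page arrives."""
    rows, start_index = [], 1
    while True:
        page = fetch_page(start_index)
        rows.extend(page)
        if len(page) < page_size:  # last page reached
            return rows
        start_index += page_size


def ga_page_fetcher(session, params):
    """Build a fetch_page callable; `session` is an authorized HTTP session
    (the access token inside `params` expires in roughly an hour)."""
    def fetch_page(start_index):
        resp = session.get(API_URL, params={**params,
                                            "start-index": start_index,
                                            "max-results": PAGE_SIZE})
        resp.raise_for_status()
        return resp.json().get("rows", [])
    return fetch_page
```

Because the token expires quickly, long extractions should refresh it before the pagination loop starts.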

Next, we will store the data in a Pandas data frame.
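A minimal sketch of that step, using made-up rows in the shape the API returns (a list of lists of strings):

```python
import pandas as pd

# Hypothetical GA rows: each row is [page path, new users]; metrics come back as strings
rows = [["/collections/shoes", "120"],
        ["/products/red-sneaker", "45"]]

df = pd.DataFrame(rows, columns=["page", "new_users"])
df["new_users"] = pd.to_numeric(df["new_users"])  # cast the metric to a number
```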

We will follow up by performing an analysis to isolate the pages that lost traffic, similar to the way joins are used in SQL databases. Using an outer join, we can compare how many new users visited each page:

  • difference > 0, winners
  • difference = 0, no change
  • difference < 0, losers

Since it’s easy to make mistakes here, Hamlet added a line of code to ensure that the totals add up. 
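Here is a small, self-contained sketch of the outer join and the winners/no-change/losers split, with made-up before/after numbers; the final assert is the kind of totals check Hamlet mentions:

```python
import pandas as pd

# Hypothetical new-user counts per page before and after the migration
before = pd.DataFrame({"page": ["/a", "/b", "/c"], "new_users": [100, 50, 10]})
after = pd.DataFrame({"page": ["/a", "/b", "/d"], "new_users": [120, 50, 5]})

# An outer join keeps pages that exist on only one side (dropped or newly created)
merged = before.merge(after, on="page", how="outer",
                      suffixes=("_before", "_after")).fillna(0)
merged["difference"] = merged["new_users_after"] - merged["new_users_before"]

winners = merged[merged["difference"] > 0]
no_change = merged[merged["difference"] == 0]
losers = merged[merged["difference"] < 0]

# Sanity check: the three groups must account for every page
assert len(winners) + len(no_change) + len(losers) == len(merged)
```

From here, `merged.to_csv(...)` or `merged.to_excel(...)` exports the table for sharing.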

Now here are the results, a reflection of all your time and hard work. Pandas also makes it easy to export this data to CSV and Excel files.

Solution Part 2

In this phase, we’ll focus on getting more useful output by determining the types of pages that lost traffic by using Regex. Since we moved from one platform to another where the URLs are completely different, we’ll have to crawl the original URL and compare it to the final URL, so that we’re comparing apples to apples.

First, we’ll crawl the old pages to follow their redirects. It’s important to make sure the status code is one of 301, 302, or 307; checking the whole 300–399 range can fail because some 3xx codes (such as 304) mean something completely different. Adding a sleep between requests will keep you on good terms with the developers, since you don’t want to bog the site down while crawling it.
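A sketch of that crawl. The `fetch` callable is an assumption standing in for whatever HTTP client you use (for example `requests.get(url, allow_redirects=False)` returning the status and `Location` header), so the explicit status check and the polite sleep stay visible:

```python
import time

# Only these 3xx codes are treated as redirects; others (e.g. 304) mean something else
REDIRECT_CODES = {301, 302, 307}


def resolve_redirects(urls, fetch, delay=1.0):
    """Map each old URL to its redirect target.

    `fetch` is any callable returning (status_code, location) without
    following redirects. The sleep between requests keeps the crawl
    gentle on the server.
    """
    mapping = {}
    for url in urls:
        status, location = fetch(url)
        if status in REDIRECT_CODES:
            mapping[url] = location
        time.sleep(delay)
    return mapping
```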

Next, we will group the pages using regular expressions. The regexes capture the URL patterns of Shopify categories.
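The grouping might look like this. The patterns are hypothetical stand-ins for Shopify’s `/collections/` and `/products/` URL shapes and should be adapted to the actual site:

```python
import re

# Hypothetical URL patterns for common Shopify page types (first match wins)
PATTERNS = [
    ("collection", re.compile(r"^/collections/[^/]+/?$")),
    ("product",    re.compile(r"/products/[^/]+/?$")),
    ("page",       re.compile(r"^/pages/[^/]+/?$")),
]


def page_group(path):
    """Return the page type for a URL path, or 'other' if nothing matches."""
    for name, pattern in PATTERNS:
        if pattern.search(path):
            return name
    return "other"
```

Applied to the losers data frame, this yields a page-type column you can aggregate traffic loss by.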

Here’s the beautiful output, where we can see that the collections were the most significant loss for the site. Despite being a lot of work, the results prove to be worth it.

Note that this Winners vs. Losers analysis can also be used when you hit a home run with a client. Sometimes you may not know exactly what you did to enable such great success, so this may help you to understand better and pinpoint the exact solution.

Solution Part 3

Regex works best for smaller sites whose URLs follow clear patterns; machine learning proves more useful when handling more data. Instead of matching page groups with regex, we will match them automatically with the help of BeautifulSoup and Scikit-learn.

First, we’ll collect training data. A bottom-up approach, the way engineers typically tackle such problems, involves analyzing class IDs, HTML tags, and so on to find patterns in the data.

However, for someone who isn’t an engineer, there is a simpler solution. 

In our exercise, Hamlet observes that product pages have big images, while category listing pages have smaller images, but more of them, which he decides to use as his training dataset.

Using this simple idea, he builds an easily verifiable model. Trying to get exact image dimensions from HTML may not be accurate, so he approximated by looking at the size of each image (width × height).
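One way to sketch that feature extraction with BeautifulSoup. The width/height attributes are only an approximation, as noted above, and images sized purely in CSS will slip through; the helper names are illustrative:

```python
from bs4 import BeautifulSoup


def image_areas(html):
    """Approximate each image's size as width * height from its HTML attributes."""
    soup = BeautifulSoup(html, "html.parser")
    areas = []
    for img in soup.find_all("img"):
        try:
            areas.append(int(img["width"]) * int(img["height"]))
        except (KeyError, ValueError):
            continue  # skip images without usable width/height attributes
    return areas


def page_features(html):
    """Two simple features per page: how many sized images, and the largest area."""
    areas = image_areas(html)
    return {"num_images": len(areas), "max_area": max(areas, default=0)}
```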

Once we compute the image sizes, we corral them into 50 different categories by using bins. 
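The binning step is a single call in pandas; the area values here are made up for illustration:

```python
import pandas as pd

# Hypothetical image areas (width * height) for a handful of pages
areas = pd.Series([240_000, 10_000, 90_000, 500_000, 2_500])

# Corral the continuous areas into 50 equal-width bins, labelled 0..49
binned = pd.cut(areas, bins=50, labels=False)
```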

That’s the hardest part: organizing the data in the right way. The machine learning itself is really only the one line of code at the end. 

Then, we use a grid search with standard parameters to find the best-fit model.
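A minimal grid-search sketch with scikit-learn. The estimator, parameter grid, and synthetic data are illustrative stand-ins, not the exact ones from the talk:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data: in the exercise this would be the binned image features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Cross-validated search over a small, hypothetical parameter grid
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, y)
best_model = grid.best_estimator_  # the best-fit model across the grid
```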

But wait… We can do better with Deep Learning!

Solution Part 4

In this phase, we’ll learn more granularly what on the page caused the problem. 

Although deep learning is typically used with natural language processing, you can use computer vision as well. 

Here we’ll be using TensorFlow and Keras.

This phase can be done automatically thanks to what Information Bottleneck theory describes: a deep network learns a compressed representation of its input that keeps only the information relevant to the task at hand. 

In that compressed format, the computer can clearly dissect elements, learn the classifications of the data, and use them to analyze which pages are performing poorly.

The steps include labeling a few thousand web screenshots with the visual features you find essential, training a computer vision model to predict more granular page groups, and finding the best model. 
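Those steps might start from a small Keras classifier like this sketch. The input size, layer choices, and number of page groups are assumptions; a real model would be trained on the labelled screenshots:

```python
from tensorflow import keras

NUM_GROUPS = 5  # hypothetical number of granular page groups

# Minimal sketch of a computer-vision classifier for page screenshots
model = keras.Sequential([
    keras.layers.Input(shape=(224, 224, 3)),      # screenshot resized to 224x224 RGB
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(NUM_GROUPS, activation="softmax"),  # one score per page group
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In practice, fine-tuning a pretrained model (or a service like Google AutoML, listed in the resources below) usually beats training from scratch on a few thousand screenshots.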

Programming can be useful to solve interesting problems. Consider these tips to help you get started:

  • work backward from problems
  • isolate the tools needed
  • use Stack Overflow to find your solution

Additional resources:

Python for Data Science Cheat Sheet

Intro to Pandas for Excel Super Users

Pandas Cheat Sheet

A Simple Cheat Sheet for Web Scraping with Python

An SEO’s guide to XPath

Scikit-learn Cheat Sheet

Efficiently Searching Optimal Tuning Parameters

Cross Validation and Grid Search For Model Selection in Python

Google AutoML

TensorFlow Object Detection API

 

Barbara Coelho

Social Media Manager
