Over the past couple of months I’ve been asked the same question multiple times:
How do you determine related posts?
My default answer has been: we base this on multiple factors, including title, content, categories and tags. While this is true, I thought it would be good to write a detailed post on how we determine which posts are related to each other.
Post Indexation
When you install Related Posts for WordPress, you’ll be taken to a wizard. Before you can link related posts to each other, the plugin needs to set up an index of your content. This indexation process is what lays the foundation for the plugin to recognize strong relations between posts.
It’s what you write about
Instead of only checking which categories a post is in and which tags are attached to it, Related Posts for WordPress scans your actual content by splitting it into words.
We scan, count and add the words to a list. We add words from the following items:
- The title of the post
- The content/body of the post
- The title of internal posts you’ve linked to within your content
- The post’s categories
- The post’s tags
- The post’s custom taxonomies (Premium only)
We multiply the counts of these words by a factor we call weight; you can read more on weights and how to change them here. By using weights we can make words from certain items more important than others.
We did a lot of testing to arrive at the following weights, but you’re free to change them to whatever you want, of course.
Title | 80 |
Content | 1 |
Links in content | 20 |
Category | 20 |
Tag | 10 |
Custom Taxonomies | 15 |
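To make the weighting a bit more concrete, here is a rough sketch in Python. It is only an illustration, not the plugin’s actual PHP code: the dictionary keys and the function name are made up for this example, and it assumes a word’s count in each item is simply multiplied by that item’s weight and then added up.

```python
from collections import Counter

# Default weights from the table above. The keys are hypothetical names
# chosen for this sketch; they are not the plugin's option names.
WEIGHTS = {
    "title": 80,
    "content": 1,
    "links": 20,
    "category": 20,
    "tag": 10,
    "custom_taxonomy": 15,
}


def weighted_word_counts(sources, weights=WEIGHTS):
    """Count words per item and scale every count by that item's weight.

    `sources` maps an item name (a key of `weights`) to a list of words.
    """
    totals = Counter()
    for source, words in sources.items():
        for word, count in Counter(words).items():
            totals[word] += count * weights[source]
    return totals


post = {
    "title": ["wordpress", "plugins"],
    "content": ["wordpress", "wordpress", "themes"],
    "tag": ["plugins"],
}
counts = weighted_word_counts(post)
print(counts["wordpress"])  # 82: 80 from the title plus 2 from the content
print(counts["plugins"])    # 90: 80 from the title plus 10 from the tag
```

A single mention in the title outweighs dozens of mentions in the body, which is exactly the effect the default weights are meant to have.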
Cleaning up
Before we actually add words to our list, we filter out some words we don’t want to add. We start by stripping out all HTML tags; although essential to your post, they’re not what the post is about. Shortcodes are also stripped out. We might parse the shortcode output in the future, but for the sake of performance we’re just stripping them out for now.
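As an illustration of this clean-up step, here is a minimal Python sketch. It is not how the plugin does it internally; the two regular expressions and the `clean_text` name are assumptions for this example. Note that this simplified version only removes the shortcode tags themselves and leaves any text between an opening and a closing shortcode in place.

```python
import re

# Simplified patterns, assumed for this sketch only.
TAG_RE = re.compile(r"<[^>]+>")                     # <p>, </strong>, <img ... />
SHORTCODE_RE = re.compile(r"\[/?[A-Za-z][^\]]*\]")  # [gallery ids="1,2"], [/caption]


def clean_text(raw):
    """Strip HTML tags and shortcode tags, then split the rest into lowercase words."""
    text = TAG_RE.sub(" ", raw)
    text = SHORTCODE_RE.sub(" ", text)
    return re.findall(r"\w+", text.lower())


html = '<p>Hello <strong>world</strong></p> [gallery ids="1,2"] bye'
print(clean_text(html))  # ['hello', 'world', 'bye']
```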
Plain ASCII
The plugin adds all words in plain ASCII, meaning we replace all ‘special’ characters with their plain equivalents. I’ve run into a case where a post was written about Pokémon but the author wrote Pokemon instead of Pokémon half of the time. While they obviously meant the same little creatures, the plugin recognized them as separate words. On top of that, depending on your database collation, your database may see them as the same word, leading to unique identifier issues.
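One common way to do this kind of conversion, sketched here in Python, is to decompose each character into a base letter plus accent marks and then drop everything that isn’t ASCII. This is an assumed technique for illustration; the plugin may well use a different conversion table. The function name is made up.

```python
import unicodedata


def to_ascii(word):
    """Replace accented characters with their plain ASCII base letters."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", word)
    # Encoding with "ignore" drops the accent marks and any other non-ASCII
    # character. Letters without a decomposition (such as "ø") are dropped
    # entirely, so a real implementation needs an extra mapping for those.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Pokémon"))                         # Pokemon
print(to_ascii("Pokémon") == to_ascii("Pokemon"))  # True
```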
Stop words
The next step is to filter out all stop words. As Wikipedia states:
In computing, stop words are words which are filtered out before or after processing of natural language data (text).
Simply put, using the words ‘like’ or ‘the’ a lot in your post doesn’t make it related to other posts that contain a lot of ‘like’ or ‘the’ words. The plugin aims to create strong relations between posts, based on what each post is actually about. So we strip out these ‘stop words’, leaving only words that are relevant to your post.
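In code, stop word filtering boils down to a lookup against a fixed list. Here is a tiny Python sketch; the word list below is a made-up sample for this example and is far shorter than any real stop word list.

```python
# A tiny sample list, assumed for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "like", "of", "to", "in", "is", "it"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop word list."""
    return [word for word in words if word not in stop_words]


words = ["the", "plugin", "is", "like", "magic"]
print(remove_stop_words(words))  # ['plugin', 'magic']
```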
Keep the post indexation relevant
If a word occurs fewer than 3 times in your content, it’s safe to assume it’s not that important in your post. Therefore we remove all words that occur fewer than 3 times from the list, making the list shorter and reducing the number of database rows we need to insert.
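A minimal Python sketch of this filter could look like the following. The function name is hypothetical, and the sketch leaves open whether the threshold is applied to raw or weighted counts; it simply filters whatever counts it is given.

```python
def drop_rare_words(counts, minimum=3):
    """Keep only the words whose count reaches the minimum."""
    return {word: count for word, count in counts.items() if count >= minimum}


counts = {"wordpress": 7, "plugin": 3, "yesterday": 1, "coffee": 2}
print(drop_rare_words(counts))  # {'wordpress': 7, 'plugin': 3}
```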
Relative amounts
Instead of storing the number of times a word occurs in the list directly in the database, we calculate a relative importance. This relative importance is based on the total number of words in the list. For example, if you use the word WordPress 2 times in a post with 50 words, it’s a lot more important than if you use it 2 times in a post of 1000 words.
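To put numbers on that example, here is a short Python sketch. It assumes the relative importance is simply a word’s count divided by the total of all counts in the list; the exact formula the plugin uses may differ, and the function name is made up.

```python
def relative_weights(counts):
    """Turn absolute counts into shares of the total (assumed formula)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# "other" stands in for all remaining words of the post.
short_post = {"wordpress": 2, "other": 48}   # 50 words in total
long_post = {"wordpress": 2, "other": 998}   # 1000 words in total

print(relative_weights(short_post)["wordpress"])  # 0.04
print(relative_weights(long_post)["wordpress"])   # 0.002
```

The same two mentions carry twenty times more weight in the short post than in the long one.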
Getting related posts
Once the whole indexation process is done, we can fetch related posts from the database. This happens in step 2 of the wizard (step 3 in the Premium plugin), the automatic linking of all existing posts. Related Posts for WordPress only runs this database lookup in the wizard or when you save a post. When displaying related posts on your website, cached links are loaded. This prevents your website from having to determine related posts on each page load, keeping your website fast.
Without pasting the whole SQL query here, I’ll try to explain what the query does step by step.
- Find all indexed words of other posts that are also indexed for our current post
- Calculate, per post, the total sum of relative weights of the words that overlap with the current post
- Sort the posts by the total sum from the previous step
Combining this into a single query results in a set of related posts.
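To give an idea of what such a query can look like, here is a small self-contained sketch using Python’s built-in `sqlite3` module. This is not the plugin’s real query or schema: the `word_index` table, its columns and the sample rows are all made up for this example, and it assumes the score is the sum of the other post’s relative weights for the overlapping words.

```python
import sqlite3

# Hypothetical index table: one row per (post, word) with its relative weight.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_index (post_id INTEGER, word TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO word_index VALUES (?, ?, ?)",
    [
        (1, "wordpress", 0.5), (1, "plugin", 0.3),
        (2, "wordpress", 0.4), (2, "theme", 0.6),
        (3, "plugin", 0.2), (3, "wordpress", 0.3),
        (4, "cooking", 0.9),
    ],
)

QUERY = """
SELECT other.post_id, SUM(other.weight) AS score
FROM word_index AS other
JOIN word_index AS cur ON cur.word = other.word   -- step 1: overlapping words
WHERE cur.post_id = ? AND other.post_id != ?
GROUP BY other.post_id                            -- step 2: sum per post
ORDER BY score DESC                               -- step 3: best match first
LIMIT ?
"""


def related_posts(conn, post_id, limit=5):
    """Return the ids of the posts that share the most weighted words."""
    rows = conn.execute(QUERY, (post_id, post_id, limit)).fetchall()
    return [row[0] for row in rows]


print(related_posts(conn, 1))  # [3, 2]
```

Post 3 shares two words with post 1 and post 2 shares only one, so post 3 ranks first. Post 4 has no words in common and doesn’t show up at all.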
That’s it
And that’s how Related Posts for WordPress determines what posts are related to each other. If you have any questions or suggestions for improvement, please let me know by leaving a comment below.