Finding Related Posts in Jekyll

Jekyll's LSI-based related posts can be slow, and they did not produce satisfying results for my site. I therefore wrote my own small plug-in, tailored to my needs, that does the same job well and fast.

LSI stands for Latent Semantic Indexing and is a technique for measuring the similarity of two texts. The Jekyll docs say that it is very slow, and I suspected it of being responsible for Jekyll's poor performance when generating my site. Unfortunately, LSI was not the culprit: Jekyll is still slow, even after turning LSI off. Still, I had the feeling that the similarity between posts on my site could be measured better and faster with the information I provide myself (links, tags and categories) than with math.

The idea is simple. The similarity between two posts is estimated with a point system using factors that are cheap to calculate. If two posts share a link (internal or outgoing), this counts as one point. Each common tag contributes two points. If they are in the same category, that counts as three points. Finally, if they are linked to each other, five points are added. The more points, the more similar two pages are considered.
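
The point system can be sketched in a few lines of Ruby. This is only an illustration, not the code of the plug-in: posts are represented as plain hashes, and the keys :url, :links, :tags and :category are assumptions made for the example. I also assume here that "linked to each other" means a link in either direction.

```ruby
# Hypothetical post records; a real plug-in would read these values from Jekyll.
def similarity(a, b)
  score = 0
  # One point per shared link (internal or outgoing).
  score += (a[:links] & b[:links]).size
  # Two points per common tag.
  score += 2 * (a[:tags] & b[:tags]).size
  # Three points if both posts are in the same category.
  score += 3 if a[:category] && a[:category] == b[:category]
  # Five points if one post links directly to the other (assumed: either direction).
  score += 5 if a[:links].include?(b[:url]) || b[:links].include?(a[:url])
  score
end

post_a = { url: "/a", links: ["/b", "http://example.com/x"],
           tags: ["jekyll", "ruby"], category: "web" }
post_b = { url: "/b", links: ["http://example.com/x"],
           tags: ["jekyll"], category: "web" }

puts similarity(post_a, post_b) # 1 + 2 + 3 + 5 = 11
```

All four factors are simple set operations on data that is already available for every post, which is why the score is cheap to calculate.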

An ideal solution would probably mix the two approaches. A direct link between two posts strongly suggests that they are related, something that LSI cannot find out. On the other hand, LSI may find relationships that are not reflected in common links or categories. However, there is no simple way to retrieve the results of LSI from a Jekyll hook, and I therefore do not consider the LSI results at all.

Another factor that had to be taken into account was that my site is multilingual. The array of similar posts should only contain references to posts in the same language. I had already created a Jekyll hook that precomputes tag and category counts. I extended that hook to also calculate the similarity between two posts based on the criteria mentioned above.
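
The language restriction could be handled roughly as follows. Again this is a sketch under assumptions, not the actual hook: posts are plain hashes with hypothetical :id and :lang keys, and the block passed to related_by_language stands in for whatever scoring function is used.

```ruby
# Groups posts by language and scores only pairs within the same group.
# Returns a hash mapping each post id to a list of [score, other_post] pairs.
def related_by_language(posts)
  related = Hash.new { |hash, key| hash[key] = [] }
  posts.group_by { |post| post[:lang] }.each_value do |same_lang|
    # combination(2) visits every unordered pair exactly once.
    same_lang.combination(2) do |a, b|
      score = yield(a, b)
      related[a[:id]] << [score, b]
      related[b[:id]] << [score, a]
    end
  end
  related
end

posts = [
  { id: "/en/first",  lang: "en" },
  { id: "/en/second", lang: "en" },
  { id: "/de/erster", lang: "de" }
]
related = related_by_language(posts) { |_a, _b| 1 } # dummy score for the example
```

Grouping first also saves work, because posts in different languages are never compared at all.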

As a result you can get the list of similar posts as follows:

  {% for related in site.related[page.id] %}
    <li><a href="{{related.url}}">{{related.title | escape}}</a></li>
  {% endfor %}

The hash site.related uses the page id as the key, and each list of related posts is limited to 10 entries. You may also want to restrict the list to posts that have a similarity score of at least one (or n) points. Both can easily be changed in the source code of the plug-in.
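
Both limits amount to a filter and a cut-off on the scored candidates. A minimal sketch, assuming the candidates for one post are available as [score, post] pairs; the names MAX_RELATED, MIN_SCORE and top_related are made up for this example and are not taken from the plug-in:

```ruby
MAX_RELATED = 10 # maximum number of related posts per page
MIN_SCORE   = 1  # minimum number of points a post needs to be listed

# Keeps only candidates with enough points, orders them by descending
# score and returns at most `max` posts.
def top_related(candidates, max: MAX_RELATED, min_score: MIN_SCORE)
  candidates
    .select  { |score, _post| score >= min_score }
    .sort_by { |score, _post| -score }
    .first(max)
    .map     { |_score, post| post }
end

candidates = [[0, "unrelated"], [5, "directly linked"], [2, "one common tag"]]
p top_related(candidates) # ["directly linked", "one common tag"]
```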
