Interesting thing about BM25 in Sphinx Search

I recently faced strange behaviour of Sphinx Search. As I investigated, the problem turned out to be in the default ranking mode SPH_RANK_PROXIMITY_BM25 which is using the BM25 algorithm.

Update: according to the latest documentation SPH_RANK_SPH04 solves the problem. Anyway I suggest to read the article, it was interesting investigation.

Here's the quote from Sphinx documentation:

Statistical rank is based on classic BM25 function which only takes word frequencies into account. If the word is rare in the whole database (ie. low frequency over document collection) or mentioned a lot in specific document (ie. high frequency over matching document), it receives more weight. Final BM25 weight is a floating point number between 0 and 1.

So it states that a word which is mentioned a lot receives more weight. But it's not always true.

Let's say we have the following table with some words repeated in every row:

mysql>select * from titles;

| id | title |

| 1 | Test Category - Test Article |
| 2 | Test CategoryA |
| 3 | Test CategoryB |

Please notice that "Test" occurs two times in document #1 and only one time in documents #2 and #3.

So we can expect that a search for the word "Test" will match #1 as the first result.

Let's test it using SphinxQL (Sphinx query language):

mysql> select * from test1 where match('Test') ;

| id | weight |

| 2 | 1319 |
| 3 | 1319 |
| 1 | 1252 |

Hmm, something's wrong here, document #1 got lower weight than documents #2 and #3.

As it was investigated the key problem is that the word "Test" occurs in each document. See the quote from Andrey Aksenoff (creator of Sphinx Search):

This is actually by BM25 design. It penalizes (!) the keywords that are overly frequent, ie. occur in more than 50% of the collection documents.

In my case the word "Test" occurred in 100% of documents and that's why was penalized.

A possible workaround of this can be to use another ranking mode which doesn't use BM25, i.e. SPH_RANK_WORDCOUNT or SPH_RANK_MATCHANY:

SPH_RANK_WORDCOUNT:

mysql> select * from test1 where match('Test') option ranker=wordcount;

| id   | weight |

|    1 |      2 |
|    2 |      1 |
|    3 |      1 |

SPH_RANK_MATCHANY:

mysql> select * from test1 where match('Test') option ranker=MATCHANY;

| id   | weight |

|    1 |      1 |
|    2 |      1 |
|    3 |      1 |

It looks like SPH_RANK_WORDCOUNT solves our problem.

In some cases it can be a solution, but in general it can't, because for 99% of queries we might want to use as smart ranking algorithm as default SPH_RANK_PROXIMITY_BM25 ranking mode.