Sphinx in Action: Fuzzy matching and 2nd pass query

This post is part of the “Sphinx in Action” series.

Many customers that we have helped with integrating search into their applications wanted their search to be more intelligent than just strictly matching a query with documents.
There are many ways to do this. Sphinx makes it very easy, as fuzzy matching is included out of the box. It consists of two main components:

  1. Quorum operator:
    "computing and technology news"/2
    

    This means that at least two of the quoted words must occur in the document, i.e. this query would find a text containing “computing” and “news” as well as a text containing “technology” and “news”.

  2. Proximity search operator:
    "computing news"~3
    

    This means that there can be fewer than N non-matching words between the words from the query, where N is the number after the tilde. Here are a few examples.
    If the text is “a b c d e f g h”:

    • “a h”~7 would find it, since “b c d e f g” are 6 non-matching words and 6 is less than 7;
      however, “a h”~6 wouldn’t find the text
    • “a d h”~6 would find it (the 5 non-matching words are “b c e f g”)
    • “a d h”~5 wouldn’t
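
To make the counting rules above concrete, here is a small pure-Python sketch that mimics them. It is not how Sphinx implements these operators: `quorum_match` and `proximity_match` are hypothetical helper names, and the proximity check only models the "fewer than N non-matching words" rule for texts in which each word occurs at most once. It makes no claim about how Sphinx treats word order or repeated words.

```python
def quorum_match(query_words, text, threshold):
    """True if at least `threshold` distinct query words occur in the text."""
    text_words = set(text.lower().split())
    hits = sum(1 for w in set(query_words) if w.lower() in text_words)
    return hits >= threshold


def proximity_match(query_words, text, distance):
    """True if all query words occur and fewer than `distance`
    non-matching words lie inside the span that covers them.

    Simplification (an assumption of this sketch): only the first
    occurrence of each word in the text is considered.
    """
    first_pos = {}
    for i, word in enumerate(text.lower().split()):
        first_pos.setdefault(word, i)
    try:
        positions = [first_pos[q.lower()] for q in query_words]
    except KeyError:
        return False  # some query word is missing from the text
    span = max(positions) - min(positions) + 1
    non_matching = span - len(query_words)
    return non_matching < distance


text = "a b c d e f g h"
print(proximity_match(["a", "h"], text, 7))       # 6 non-matching words < 7
print(proximity_match(["a", "h"], text, 6))       # 6 is not < 6
print(proximity_match(["a", "d", "h"], text, 6))  # 5 non-matching words < 6
print(quorum_match(["computing", "and", "technology", "news"],
                   "latest technology news today", 2))
```

Running it on the example text reproduces the four proximity outcomes listed above.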

Many of our customers like second-pass logic: the first query to Sphinx is strict, and if it finds nothing or returns too few results, a second, less strict query is issued. More advanced logic with 3rd and 4th passes is also possible. It depends on your requirements: whether you want your users to always find at least something or, on the contrary, to find only what exactly matches their query.
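
A minimal sketch of this second-pass logic might look as follows. It assumes a hypothetical `search` callable that takes a query string in extended syntax and returns a list of matches; `two_pass_search`, `min_results` and `quorum` are illustrative names, not part of any Sphinx API. Real code would also need to escape special characters in user input.

```python
def two_pass_search(search, words, min_results=1, quorum=2):
    """Run a strict query first; fall back to a quorum query if needed.

    `search` is a hypothetical callable: query string -> list of matches.
    Returns (matches, name_of_the_pass_that_produced_them).
    """
    # 1st pass: all words must match (space-separated words are ANDed).
    strict_query = " ".join(words)
    results = search(strict_query)
    if len(results) >= min_results:
        return results, "strict"

    # 2nd pass: relax to "at least `quorum` of the words must match".
    threshold = min(quorum, len(words))
    relaxed_query = '"%s"/%d' % (" ".join(words), threshold)
    return search(relaxed_query), "relaxed"
```

The same pattern extends to a 3rd or 4th pass by adding further, looser queries after the quorum one.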

Sometimes it makes sense to do it the other way around and make the query stricter. For example, if your default matching strategy is ‘any word should match’ and your application doesn’t have any extended syntax to allow users to specify the best query themselves, it makes sense to first try the ‘all words should match’ strategy or even the ‘phrase should match’. This might significantly increase the quality of the search.
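
The stricter-first idea can be sketched as a ladder of queries tried from strictest to loosest. The sketch assumes the extended query syntax, where a quoted string is a phrase, space-separated words must all match, and `|` means OR. `query_ladder` and `first_pass_with_results` are hypothetical names, and `search` is again an assumed callable that returns a list of matches.

```python
def query_ladder(words):
    """Build (name, query) pairs ordered from strictest to loosest."""
    joined = " ".join(words)
    return [
        ("phrase", '"%s"' % joined),    # the exact phrase should match
        ("all", joined),                # all words should match
        ("any", " | ".join(words)),     # any word should match
    ]


def first_pass_with_results(search, words, min_results=1):
    """Return (pass_name, matches) for the first pass with enough results."""
    for name, query in query_ladder(words):
        results = search(query)
        if len(results) >= min_results:
            return name, results
    return "none", []
```

Because the loop stops at the first pass that returns enough results, users who type a precise phrase get precise matches, while everyone else still falls through to the default ‘any word’ behaviour.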

In some applications it makes sense to parallelize the 1st and 2nd pass queries (and any further passes). This can be done easily with Sphinx multi-queries. The idea is to issue the 2nd pass query up front, so that if the 1st pass query finds nothing, the 2nd pass results are already available; since the queries were run simultaneously, this improves response time. However, this depends on several things, and you should be careful, as it can sometimes reduce performance. The things to consider are:

  • What hardware are you using? If it’s not powerful enough to handle two queries at once, or the response time for the two queries together is close to double that of a single query, it makes little sense to use this technique.
  • What is your load? If your Sphinx is already heavily loaded, you will get a worse response time.
  • What are your statistics? If the 1st pass query returns results for 99% of queries, there’s little point in running the 2nd pass query along with it; this will just waste resources.
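
As a rough sketch of the batched approach, the function below queues both passes and sends them in one round trip. It assumes a client object that exposes `AddQuery(query, index)` and `RunQueries()` in the style of the Python client (`sphinxapi.py`) shipped with Sphinx, and that each result is a dict with a `matches` list; check these against your client version. `search_both_passes` is a hypothetical name.

```python
def search_both_passes(client, strict_query, relaxed_query, index="*"):
    """Send the strict and relaxed queries in one batch.

    `client` is assumed to behave like a sphinxapi.py SphinxClient:
    AddQuery() queues a query, RunQueries() returns one result dict
    per queued query (or a falsy value on failure).
    Returns (pass_name, matches).
    """
    client.AddQuery(strict_query, index)
    client.AddQuery(relaxed_query, index)
    results = client.RunQueries()
    if not results:
        raise RuntimeError("multi-query batch failed")

    strict, relaxed = results
    # Prefer the strict results; the relaxed ones are already available
    # as a fallback, with no second round trip.
    if strict.get("matches"):
        return "strict", strict["matches"]
    return "relaxed", relaxed.get("matches", [])
```

Whether this pays off depends on the three points above, so it is worth measuring response times with and without batching under your real load before adopting it.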
