Sphinx in Action: Top related queries

This is a post from a series of posts called "Sphinx in Action". See the whole series by clicking here.

Many of our customers want to improve their search engine by allowing users to refine their searches by showing them top or related queries from previous days or weeks. When they click on these links, they are taken to a corresponding search results page.

This works by storing all user queries to your database and then indexing them for the last N days with Sphinx. Then you simply make a query against the index to find the most popular related queries that match (or are similar) to the current query.

Here are a few tricks to add these features in the most efficient way:

Use the query text hash as the id, this will allow you to avoid grouping and therefore decrease the response time, Sphinx will ignore duplicated IDs during the indexing stage or when you insert (if you are using a real time index).
Some similar queries might differ slightly, but not significantly (eg. 'MYSQL innodb', 'mysql innodb' and 'mysql, innodb' mean the same). You probably want all of them to be merged into one. To do this, you need to normalize the texts. This is where you can use the 'call keywords()' command (which is named BuildKeywords() in Sphinx API), this can help you tokenize the query exactly the same way Sphinx will tokenize it while indexing, all you need to do is concatenate the returned keywords and you will have the query cleaned from everything insignificant.

So your index structure would be like this:

id: hash (normalized query)
field: normalized query
attr: count
attr: id in db

You can use the main+delta scheme to make indexing faster.
You can use UpdateAttributes() (or corresponding SphinxQL 'UPDATE' query) to increment the query hit count and index less frequent. This is especially useful if you don't have a lot of new queries coming and mainly you need to update the existing ones. Remember one thing in this case: if you have a lot of hits per second you might encounter a collision issue since Sphinx cannot increment attributes instantly, it can only update it, so you will have to increment on the application side and if you do it simultaneously in a few processes you could end up with less increments than expected.
If you have a lot of memory and performance is more important for you than memory consumption you can put the query in its original state (before the normalization) into a string attribute. This way you won't need to make an additional query to the db to find the query text and also you won't need to store the query id in the index. The structure in this case is:

id: hash (normalized query)
field: normalized query
attr: count
attr: original query

Since the queries index is usually not very large you can use a real time index and then you won't need to reindex it at all. If you do this, you might also want to periodically delete older queries. This is a good way to make the id based not just on hash(query), but also on the hash(query + date) and the hash(query) which will be an additional attribute as well as 'date'.

So your index structure will be:

id: hash (normalized query concatenated with date)
field: normalized query
attr: count
attr: hash(normalized query)
attr: date
attr: id in db / original query

Then you need to do two more things in the Sphinx query:

Filter out older queries
Group by the hash (normalized query) and sort by sum(count)

One more thing to mention about related queries is that it makes sense to include only queries that actually produce results. People don't like to click and be taken to "nothing is found" page. The problem here is there might be a query that previously returned results, but doesn't find them any longer. You may have deleted them from your main search index. If you're worried about this, it makes sense to do two things:

Add one more attribute 'last_total_found' to the queries index and update it along with updating the 'count' attribute. Thus the "nothing is found" issue can happen only for the first user who clicks on the link.
Periodically recheck of all your queries to be sure they're still producing results even though this might take additional resources.