Sphinx in Action: How Sphinx handles text during indexing

This is a post from a series of posts called "Sphinx in Action". See the whole series by clicking here.

When integrating full-text search into your application it's important to think about tokenizing your text, or how it gets split up into words. This is the first process your data goes through once you've built the config and begun indexing.

Sphinx includes a variety of settings that give you full control of the way your text gets tokenized. They are:

charset_table; lets you define the characters that are treated as normal characters; everything else becomes a separator. You can also use:
ngram_chars; useful if you have to deal with Chinese, Japanese or Korean text, whose structure differs significantly from other languages because words are not separated by spaces. With n-gram indexing enabled (ngram_len = 1), Sphinx inserts a separator after each character listed here, so instead of one long word each character is indexed as a separate token. The same happens with the query, allowing your users to find what they're looking for. Example: ngram_chars = U+3000..U+2FA1F
ignore_chars; lets you ignore some characters completely: they are dropped from the text, so the characters on either side are joined into one word.
blend_chars; allows you to make some characters both separators and normal characters. For example, if you have Twitter-related things like @username in your data and want to allow users to search for exact Twitter nicks, you can do so by adding '@' to blend_chars. The query '@username' then won't find the text 'username', while the query 'username' will still find them both. Example: blend_chars = @
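
The character-level settings above can be pictured with a small Python sketch. This is not Sphinx's tokenizer, only an illustration of the rules as described here: the function names `tokenize` and `split_on_blend` and their parameters are hypothetical, and lowercasing stands in for the case folding that a real charset_table would normally define.

```python
import string


def split_on_blend(chunk, blend_chars):
    """Split a chunk on blended characters, dropping empty pieces."""
    parts, current = [], ""
    for ch in chunk:
        if ch in blend_chars:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def tokenize(text, charset, ngram_chars="", ignore_chars="", blend_chars=""):
    """Toy tokenizer: each argument mimics the directive of the same name."""
    chunks, current = [], ""
    for ch in text.lower():               # stand-in for charset_table case folding
        if ch in ignore_chars:
            continue                      # ignored: neighbours are joined
        if ch in charset or ch in blend_chars:
            current += ch                 # part of the current word
            continue
        if current:                       # a separator or n-gram char ends the word
            chunks.append(current)
            current = ""
        if ch in ngram_chars:
            chunks.append(ch)             # each n-gram character is its own token

    if current:
        chunks.append(current)

    tokens = []
    for chunk in chunks:
        parts = split_on_blend(chunk, blend_chars)
        if not parts:
            continue                      # chunk held blended characters only
        if parts != [chunk]:
            tokens.append(chunk)          # blended form, e.g. "@username"
        tokens.extend(parts)              # plain form(s), e.g. "username"
    return tokens


LATIN = string.ascii_lowercase + string.digits + "_"

print(tokenize("Ask @username", LATIN, blend_chars="@"))
# ['ask', '@username', 'username']
print(tokenize("co\u00adoperate", LATIN, ignore_chars="\u00ad"))
# ['cooperate']
```

Because the blended form and the plain form are both emitted, a search for 'username' matches either spelling, while '@username' matches only the blended one.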

Exceptions, wordforms and stopwords

That concludes the character-level settings. The next stage the text goes through during indexing is word-level handling, controlled by stopwords, word length directives, exceptions and wordforms.

By using stopwords, you can make one or more lists of words that will not be indexed at all. These are typically very frequent words such as 'a', 'the' and 'and'. There are two goals here. Firstly, it improves search quality: once these words are excluded from both the index and the query, matching becomes more flexible. For example, if you have 'on', 'a', 'the' and 'my' in the stopwords, the query 'search on my site' will also find such phrases as 'search on a site', 'search on the site', 'search my site' etc. Secondly, it decreases the size of your index: words that occur in almost every document take up a lot of space and slow searches down.
The related directive used to configure stopword behavior is stopword_step. If it is set to 1 (the default), the phrase query "search on my site" will match any of the following phrases: "search on a site", "search on the site", or in general "search + ANY_STOPWORD + ANY_STOPWORD + site". However, if it is set to 0 it will also match "search my site" or "search site". In other words, all stopwords will just be ignored.
min_word_len; this specifies the minimum word length to index. Be careful to set the related directive overshort_step correctly. If it's set to 1 and min_word_len = 2, the phrase query "search on site" will not match "search on a site", because the skipped word 'a' still takes up a position. If it's set to 0, the query will match that phrase.
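
The effect of stopword_step and overshort_step is easier to see when you write down the position each indexed word ends up with. The following Python sketch is only a model of the behavior described above, not Sphinx code: `keyword_positions` and `phrase_matches` are hypothetical helpers, and words are simply split on whitespace.

```python
def keyword_positions(text, stopwords=(), stopword_step=1,
                      min_word_len=1, overshort_step=1):
    """Return (word, position) pairs for the words that would be indexed."""
    positions, pos = [], 0
    for word in text.lower().split():
        if word in stopwords:
            pos += stopword_step          # skipped, may still consume a position
        elif len(word) < min_word_len:
            pos += overshort_step         # too short, may still consume a position
        else:
            pos += 1
            positions.append((word, pos))
    return positions


def phrase_matches(query, document, **settings):
    """True if the query words occur in the document at the same relative positions."""
    q = keyword_positions(query, **settings)
    d = set(keyword_positions(document, **settings))
    if not q:
        return False
    first_pos = q[0][1]
    offsets = [(word, pos - first_pos) for word, pos in q]
    return any(
        all((word, start + off) in d for word, off in offsets)
        for _, start in d
    )


STOP = {"on", "a", "the", "my"}

# stopword_step = 1: stopwords leave a gap, so the number of stopwords must agree
print(phrase_matches("search on my site", "search on a site", stopwords=STOP))   # True
print(phrase_matches("search on my site", "search my site", stopwords=STOP))     # False

# stopword_step = 0: stopwords vanish without leaving a gap
print(phrase_matches("search on my site", "search my site",
                     stopwords=STOP, stopword_step=0))                           # True

# overshort_step works the same way for words shorter than min_word_len
print(phrase_matches("search on site", "search on a site", min_word_len=2))      # False
print(phrase_matches("search on site", "search on a site",
                     min_word_len=2, overshort_step=0))                          # True
```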
prefix/infix directives; while the directives above filter out things that shouldn't be indexed, these do the opposite and index substrings that should be. Indexing whole words is important, but sometimes it also makes sense to search by part of a word, e.g. if you type in "dog" you might also want to find "dogs", or you want "hero" to find "superhero". This is enabled by a variety of directives: min_prefix_len, min_infix_len, prefix_fields, infix_fields, enable_star, expand_keywords. Using these directives, you can configure exactly how your words are split into substrings: how many characters can be trimmed from the end or from both ends of a word, which full-text fields this applies to, and whether to support wildcard syntax (e.g. dog*) or treat all query words as substrings. Be aware that using prefixes and infixes increases your index size and might affect performance.
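
To get a feel for why prefixes and infixes grow the index, here is a minimal Python sketch that lists the substrings a single word would contribute. It only illustrates the idea behind min_prefix_len and min_infix_len; the helper names `prefixes` and `infixes` are hypothetical, and how Sphinx actually stores these substrings depends on the version and dictionary type.

```python
def prefixes(word, min_prefix_len):
    """All prefixes of at least min_prefix_len characters, including the word itself."""
    return [word[:n] for n in range(min_prefix_len, len(word) + 1)]


def infixes(word, min_infix_len):
    """All distinct substrings of at least min_infix_len characters."""
    found = set()
    for start in range(len(word)):
        for end in range(start + min_infix_len, len(word) + 1):
            found.add(word[start:end])
    return sorted(found)


print(prefixes("dogs", 3))               # ['dog', 'dogs']
print(infixes("hero", 3))                # ['ero', 'her', 'hero']
print("hero" in infixes("superhero", 4)) # True
```

A query for "dog" can match a document containing "dogs" because "dog" is one of the indexed prefixes, and "hero" can match "superhero" because it is one of the indexed infixes. The number of infixes grows roughly with the square of the word length, which is where the extra index size comes from.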
Exceptions and wordforms; another thing Sphinx allows you to do is define a list of words that should be mapped to each other. For example, you can tell Sphinx to treat USA, U.S.A, US, U.S, America, United States and United States of America as one and the same word. To do this, use the 'exceptions' directive. It works at a very low level, before the text is even tokenized. With it you can map all the variants to a specific one, for example USA, and the same will happen with the search query. This might dramatically increase your search quality, especially if you have to deal with products that have different names and abbreviations that all mean the same thing (PlayStation, Play Station, PS, Sony PlayStation etc.). Another reason to use exceptions is to index something that contains a stopword or a very short word which would otherwise be dropped. For example, "The Matrix" would be reduced to "Matrix" if 'the' is a stopword, and "vitamin a" would become "vitamin" when min_word_len is 2 or greater. *Note: Since exceptions work before tokenizing, they are case sensitive (at least this is how it works now).
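
The idea of mapping several spellings to one token before tokenizing can be sketched in a few lines of Python. This is an assumption-laden illustration, not how Sphinx implements exceptions: `apply_exceptions` is a hypothetical helper, the mapping targets such as "usa" and "thematrix" are made up for the example, and matching is done with a regular expression that prefers the longest source string.

```python
import re


def apply_exceptions(text, exceptions):
    """Replace each exception source with its target, case-sensitively."""
    if not exceptions:
        return text
    # Longest sources first, so "United States of America" wins over "United States".
    sources = sorted(exceptions, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in sources) + r")(?!\w)"
    )
    return pattern.sub(lambda m: exceptions[m.group(0)], text)


EXCEPTIONS = {
    "United States of America": "usa",
    "United States": "usa",
    "U.S.A": "usa",
    "US": "usa",
    "The Matrix": "thematrix",   # keeps the stopword 'The' attached to the title
}

print(apply_exceptions("I watched The Matrix in the US", EXCEPTIONS))
# I watched thematrix in the usa
print(apply_exceptions("tell us about the matrix", EXCEPTIONS))
# tell us about the matrix   (case differs, so nothing is mapped)
```

The same function would have to be applied to the query as well, so that both sides end up with the same token.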
Stemming; using stemming you can improve your search quality even more. For instance, if you enable the English stemmer, 'walking', 'walks' and 'walked' will all be converted to 'walk'. The same will happen with the query, and you will be able to find 'he was walking on the street' by searching for 'walked'. Sphinx supports English, Russian and Czech stemming out of the box, and if it was built with --with-libstemmer it supports many other languages through the Snowball libstemmer library.
HTML stripping; another nice feature is HTML stripping. Sphinx is often used to search web pages, which are HTML documents, and with the 'html_strip' directive you can let Sphinx do the HTML parsing job. This is important when you still need the markup in your data source and don't want to store both raw and stripped versions of the text.
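
As a rough picture of what stripping markup before indexing means, here is a Python sketch built on the standard library's `html.parser`. It is not Sphinx's stripper: `strip_html`, the small set of inline tags and the `remove_elements` parameter are all assumptions made for this example. The sketch drops inline tags silently, turns every other tag into whitespace, and discards the contents of the listed elements.

```python
from html.parser import HTMLParser

# Assumed for this sketch: tags that do not break a word when removed.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}


class _TextExtractor(HTMLParser):
    def __init__(self, remove_elements):
        super().__init__()                # character references are decoded by default
        self.remove_elements = set(remove_elements)
        self.depth = 0                    # > 0 while inside an element being dropped
        self.parts = []

    def _boundary(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append(" ")        # block-level tags separate words

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)
        if tag in self.remove_elements:
            self.depth += 1

    def handle_endtag(self, tag):
        self._boundary(tag)
        if tag in self.remove_elements and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def strip_html(markup, remove_elements=("script", "style")):
    """Return the text content of markup with whitespace collapsed."""
    parser = _TextExtractor(remove_elements)
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(strip_html("<h1>Sphinx</h1><p>Full-text <b>search</b> for AT&amp;T</p>"))
# Sphinx Full-text search for AT&T
```

The resulting plain text is what would then go through the character-level and word-level steps described earlier.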

To figure out exactly how your current index settings work, you can use the 'call keywords()' function in SphinxQL:

mysql> call keywords('abc a b c the AT&T A&BULL', 'idx');
+-----------+------------+
| tokenized | normalized |
+-----------+------------+
| abc       | abc        |
| b         | b          |
| c         | c          |
| AT&T      | AT&T       |
| bull      | bull       |
+-----------+------------+
5 rows in set (0.00 sec)

You can see that the text 'abc a b c the AT&T A&BULL' was split into the words 'abc', 'b', 'c', 'AT&T' and 'bull'. 'a' and 'the' were skipped because they're stopwords. 'AT&T' is an exception, which is why it was not split into 'AT' and 'T'. 'A&BULL', on the other hand, was split into 'A' and 'BULL', and then 'A' was not indexed because it's a stopword.

In the SphinxAPI this function is called 'BuildKeywords()'.