High performance BuildExcerpts() with Sphinx Search

Overview

Since version 2.0.1, Sphinx has been able to build snippets in parallel mode, meaning it can use several CPUs to do the job. Below are instructions on how to do that efficiently, but I recommend it only if you need to build excerpts for large amounts of text, on the order of 10-100 MB.

Sphinx parallel processing is controlled by the ‘dist_threads’ option, which tells searchd how many CPUs to use for search processing. This parameter is also used by the BuildExcerpts() API call in combination with the ‘load_files’ option. By default, the first parameter of BuildExcerpts() is expected to be an array of text strings, but if the ‘load_files’ option is set to ‘1’, it should instead contain an array of file names, where each file holds the text for which you want to build an excerpt. Together, these two options allow Sphinx to build excerpts in parallel, which is much faster for large amounts of text.
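The difference between the two calling modes is only in what the first argument holds. A minimal sketch (the file names and the commented-out calls are illustrative, assuming a connected SphinxClient in $client and an index named ‘index’):

```php
<?php
// Default mode: the first argument is an array of text strings.
$docs = array('first document text ...', 'second document text ...');
// $res = $client->BuildExcerpts($docs, 'index', 'keyword', array('limit' => 300));

// load_files mode: the first argument is an array of file names,
// one file per document, with 'load_files' => 1 in the options.
// searchd reads the files itself and can process them in parallel.
$files = array('/space/doc1.txt', '/space/doc2.txt');
// $res = $client->BuildExcerpts($files, 'index', 'keyword',
//         array('limit' => 300, 'load_files' => 1));
```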

But this implementation has a bottleneck: it requires a file system to read and write the files. If you use a disk, it may be 1000 times slower than doing the same work in memory, so the right solution is to use an in-memory file system.

tmpfs does the job: it is an in-memory file system supported by the Linux kernel since version 2.4. I used it to work around the file read/write performance issue.

File system

How to mount in-memory file system tmpfs:

mkdir /space
mount -t tmpfs -o size=1G,nr_inodes=10k,mode=0700 tmpfs /space
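To make the mount survive a reboot, an /etc/fstab entry along these lines can be used (same mount point and options as above; adjust the size to your needs):

```
tmpfs  /space  tmpfs  size=1G,nr_inodes=10k,mode=0700  0  0
```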

First I created the directory, then mounted tmpfs on it.
Among the parameters I specified a file system size of 1 GB and access permissions for the owner of /space only.

My BuildExcerpts() function based on files

function buildExcerptFile($documents, $keywords, $options = array())
{
    // Write each document into its own temporary file on the tmpfs mount.
    $files = array();
    foreach ($documents as $doc) {
        $file = '/space/snip_' . md5($doc) . '_' . time();
        file_put_contents($file, $doc);
        $files[] = $file;
    }

    $client = new SphinxClient();
    $client->SetServer('localhost', 9312);

    // Pass file names instead of text strings; 'load_files' => 1 tells
    // searchd to read the listed files and process them in parallel.
    $res = $client->BuildExcerpts($files, 'index', $keywords,
        array_merge(array(
            'around'     => 10,
            'limit'      => 300,
            'load_files' => 1,
        ), $options)
    );

    // Remove the temporary files to free tmpfs memory.
    foreach ($files as $file) {
        unlink($file);
    }

    return $res;
}

The function works in three stages:

  1. Convert the text documents into temporary files. I chose dynamic file names to prevent file name collisions.
  2. Call BuildExcerpts(). The first parameter contains the list of file names instead of the list of documents, and the options array contains ‘load_files’ set to ‘1’, which tells BuildExcerpts() to treat the documents as files.
  3. Remove the temporary files to free the memory they occupy.
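The first stage can be sketched on its own. Note that md5($doc) plus time() can still collide if the same document is passed twice within one second; tempnam() sidesteps that by guaranteeing a unique, newly created file. The default directory below is an assumption matching the mount point used in this post:

```php
<?php
// Write each document into its own unique file under $dir and
// return the list of file names, in the same order as $documents.
function writeDocsToFiles(array $documents, $dir = '/space')
{
    $files = array();
    foreach ($documents as $doc) {
        // tempnam() creates a unique empty file and returns its name.
        $file = tempnam($dir, 'snip_');
        file_put_contents($file, $doc);
        $files[] = $file;
    }
    return $files;
}
```

After BuildExcerpts() returns, each file should still be unlink()ed, exactly as in stage three above, to free the tmpfs memory.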

Setup dist_threads option

Add the following in the searchd section of your Sphinx config:

dist_threads = 2

I prefer to set dist_threads equal to the number of CPUs in the system.
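In context, the searchd section would look something like this (the listen value is just an example; in Sphinx 2.x, dist_threads only takes effect in threaded workers mode):

```
searchd
{
    listen       = 9312
    # dist_threads is only honored with workers = threads
    workers      = threads
    # one worker thread per CPU core
    dist_threads = 2
}
```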

Conclusion

In my testing environment this was about twice as fast as the default BuildExcerpts() call.
The average document size was about 3-10 MB, and I passed 100 documents per BuildExcerpts() call.

2 Comments

Candido, March 13th, 2012 at 2:58 am

Can you publish an example test.php file demonstrating this post?

Yaroslav Vorozhko, March 13th, 2012 at 2:32 pm

The script is very simple:

<?php
include_once('sphinxapi.php');

$documents = file('text.txt');     // text snippets - one per line
$keywords  = file('keywords.txt'); // search terms - one per line

$start = microtime(true);
foreach ($keywords as $keyword) {
    buildExcerptFile($documents, $keyword);
}
$stop = microtime(true);

echo ($stop - $start) . " seconds\n";
?>

So in this script, for each keyword, around 100 excerpts are generated.

The same script can be used to measure the performance of the regular BuildExcerpts() call.
