Sphinx in Action: It all starts with indexing

This is a post from a series of posts called “Sphinx in Action”.

Sphinx usage in any project usually starts with building the Sphinx configuration and indexing your data. In this article I’m going to point out some cool things Sphinx provides around configuration and how your data gets indexed:

  • First, I have to say that the Sphinx configuration supports inheritance. This is very useful because you can define things (like the connection to your database) only once and they will be reused by all the other sources. Here’s an example:

    source text1
    {
            type            = mysql
            sql_host        = localhost
            sql_user        = b
            sql_pass        = u
            sql_db          = b
            sql_port        = 3306
            # `table` and `long` are reserved words in MySQL, so they have to be quoted
            sql_query       = select id, body, published, lat, `long`, category from `table`
            sql_attr_timestamp      = published
            sql_attr_float  = lat
            sql_attr_float  = long
            sql_attr_uint   = category
    }
    
    source text2 : main text1
    {
            # the connection settings are inherited from text1; we only override what differs
            sql_query       = select id, user_name, inserted from table2
            sql_attr_timestamp      = inserted
            # empty values reset the attribute lists inherited from text1
            sql_attr_float  =
            sql_attr_uint   =
    }
  • If you’re a real hater of copy-paste technology, you’ll be happy to know that the Sphinx config supports a shebang line, which lets you generate your config with your favorite scripting language. For example:
    #!/usr/bin/php
    <?php
    // connect to the db that maps each data chunk to the server hosting it
    $m = new mysqli('maindb', 'user', 'password', 'main');
    $res = $m->query("select site_map.id, ip from site_map left join server on site_map.master_id = server.id");
    while ($row = $res->fetch_assoc()) {
            $n = $row['id'];
            $host = $row['ip'];
            // emit one source section per chunk
            echo "
    source chunk{$n} {
        type = mysql
        sql_host = {$host}
        sql_user = user
        sql_pass = pass
        sql_db = c{$n}
        sql_query = select id, {$n} chunk_id, body from a{$n} where id >= \$start AND id <= \$end and crawled = 0
        sql_query_range = SELECT MIN(id), MAX(id) FROM a{$n}
        sql_range_step = 100000
    }
    ";
    }
    ...

This creates many possibilities for making the Sphinx config really dynamic, which is especially important in large applications with a lot of indexes and multiple servers. You can set up your config just once, and as you scale your project, it will rebuild itself automatically. One thing you will need to be careful about is sending a signal to the running searchd process so it starts using the new config. You can make life even easier by incorporating the signal sending into your indexing routine: once you reindex and see that the config has been updated, you inform your searchd process about this and it switches to the new config.
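As a minimal sketch of that reload step (the pid file path below is an assumption, check the pid_file setting in your own config; exact SIGHUP behavior also varies between Sphinx versions):

    # Rebuild all indexes and signal the running searchd to pick up the
    # new files; indexer sends the signal to searchd for you.
    indexer --all --rotate

    # Or send SIGHUP yourself: searchd rotates its indexes on SIGHUP and,
    # in recent versions, also picks up indexes newly added to the config.
    kill -HUP $(cat /var/run/sphinx/searchd.pid)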

Another cool thing you can do with data indexing is to use the so-called main+delta scheme. Here’s what it gives you:

  • It reduces the frequency that the main part of the index needs to be rebuilt.
  • When you update a field in your data source and need that change reflected in your Sphinx index, you only need to rebuild the delta. When the delta grows big enough that rebuilding it takes significant time, you can fold all of its data into the main index (see the merge command after this list). This empties the delta, so it is ready to accept new data and rebuild quickly again.
  • It makes your updated data appear in the index much faster.
  • By fetching less data from the database, it reduces the load on your database server.
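Folding the delta into the main part, by the way, doesn’t require a full rebuild: indexer can merge two indexes on disk. A minimal sketch, assuming the index names main and delta used in the examples below:

    # Merge the delta into the main index, then signal the running
    # searchd to pick up the merged files.
    indexer --merge main delta --rotate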

There are two main approaches when it comes to the main+delta:

  • Split data by ‘id’:
    source main {
            sql_query_range = select min(id), max(id) from dogs
            sql_query       = select id, name from dogs where id >= $start and id <= $end
            sql_range_step  = 1000
            # remember the highest indexed id; $maxid is expanded by indexer
            sql_query_post_index = replace into sphinx_helper (type, const) values ('dogs', $maxid)
    ...
    }
    
    source delta : main {
            # index only documents added since the main part was last built
            sql_query_range = select const, (select max(id) from dogs) from sphinx_helper where type = 'dogs'
            # don't advance the boundary here; it should move only when main is rebuilt
            sql_query_post_index =
    ...
    }

    This is the more traditional method; its drawback is that if an older document gets updated while its ‘id’ stays the same, the change won’t be reindexed until the main part of the index is rebuilt.

  • Split data by ‘updated’:
    source main {
            # snapshot the current max(updated) before indexing starts;
            # this assumes the 'cats' row already exists in sphinx_helper
            sql_query_pre   = update sphinx_helper set tmp = (select unix_timestamp(max(updated)) from cats) where type = 'cats'
            sql_query_range = select unix_timestamp(min(updated)), (select tmp from sphinx_helper where type = 'cats') from cats
            sql_query       = select id, name from cats where updated >= from_unixtime($start) and updated <= from_unixtime($end)
            # the boundary advances only when the main part is rebuilt
            sql_query_post_index = update sphinx_helper set const = tmp where type = 'cats'
            sql_range_step  = 3600
    ...
    }
    
    source delta : main {
            sql_query_range = select const, unix_timestamp() from sphinx_helper where type = 'cats'
            # suppress the old copies of updated documents in the main part
            sql_query_killlist = select id from cats where unix_timestamp(updated) > (select const from sphinx_helper where type = 'cats')
            # keep the boundary unchanged on delta rebuilds
            sql_query_post_index =
    ...
    }

    This avoids the drawback of the split-by-id approach: once an old document gets updated, its ‘updated’ field value changes as well, so it gets indexed as soon as the delta is next rebuilt. Technically the updated document will now be present in both the delta and the main parts; sql_query_killlist explicitly tells Sphinx to use the copy in the delta.
    Be careful to pick a good sql_range_step value: it defines the window of time used to fetch documents from the database in one query (3600 means one hour). If the value is too low, there will be many queries that return no rows, wasting resources and indexing time. On the other hand, if it is too high, a single query may fetch too much data and overload the database.
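To tie the pieces together, here is a minimal sketch of the index declarations and scheduling for a main+delta pair; the paths and cron timings below are assumptions. Note that a delta’s kill-list applies to the indexes listed before it in the same query, so always search both parts together (e.g. main, delta):

    index main {
            source = main
            # assumed on-disk location
            path   = /var/lib/sphinx/main
    }
    
    index delta : main {
            source = delta
            path   = /var/lib/sphinx/delta
    }
    
    # crontab: rebuild the delta often and the main part rarely;
    # --rotate tells the running searchd to pick up the new files
    */5 * * * * indexer delta --rotate
    0 3 * * *   indexer main --rotate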
