Sphinx in Action: It all starts with indexing

This is a post from a series of posts called “Sphinx in Action”.

Sphinx usage in any project usually starts with building the Sphinx configuration and indexing your data. In this article I’m going to point out some cool things Sphinx provides around configuration and how your data gets indexed:

  • First, I have to say that the Sphinx configuration supports inheritance. This is very useful because you can define things (like the connection to your database) only once and they will be reused by all the other sources. Here’s an example:

    source text1
    {
            type            = mysql
            sql_host        = localhost
            sql_user        = b
            sql_pass        = u
            sql_db          = b
            sql_port        = 3306
            # `table` and `long` are reserved words in MySQL, so they have to be quoted
            sql_query       = select id, body, published, lat, `long`, category from `table`
            sql_attr_timestamp      = published
            sql_attr_float  = lat
            sql_attr_float  = long
            sql_attr_uint   = category
    }
    
    source text2 : main text1
    {
            # the connection settings are inherited from text1; we only override what differs
            sql_query       = select id, user_name, inserted from table2
            sql_attr_timestamp      = inserted
            # empty values reset the attribute lists inherited from text1
            sql_attr_float  =
            sql_attr_uint   =
    }
  • If you’re a real hater of copy-paste technology, you’ll be happy to know that the Sphinx config supports a shebang line, which lets you generate your config with your favorite scripting language. For example:
    #!/usr/bin/php
    <?php
    // connect to the db that maps each data chunk to the server hosting it
    $m = new mysqli('maindb', 'user', 'password', 'main');
    $res = $m->query("select site_map.id, ip from site_map left join server on site_map.master_id = server.id");
    while ($row = $res->fetch_assoc()) {
            $n = $row['id'];
            $host = $row['ip'];
            // emit one source section per chunk
            echo "
    source chunk{$n} {
        type = mysql
        sql_host = {$host}
        sql_user = user
        sql_pass = pass
        sql_db = c{$n}
        sql_query = select id, {$n} chunk_id, body from a{$n} where id >= \$start AND id <= \$end and crawled = 0
        sql_query_range = SELECT MIN(id), MAX(id) FROM a{$n}
        sql_range_step = 100000
    }
    ";
    }
    ...

This creates many possibilities for making the Sphinx config really dynamic, which is especially important in large applications with a lot of indexes and multiple servers. You can set up your config just once, and as you scale your project, it will rebuild itself automatically. One thing you will need to be careful about is sending a signal to the running searchd process so it starts using the new config. You can make life even easier by incorporating the signal sending into your indexing routine: once you reindex and see that the config has been updated, you inform your searchd process about this and it switches to the new config.
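As a minimal sketch of that reload step (the pid file path below is an assumption, check the pid_file setting in your own config; exact SIGHUP behavior also varies between Sphinx versions):

    # Rebuild all indexes and signal the running searchd to pick up the
    # new files; indexer sends the signal to searchd for you.
    indexer --all --rotate

    # Or send SIGHUP yourself: searchd rotates its indexes on SIGHUP and,
    # in recent versions, also picks up indexes newly added to the config.
    kill -HUP $(cat /var/run/sphinx/searchd.pid)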

Another cool thing you can do with data indexing is to use the so-called main+delta scheme. Here’s what it gives you:

  • It reduces the frequency that the main part of the index needs to be rebuilt.
  • When you update a field in your data source and need that change reflected in your Sphinx index, you only need to rebuild the delta. When the delta grows big enough that rebuilding it takes significant time, you can fold all of its data into the main index (see the merge command after this list). This empties the delta, so it is ready to accept new data and rebuild quickly again.
  • It makes your updated data appear in the index much faster.
  • By fetching less data from the database, it reduces the load on your database server.
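Folding the delta into the main part, by the way, doesn’t require a full rebuild: indexer can merge two indexes on disk. A minimal sketch, assuming the index names main and delta used in the examples below:

    # Merge the delta into the main index, then signal the running
    # searchd to pick up the merged files.
    indexer --merge main delta --rotate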

There are two main approaches when it comes to the main+delta:

  • Split data by ‘id’:
    source main {
            sql_query_range = select min(id), max(id) from dogs
            sql_query       = select id, name from dogs where id >= $start and id <= $end
            sql_range_step  = 1000
            # remember the highest indexed id; $maxid is expanded by indexer
            sql_query_post_index = replace into sphinx_helper (type, const) values ('dogs', $maxid)
    ...
    }
    
    source delta : main {
            # index only documents added since the main part was last built
            sql_query_range = select const, (select max(id) from dogs) from sphinx_helper where type = 'dogs'
            # don't advance the boundary here; it should move only when main is rebuilt
            sql_query_post_index =
    ...
    }

    This is the more traditional method; its drawback is that if an older document gets updated while its ‘id’ stays the same, the change won’t be reindexed until the main part of the index is rebuilt.

  • Split data by ‘updated’:
    source main {
            # snapshot the current max(updated) before indexing starts;
            # this assumes the 'cats' row already exists in sphinx_helper
            sql_query_pre   = update sphinx_helper set tmp = (select unix_timestamp(max(updated)) from cats) where type = 'cats'
            sql_query_range = select unix_timestamp(min(updated)), (select tmp from sphinx_helper where type = 'cats') from cats
            sql_query       = select id, name from cats where updated >= from_unixtime($start) and updated <= from_unixtime($end)
            # the boundary advances only when the main part is rebuilt
            sql_query_post_index = update sphinx_helper set const = tmp where type = 'cats'
            sql_range_step  = 3600
    ...
    }
    
    source delta : main {
            sql_query_range = select const, unix_timestamp() from sphinx_helper where type = 'cats'
            # suppress the old copies of updated documents in the main part
            sql_query_killlist = select id from cats where unix_timestamp(updated) > (select const from sphinx_helper where type = 'cats')
            # keep the boundary unchanged on delta rebuilds
            sql_query_post_index =
    ...
    }

    This avoids the drawback of the split-by-id approach: once an old document gets updated, its ‘updated’ field value changes as well, so it gets indexed as soon as the delta is next rebuilt. Technically the updated document will now be present in both the delta and the main parts; sql_query_killlist explicitly tells Sphinx to use the copy in the delta.
    Be careful to pick a good sql_range_step value: it defines the window of time used to fetch documents from the database in one query (3600 means one hour). If the value is too low, there will be many queries that return no rows, wasting resources and indexing time. On the other hand, if it is too high, a single query may fetch too much data and overload the database.
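To tie the pieces together, here is a minimal sketch of the index declarations and scheduling for a main+delta pair; the paths and cron timings below are assumptions. Note that a delta’s kill-list applies to the indexes listed before it in the same query, so always search both parts together (e.g. main, delta):

    index main {
            source = main
            # assumed on-disk location
            path   = /var/lib/sphinx/main
    }
    
    index delta : main {
            source = delta
            path   = /var/lib/sphinx/delta
    }
    
    # crontab: rebuild the delta often and the main part rarely;
    # --rotate tells the running searchd to pick up the new files
    */5 * * * * indexer delta --rotate
    0 3 * * *   indexer main --rotate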
