Sphinx in Action: Good and bad in Sphinx real time indexes

This is a post from a series of posts called “Sphinx in Action”. See the whole series by clicking here.

Sphinx has supported real time indexes since version 1.10. Since then they have become more stable and robust, and they are now ready for use in production. Many people like them because they are really simple to understand (i.e. no indexing runs, cron tasks, main + delta schemes, and so on), but anyone who wants to use them should also be aware of their drawbacks compared with traditional monolithic indexes.

A real time index consists of 2 parts: one is stored in memory and the other on disk. When an insert/update query is sent to a real time index, the in-memory part is updated. This works very fast, but memory isn’t unlimited. As data accumulates and the in-memory part exceeds rt_mem_limit, it gets converted into a disk chunk and a new in-memory part is started, and so on. If you have a 10GB index and rt_mem_limit is set to 1GB, you will end up with 10 disk chunks. If you set rt_mem_limit to 10GB, you will have only 1 disk chunk. Now here’s the dilemma: the more disk chunks you have, the worse your search performance will be, but the less memory is needed to support the real time index. Conversely, if you set rt_mem_limit to a high value you will have good search performance because you will have fewer disk chunks, but the index will take up more memory on your server. Unfortunately, a Sphinx real time index requires much more memory than a traditional index. This is because a traditional index keeps only attributes and wordlists in memory, while a real time index keeps everything else there as well until it’s converted into a disk chunk.
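The chunk arithmetic above can be sketched in a few lines of Python. This is only a rough estimate that follows the simplified reasoning in this post (one disk chunk each time the in-memory part fills up to rt_mem_limit); the function name is made up for the illustration and real chunk counts may differ.

```python
import math


def disk_chunk_count(index_size_gb: float, rt_mem_limit_gb: float) -> int:
    """Rough number of disk chunks for a real time index.

    Assumption: one disk chunk is written each time the in-memory
    part reaches rt_mem_limit, as described in the text above.
    """
    if index_size_gb < 0 or rt_mem_limit_gb <= 0:
        raise ValueError("index size must be >= 0 and rt_mem_limit must be > 0")
    return math.ceil(index_size_gb / rt_mem_limit_gb)


# Compare a few rt_mem_limit settings for the 10GB index from the example.
for limit_gb in (1, 2, 5, 10):
    chunks = disk_chunk_count(10, limit_gb)
    print(f"rt_mem_limit={limit_gb}GB -> about {chunks} disk chunk(s)")
```

A lower limit saves memory at the cost of more chunks to search; a higher limit does the opposite.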

Here’s what it looks like for the same data (1M docs) when rt_mem_limit is high enough:

Traditional index:
[root@SE01 snikolaev]# ls -sh idx.sp*
12M idx.spa
186M idx.spd
8.0K idx.sph
11M idx.spi
4.0K idx.spk
4.0K idx.spl
4.0K idx.spm
103M idx.spp
8.0K idx.sps

Real time index:
[root@SE01 snikolaev]# ls -sh idx_rt.*
8.0K idx_rt.kill
4.0K idx_rt.lock
8.0K idx_rt.meta
442M idx_rt.ram

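The files that are kept in memory are idx.spa (attributes) and idx.spi (wordlist) for the traditional index, and idx_rt.ram for the real time index. As you can see, the traditional index requires 23MB (12M + 11M) while the real time index needs more than 400MB of memory.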
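To double-check the numbers, here is a small Python sketch that sums the memory-resident files from the two listings above. It assumes the resident files are .spa and .spi for the traditional index and .ram for the real time one, which is what the 23MB figure implies; the helper names are made up for the illustration.

```python
# Multipliers that convert `ls -sh` unit suffixes to megabytes.
UNITS_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def size_mb(text: str) -> float:
    """Convert an `ls -sh` size such as '8.0K' or '186M' to megabytes."""
    return float(text[:-1]) * UNITS_TO_MB[text[-1]]


def resident_mb(listing: str, resident_suffixes: tuple) -> float:
    """Sum the sizes of the files whose names end with one of the suffixes."""
    total = 0.0
    for line in listing.strip().splitlines():
        size, name = line.split()
        if name.endswith(resident_suffixes):
            total += size_mb(size)
    return total


TRADITIONAL = """
12M idx.spa
186M idx.spd
8.0K idx.sph
11M idx.spi
4.0K idx.spk
4.0K idx.spl
4.0K idx.spm
103M idx.spp
8.0K idx.sps
"""

REAL_TIME = """
8.0K idx_rt.kill
4.0K idx_rt.lock
8.0K idx_rt.meta
442M idx_rt.ram
"""

plain = resident_mb(TRADITIONAL, (".spa", ".spi"))
rt = resident_mb(REAL_TIME, (".ram",))
print(f"traditional: {plain:.0f}MB, real time: {rt:.0f}MB, ratio: {rt / plain:.1f}x")
```

For this dataset the real time index needs roughly 19 times more memory than the traditional one.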

In practice, we usually recommend real time indexes in 2 cases:

  1. When the data volume is really small and is not going to grow quickly. It doesn’t really matter whether you spend 5MB or 100MB if you have a few gigs of memory, so real time indexes make sense here: you avoid having an indexing routine and can synchronize your database and your search index at the level of each data insert.
  2. When the data volume is large and growing, but you want to reduce indexing latency (i.e. you want your data to appear in the index in real time). And here comes the best part of using real time indexes: they can be combined with traditional indexes using a Sphinx distributed index. In this case you still use the main + delta scheme, but make the delta real time. Once the main part is rebuilt, you clean the real time delta index. Since the delta is real time, new data becomes searchable immediately, and because it’s only the delta, it doesn’t require a lot of memory. The only routine left is to flush it periodically to the main part. To empty the real time index, the latest Sphinx builds provide the “TRUNCATE RTINDEX” SphinxQL command.
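The periodic flush from the second case can be sketched as a two-step plan. The Python below only builds the commands and does not run anything. The index names (idx_main, idx_rt_delta) and the config path are hypothetical, and the sketch assumes the main index is rebuilt with indexer --rotate and the delta is then emptied with TRUNCATE RTINDEX over SphinxQL.

```python
import shlex


def flush_plan(main_index: str, rt_delta: str,
               config_path: str = "/etc/sphinx/sphinx.conf"):
    """Return the steps to fold a real time delta into the main index.

    Step 1 rebuilds the traditional main index from the database.
    Step 2 empties the real time delta once the main part is rebuilt.
    All names and the config path are placeholders for this example.
    """
    rebuild = ["indexer", "--config", config_path, "--rotate", main_index]
    truncate = f"TRUNCATE RTINDEX {rt_delta}"
    return [
        ("shell", shlex.join(rebuild)),
        ("sphinxql", truncate),
    ]


for kind, command in flush_plan("idx_main", "idx_rt_delta"):
    print(f"{kind}: {command}")
```

The sketch does not handle documents that arrive in the delta while the main index is being rebuilt, so a real routine has to make sure those are not lost when the delta is truncated.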

8 Comments

Yaroslav Vorozhko, April 10th, 2012 at 6:07 pm

Sergey, thanks, great article.
What was the size of the textual data?

The “TRUNCATE RTINDEX” command is not documented yet.
So there are two possibilities: either it is not recommended or it is a documentation bug.

Sergey Nikolaev, April 11th, 2012 at 2:54 am

It was a 1M docs dataset. TRUNCATE RTINDEX is documented in the Sphinx sources, just not yet released/published on the site.

Nacer Cher, July 20th, 2013 at 9:05 pm

Can you give me an idea about the numbers?
For example, for an index containing 10 million rows, each one equivalent to the title of an article, how much RAM will we need to work with real-time indexes?

Sergey Nikolaev, July 22nd, 2013 at 6:56 am

Hi Nacer

A 10M docs index (one doc = a typical article subject + one integer attribute) takes ~120M in RAM as a traditional index. An RT index will take from 120M up to ~400M depending on rt_mem_limit.

Nacer Cher, August 1st, 2013 at 4:37 pm

That seems good!
If we create a real-time index using the ATTACH command, will the memory amount be the same?

Sergey Nikolaev, August 13th, 2013 at 10:39 am

Hi Nacer

If you run ATTACH, the memory required will be the same as for the index you’re converting from, no matter what rt_mem_limit is.

asd, September 28th, 2014 at 5:23 pm

RT as delta has some problems. If an attribute is 1 in main and 0 in delta, the document will be searchable by both values, no matter what position the index has in the select query. Detailed here: http://sphinxsearch.com/forum/view.html?id=12811

ajay, May 11th, 2017 at 10:13 am

Is XML indexing possible in Sphinx?
