Plain, RT and mixed indexes performance comparison

A plain index is an index that has a ‘source’ section in the Sphinx config file and is filled by the ‘indexer’ tool.

A real-time (RT) index doesn’t have a ‘source’ section; it only has the fields and attributes defined in its ‘index’ section of the Sphinx config. An RT index cannot be filled by a Sphinx tool such as ‘indexer’; instead, the application that uses Sphinx has to upload the data to the RT index itself.
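
To give an idea of what "the application uploads the data itself" can look like, here is a minimal sketch (not the code used for the tests in this post). It assumes searchd is reachable over SphinxQL, i.e. the MySQL protocol, through some DB-API style client that does client-side `%s` parameter substitution. The helper names and the `title` and `content` fields are made up for the example; only the index name comes from the config shown in this post.

```python
def build_rt_replace(index, doc_id, fields):
    """Build a parameterized SphinxQL REPLACE statement for one document.

    `fields` maps hypothetical RT field/attribute names to their values.
    """
    columns = ["id"] + list(fields)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "REPLACE INTO {} ({}) VALUES ({})".format(
        index, ", ".join(columns), placeholders
    )
    return sql, [doc_id] + list(fields.values())


def upload_documents(cursor, index, documents):
    """Send (doc_id, fields) pairs to an RT index through a DB-API cursor.

    `cursor` is assumed to come from a MySQL-protocol client connected to
    searchd; any object with an execute(sql, args) method will do.
    """
    count = 0
    for doc_id, fields in documents:
        sql, args = build_rt_replace(index, doc_id, fields)
        cursor.execute(sql, args)
        count += 1
    return count
```

REPLACE is used here rather than INSERT so that re-sending a document with an existing id overwrites it; whether that is what you want depends on your application.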

A mixed index is a distributed index that refers to both plain and RT indexes, for example:

index distributed
{
    type = distributed
    local = plain_main_index
    local = real_time_increment_index
}

With such a config we can work with plain and RT indexes simultaneously.

Performance comparison

I’ve run a few comparative tests for each of the Sphinx index types:

  1. disk space usage
  2. single query performance
  3. multi query performance

All tests were run against 4 different datasets (based on Wikipedia articles): 10K, 100K, 1M and 2M documents.

Disk space utilization comparison

I tested disk space usage for plain indexes compared to RT ones:

disk space utilization by plain/RT indexes

The red color is RT indexes, the blue color is plain indexes.

As you can see, RT indexes for the 1M and 2M datasets take 20% more space than plain indexes. The important thing here, though, is that Sphinx plain indexes need extra space during reindexing (peak usage can be up to 3 times their original size). So with Sphinx real-time indexes we can make better use of server disk space.
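
The trade-off can be put into numbers with a small back-of-the-envelope sketch. It takes the two figures from this post at face value (RT is about 20% larger, plain reindexing peaks at up to 3 times the index size); the function name and the 10 GB input are made up for illustration and are not measurements.

```python
def disk_budget(plain_size_gb, rt_overhead=0.20, reindex_peak_factor=3.0):
    """Estimate disk space to provision for a plain vs. an RT index.

    rt_overhead and reindex_peak_factor default to the figures quoted in
    the post; treat them as rough assumptions, not guarantees.
    """
    return {
        "plain_steady_gb": plain_size_gb,
        "plain_reindex_peak_gb": plain_size_gb * reindex_peak_factor,
        "rt_steady_gb": plain_size_gb * (1 + rt_overhead),
    }


if __name__ == "__main__":
    # Hypothetical 10 GB plain index.
    for name, size in disk_budget(10).items():
        print("{}: {:.1f} GB".format(name, size))
```

Under these assumptions the RT index is larger day to day, but the disk has to be sized for the plain index's reindexing peak, which is the larger number.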

Single query performance comparison

I built a keyword list from the 1000 most popular words in the datasets and ran a search for each of those 1000 keywords. Here’s the result:

single queries plain/RT indexes comparison

The red color is RT indexes, the blue color is plain indexes, the yellow color is mixed indexes.

You can see that small indexes perform almost equally whether they’re plain, RT or mixed, while large RT indexes perform much worse than plain and mixed indexes (this can be tuned with different rt_mem_limit values, but that is a subject for another blog post).
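
For readers who want to repeat this kind of test, here is a minimal sketch of the method described in this section: pick the most frequent words, then time one search per word. It is not the script used for the numbers in this post. The `search` argument is a placeholder for whatever function sends a query to Sphinx in your setup, and the simple `\w+` tokenizer is an assumption.

```python
import re
import time
from collections import Counter


def top_keywords(documents, limit=1000):
    """Return the `limit` most frequent words across all document texts."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(limit)]


def time_queries(search, keywords):
    """Run search(keyword) for every keyword, one after another.

    Returns the total elapsed wall-clock time in seconds.
    """
    started = time.perf_counter()
    for keyword in keywords:
        search(keyword)
    return time.perf_counter() - started
```

Running `time_queries` once per index type (plain, RT, mixed) with the same keyword list gives directly comparable totals.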

Multi query performance comparison

I did exactly what I did in the previous test, but with 5 Sphinx queries running simultaneously. Here’s what I got:

multi queries plain/RT indexes comparison

The red color is RT indexes, the blue color is plain indexes, the yellow color is mixed indexes. The conclusion is the same as in the previous test: RT indexes perform worse on large datasets. However, using multiqueries you can get the same or better performance with RT indexes as you get with plain indexes without multiqueries (provided your server is not overloaded by other tasks, which would reduce this effect).
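
A sketch of how 5 simultaneous queries could be driven from a client is shown here. This post does not spell out how the queries were issued, so the sketch assumes 5 parallel client threads; that is not the same thing as Sphinx's batched multi-query API. As before, `search` is a placeholder for your own query function and should be safe to call from several threads (for example, one connection per thread).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_concurrent_queries(search, keywords, workers=5):
    """Run search(keyword) for every keyword with `workers` threads.

    Returns (elapsed_seconds, number_of_completed_queries).
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all queries to finish before the timer stops.
        results = list(pool.map(search, keywords))
    return time.perf_counter() - started, len(results)
```

Comparing this total with the sequential run over the same keyword list shows how much a given index type gains from concurrency.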

What can we say as a result?

  • RT indexes keep up with plain indexes only on small amounts of data
  • For better performance you should design your application around multiqueries
  • RT indexes are a good alternative to the widely used incremental indexes, which are usually small

Good luck!
