Should you switch to Sphinx real-time indexes?

The problem with regular indexes

The main drawback of regular (plain) indexes is how slowly they update: to change one, you have to rebuild it entirely.

For large amounts of data we usually use a main+delta scheme. The main index holds most of the data, and the delta index holds only recent changes. To keep the whole index up to date, we rebuild the delta index every 3-5 minutes. But the larger the delta grows, the longer it takes to rebuild. That’s why we recommend merging the delta into the main index every day or every week, depending on your data growth rate.
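As a rough illustration, a main+delta setup might look like the sketch below. It follows the pattern from the Sphinx documentation. The table names (documents, sph_counter), the column names and the index paths are assumptions for the example, and the database connection settings are omitted.

```
# Sketch only: table, column and path names are hypothetical.
source main
{
    # (sql_host, sql_user, sql_pass, sql_db omitted)

    # remember the highest document id covered by the main index
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query     = SELECT id, title, body FROM documents \
        WHERE id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id = 1 )
}

# delta inherits the settings of main and overrides the queries
source delta : main
{
    # empty value drops the inherited counter update
    sql_query_pre =
    sql_query     = SELECT id, title, body FROM documents \
        WHERE id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id = 1 )
}

index main
{
    source = main
    path   = /var/lib/sphinx/main
}

index delta : main
{
    source = delta
    path   = /var/lib/sphinx/delta
}
```

With a config like this, a cron job would typically run something like `indexer --rotate delta` every few minutes and `indexer --merge main delta --rotate` (or a full rebuild of main) once a day or week.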

But this approach has a couple of drawbacks:

  • high average system load due to frequent index rebuilds;
  • in the worst case, fresh data becomes searchable only after several minutes.

Real-time indexes

Sphinx 1.10 introduces real-time (RT) index support. The main idea is the ability to insert and update index records on the fly. RT indexes are accessed over the MySQL protocol, which lets us use existing MySQL client applications to work with them via SELECT, INSERT, REPLACE and DELETE statements.
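A minimal RT index definition could look like the sketch below. The index name rt_articles, the field and attribute names, the paths, the memory limit and the port are all assumptions chosen for illustration; adjust them to your own schema and check the directives against the documentation for your Sphinx version.

```
# Sketch only: index name, fields, attributes and paths are hypothetical.
index rt_articles
{
    type         = rt
    path         = /var/lib/sphinx/rt_articles

    # full-text fields
    rt_field     = title
    rt_field     = content

    # attributes for filtering and sorting
    rt_attr_uint      = category_id
    rt_attr_timestamp = published_at

    # RAM chunk size; larger values mean fewer disk chunks but more memory
    rt_mem_limit = 128M
}

searchd
{
    # accept MySQL-protocol (SphinxQL) connections on this port
    listen   = 9306:mysql41

    # early RT-capable releases needed threaded mode; may not apply to yours
    workers  = threads

    log      = /var/log/sphinx/searchd.log
    pid_file = /var/run/sphinx/searchd.pid
}
```

Assuming searchd is running with this config, a regular MySQL client can connect with `mysql -h 127.0.0.1 -P 9306` and issue statements such as `INSERT INTO rt_articles (id, title, content, category_id) VALUES (1, 'Hello', 'First document', 10)`. No indexer run is needed for the RT index.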

Currently there are some performance issues with real-time indexes on large data sets. But on smaller ones (say, up to 500,000 Wikipedia documents) their speed is comparable to that of regular indexes.

So the performance and simplicity of real-time indexes make them a preferable choice for indexing relatively small but frequently changing data.

Conclusion

Real-time indexes can be used as a replacement for the main+delta bundle of regular indexes. They can reduce server load and simplify the index update routine.

We can also use a mixed (distributed) index to plug real-time indexes into an existing application.

Here’s an example of a mixed index:

index distributed
{
    type  = distributed
    local = plain_main_index
    local = real_time_increment_index
}

In this example we query both the regular and the real-time index through a single distributed index. This way, migration to real-time indexes can be performed seamlessly, without significant modifications to the production system.

Good luck and have fun with real-time indexes.

4 Comments

Shankar Krish, October 25th, 2012 at 4:15 pm

Hello,
This post is more than 2 years old. You wrote: “Currently there are some performance issues with real-time indexes on large data sets.”
Would it be a fair assumption that the current version has addressed these performance issues?
Looking at the Sphinx forum, I have not found any postings indicating performance issues with real-time indexes in recent versions.
I would appreciate your thoughts/comments on the performance aspect of RT indexes.

Thanks & Regards
Shankar Krish

Sergey Nikolaev, October 26th, 2012 at 3:30 am

Hello Shankar

Please read our post http://www.ivinco.com/blog/sphinx-in-action-good-and-bad-in-sphinx-real-time-indexes/. RT indexes can actually perform just as well as traditional indexes; it’s all about a high enough rt_mem_limit value. But the cost of good RT index performance is possibly excessive RAM consumption.

Michael, April 6th, 2019 at 3:54 pm

How would one determine to use RT vs Delta?

Currently I have a Sphinx Index built on ~2 Million Records which I get from a data source(s) each day. I’ve been doing ‘Kill and Fill’ meaning I empty the db and populate it with the data feed and reindex each day. Which creates a massive delay between the actual DB and the Sphinx Queries.

The good news is I only need to update once a day and relatively few records get added, deleted or modified.

In a scenario like that would one be preferable to the other?

Sergey Nikolaev, April 8th, 2019 at 4:28 am

> How would one determine to use RT vs Delta?
In general, if you want your new data to be searchable instantly you use RT, and if you have to rebuild lots of data regularly you use plain. There are exceptions in both cases: e.g. with multi-row INSERTs an RT index can give decent indexing performance, and if you rebuild your delta every few seconds and have a well-designed set of multiple delta indexes (each one covering a bigger interval than the previous), a new document can become searchable within a few seconds with plain indexes too. It also depends on whether you prefer changing your application (to do INSERTs/UPDATEs/DELETEs from there) or maintaining cron tasks (to do regular delta index rebuilds), so there are many factors. But in most cases I recommend trying an RT index first, as it seems to be easier to maintain:
– no cron tasks for indexing
– no complex multiple-delta schema
– docs are searchable instantly

The only thing you need to worry about is running OPTIMIZE from time to time to merge your disk chunks into a bigger one for maximum performance. And of course you need to teach your app to do write queries.
