March 29, 2010 ・ Basics
How To Improve Sphinx Indexing Performance
Sphinx indexing often can be very “heavy” process for the system – it uses a lot of CPU and I/O resources. Rebuilding and merging indexes require twice as much disk space than the size of the index. Indexing can slow down other tasks on the server such as search or the performance of SQL queries. If you host your website on the same server indexing process can slow it down significantly. Let’s look at some ways to optimize Sphinx indexing to avoid such problems.
Data source related optimizations
Sphinx fetches documents from the specified source (such as MySQL or PostgreSQL) and splits the text into words to build indexes. This process usually can grow in size and occupy all the disk space. I won’t talk about database optimization here but make sure the SQL queries that select data for the index work fast. Here are some trivial things to look for:
Only index the fields, which you need to search. Do not include any large text fields which you will not be searching in.
If you have your texts compressed in your MySQL DB you can save some CPU on the db side/network if you use unpack_mysqlcompress Sphinx directive. This tells Sphinx to do the uncompressing on Sphinx side.
Keep the length of the index as low as possible. Say you have a large product catalog and you need search only for products which were added not earlier than a year ago – make sure your indexer SQL query has the corresponding WHERE condition.
When it makes sense break apart large indexes into smaller ones. For example: yearly data can be broken down by quarters to four indexes. Let’s say you know at what time which category of products was added – then you can search this data in only one of the four indexes. And you can use distributed index to search within year old data, this will include all four quarterly indexes.Note: Before deciding which approach to use: one monolithic index, or several distributed ones analyze the queries that your application will be sending to Sphinx. If the majority of queries will require data for entire scope you probably should not distribute the index – in some cases Sphinx performance may get worse if you split index into too many parts. Read more below to find out how the use of distributed indexes may affect the overall performance of queries.
Use ranged queries. With ranged queries Sphinx will be fetching the data in small steps, this significantly reduces the load on database and requires less memory.
Consider attribute sizes
In Sphinx each attribute by default takes 32 or 64 bits. There is an option to define specific size for the attributes. Knowing that Sphinx is usually used for indexing at least millions of documents this can save your disk space. So limit the size of attributes where possible. For example: the number of countries does not exceed the size of 16 bits and we can limit the attribute to 2 bytes. This reduces the size of attribute 2 times on a 32-bit platform and 4 times on a 64-bit box, which will significantly reduce the overall size of the index and often the size of Sphinx RAM usage.
Note: you can set size for all attributes this way, but make sure you monitor it so your data won’t go beyond the limits.
Performance of distributed indexes
You should know that performance of distributed index depends on the speed of the slowest agent. To avoid problems you should regularly test each agent and redistribute the load to other agents if needed. In a larger system with thousands of indexes, dozens of Sphinx agents and multiple servers it may be difficult to find the “slow agent”. But you can get a lot of information about the speed of queries from Sphinx logs. Regularly test Sphinx queries and analyze the logs.
When re-indexing, option inplace_enable allows to halve the disk space required for the indexer. But note that this option will slow down indexing speed by 5-10%. This option is useful on servers with a very limited amount of disk space. By default this option is disabled (and it only affects the indexer, not affecting searchd in any way).
With mem_limit option you can control the amount of memory allocated for indexing. The best option is to use 256Mb to 1024Mb. Too low values may lead to reduced speed of indexing. Too high can lead to MySQL timeouts. Since Sphinx will have to handle large buffer pool you may lose the connection. By default this parameter is set to 32Mb and maximum value is 2047Mb.
Option max_iops affects the number of I/O operations per second. If during indexing search performance is significantly reduced, it’s worth to set max_iops. By default this option is set to 0, i.e. unlimited. Also consider the max_iosize option – it allows you to set the size of the read/write buffer for indexing. By default this option is also 0, i.e. buffer size is not limited.
Note: you can use fio utility to determine the number of I/O operations per second and see the size of I/O cache. You can also find some stats using vmstat/iostat/dstat. Once you have identified the characteristics of your disk try to test the indexer and searchd with different max_iops and max_iosize settings.
I hope this has provided you with some ideas for the tweaking you can do to optimize Sphinx indexing process. To sum up,
Write your indexer SQL query carefully, do not include all the fields – choose only those that will be needed for search and use WHERE to not overload the index with unnecessary data.
Consider breaking up large monolithic index into several smaller ones. Use some criterion e.g. quarter or year.
Use ranged queries it will reduce the chance of locking the database during indexing.
Define custom bit sizes for your small attributes.
Test your Sphinx queries and analyze the logs.
Do not forget to check the Sphinx documentation for complete descriptions of the mentioned options and tips.