Five ways to configure the Sphinx search engine

If you have ever used Sphinx Search, you have probably tried one of the configurations listed below. Each suits a particular type of project. Let’s take a closer look at them.

Before we start, I’d like to point out that when deciding on a Sphinx Search configuration for your project, you should ask the following questions:

How large is the amount of data you want to make searchable?
How fast is data growth in your system?
What are your hardware capabilities (number of CPUs, memory, network)?
How many search queries does your system need to serve?

For each Sphinx Search configuration below, I explain its advantages and disadvantages.

1. Single index

The single index configuration is the simplest one. It is suitable for most websites and projects with up to 100,000 or so pages. Content such as posts, comments, or pages can be put into a single index using a simple SQL query that fetches the necessary data fields. To keep the index up to date, you should set up periodic re-indexing via cron, and that is all the work needed to set up Sphinx.
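As a rough illustration, a single-index setup boils down to one source block and one index block in sphinx.conf. The Python sketch below renders such a fragment; the function name, the index name "site", the credentials, and the index directory are all assumptions made for the example, not values from any real installation.

```python
def render_single_index(name, sql_query, db, index_dir="/var/lib/sphinx"):
    """Render a minimal sphinx.conf fragment: one MySQL source, one index.

    `name`, `db` and `index_dir` are hypothetical inputs for this sketch.
    """
    source = "\n".join([
        f"source {name}_src",
        "{",
        "    type      = mysql",
        f"    sql_host  = {db['host']}",
        f"    sql_user  = {db['user']}",
        f"    sql_pass  = {db['password']}",
        f"    sql_db    = {db['database']}",
        # The first column returned by sql_query is used as the document ID.
        f"    sql_query = {sql_query}",
        "}",
    ])
    index = "\n".join([
        f"index {name}",
        "{",
        f"    source = {name}_src",
        f"    path   = {index_dir}/{name}",
        "}",
    ])
    return source + "\n\n" + index + "\n"


if __name__ == "__main__":
    print(render_single_index(
        "site",
        "SELECT id, title, body FROM pages",
        {"host": "localhost", "user": "sphinx",
         "password": "secret", "database": "site"},
    ))
```

With a fragment like this in place, the periodic re-indexing mentioned above would typically be a cron entry that runs `indexer --all --rotate`.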

Advantages:

Simple to configure
Simple to use
It comes almost out of the box with Sphinx Search

Disadvantages:

Requires full data re-indexing to keep up with data changes and new content
In larger projects (more than 100,000 documents), re-indexing might take too long, making this configuration unsuitable

2. Main + delta scheme

This configuration uses two indexes. Its goal is to make index updates fast and easy with Sphinx Search, even for larger amounts of data.

When you have more documents (say, more than 100,000 pages) and your data grows continuously with frequent content updates (e.g. a large forum or a news website), it is best to implement the so-called “Main+Delta” scheme, which uses two indexes. Main is the core index holding most of the data; it is updated infrequently and grows over time. Delta is an incremental index containing only the latest information that is not yet covered by the Main index.

To keep the latest documents searchable, we need to rebuild indexes very often. With large amounts of data, a full re-index is not practical because it can take several hours or even days. With this configuration, only the Delta index needs frequent rebuilding, which takes seconds or minutes, so your search engine always has fresh data.

To keep the Delta index small, you have to periodically append it to the Main index and reset it. Since Sphinx version 0.9.9 you can merge indexes, so it is not necessary to rebuild the Main index each time: you can just merge Main and Delta.
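One common way to split documents between Main and Delta is a small counter table that remembers the highest document ID already covered by Main. The sketch below simulates that logic with Python's built-in sqlite3 module so it can be run without Sphinx or MySQL. The table and column names (documents, sph_counter, counter_id, max_doc_id) are assumptions for the example, and it only handles newly added documents, not edits to old ones.

```python
import sqlite3

# Sub-select returning the highest document ID already covered by Main.
COUNTER = "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)"

# Run before Main is rebuilt, or after Delta has been merged into Main.
MOVE_COUNTER = "REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents"

# Main indexes everything up to the counter, Delta everything after it.
MAIN_QUERY = f"SELECT id FROM documents WHERE id <= {COUNTER} ORDER BY id"
DELTA_QUERY = f"SELECT id FROM documents WHERE id > {COUNTER} ORDER BY id"


def create_tables(conn):
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE sph_counter "
        "(counter_id INTEGER PRIMARY KEY, max_doc_id INTEGER)")


def add_documents(conn, titles):
    conn.executemany(
        "INSERT INTO documents (title) VALUES (?)", [(t,) for t in titles])


def fetch_ids(conn, query):
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tables(conn)

    add_documents(conn, ["first", "second", "third"])
    conn.execute(MOVE_COUNTER)                      # full build of Main
    print("main: ", fetch_ids(conn, MAIN_QUERY))

    add_documents(conn, ["fourth", "fifth"])        # new content arrives
    print("delta:", fetch_ids(conn, DELTA_QUERY))

    conn.execute(MOVE_COUNTER)                      # after merging Delta
    print("delta after reset:", fetch_ids(conn, DELTA_QUERY))
```

In a real setup the two SELECT statements would go into the sql_query of the Main and Delta sources, and the merge step itself would be done with `indexer --merge main delta --rotate` before moving the counter.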

To use the search you should query both indexes:

$sphinxClient->Query("your search query", "main delta");

Hint: You can also use a distributed index which will unite Main and Delta indexes.
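A distributed index that unites Main and Delta is only a few lines of configuration. The sketch below renders one from Python; the function name and the index name "all_docs" are made up for the example.

```python
def render_local_distributed_index(name, local_indexes):
    """Render a distributed index that unites indexes served by the same
    searchd instance. All names passed in are hypothetical examples."""
    lines = [f"index {name}", "{", "    type  = distributed"]
    # One `local` line per index that should be searched.
    lines += [f"    local = {index_name}" for index_name in local_indexes]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_local_distributed_index("all_docs", ["main", "delta"]))
```

Client code could then query "all_docs" alone instead of listing "main delta" in every call.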

Advantages:

It is still simple to configure
Sphinx index updates are fast
It comes almost out of the box with Sphinx Search

Disadvantages:

Requires more time to configure
Requires periodically merging the Main and Delta indexes

3. Multiple indexes

You may need this configuration when you have multiple data sources and need to search across all of them. It is also useful when you use a sharded database. Once the indexes are built, you can unite them with a distributed index and use that for search queries.

This scheme allows you to search across all indexes as well as in each one separately.

When you use Sphinx to search across all indexes, you must make sure that document IDs are unique across all the indexes included in the distributed index, so that they do not overlap.
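One simple way to get non-overlapping IDs, assuming every source has positive integer row IDs, is to multiply each row ID by the number of sources and add the source's own number. This is only one possible scheme, and the function names below are invented for the sketch.

```python
def global_id(local_id, source_no, source_count):
    """Map a per-source row ID to an ID that is unique across all sources.

    source_no is 0-based; local_id is assumed to be a positive integer.
    """
    if not 0 <= source_no < source_count:
        raise ValueError("source_no must be in range(source_count)")
    return local_id * source_count + source_no


def split_global_id(doc_id, source_count):
    """Recover (local_id, source_no) from an ID built by global_id()."""
    return divmod(doc_id, source_count)


def id_expression(source_no, source_count, id_column="id"):
    """The same mapping as a SQL expression for a source's indexing query."""
    return f"{id_column} * {source_count} + {source_no} AS {id_column}"


if __name__ == "__main__":
    # Row 7 exists in three different sources, yet the IDs do not collide.
    print([global_id(7, n, 3) for n in range(3)])
    print(split_global_id(22, 3))
    print("SELECT " + id_expression(1, 3) + ", title FROM comments")
```

The search results then carry enough information to tell which source a match came from and which row to load.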

Advantages:

Simple search among several databases even on different servers

Disadvantages:

Requires keeping document IDs unique across indexes to prevent data overlap
May require a special script to generate the Sphinx configuration file

4. Multiple Sphinx instances

Note: Starting with version 1.10-beta, Sphinx supports multithreading by itself, which should solve the performance problems discussed in this section.

To use the full power of a multi-core CPU, it makes sense to run multiple instances of Sphinx. For this you need to implement the multiple-index scheme, where each index is responsible only for its own portion of the data. For example, to use a four-core CPU, one option is to cover the data with four indexes, each served by a separate searchd process. Each Sphinx instance will then use one of the CPU cores.

Each Sphinx instance will need its own SQL query for indexing, with a condition that selects 1/4 of all the data.
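A straightforward way to pick "1/4 of all data" is the remainder of the document ID divided by the number of instances. The sketch below builds one indexing query per instance; it assumes the base query has no WHERE clause of its own, and the table and column names are examples only.

```python
def shard_queries(base_query, shard_count, id_column="id"):
    """Build one indexing query per shard using `id % shard_count`.

    Assumes `base_query` has no WHERE clause of its own.
    """
    return [
        f"{base_query} WHERE {id_column} % {shard_count} = {shard_no}"
        for shard_no in range(shard_count)
    ]


if __name__ == "__main__":
    for query in shard_queries("SELECT id, title, body FROM documents", 4):
        print(query)
```

Because every ID has exactly one remainder, each document ends up in exactly one of the four indexes.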

We will also need a fifth (master) Sphinx instance, which uses a distributed index to unite all the other instances.

Hint: The master Sphinx instance can be placed within one of the data instances to reduce the number of Sphinx instances.

Another way to implement this is to use Sphinx agents within a single Sphinx instance that accesses itself. This gives you the multiprocessing we need here.
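In both variants the master side is a distributed index whose agent lines point at the searchd processes holding the data. A hedged sketch of generating it follows; the ports 9313 to 9316 and the index names shard0 to shard3 are invented for the example.

```python
def render_master_index(name, agents):
    """Render a distributed index that fans out to searchd agents.

    `agents` is a list of (host, port, index_name) tuples; every value
    used in the demo below is a hypothetical example.
    """
    lines = [f"index {name}", "{", "    type  = distributed"]
    for host, port, index_name in agents:
        # Each agent line has the form host:port:index.
        lines.append(f"    agent = {host}:{port}:{index_name}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Four data instances on one machine, each listening on its own port.
    agents = [("localhost", 9313 + n, f"shard{n}") for n in range(4)]
    print(render_master_index("master", agents))
```

For the single-instance variant, the agent lines would all point at the port of the instance itself.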

Advantages:

High search performance on multi-CPU systems
Good scaling opportunity for your site/project

Disadvantages:

Not easy to implement
May require a special script to generate the Sphinx configuration file(s)

Hint: Putting multiple indexes on separate disks can noticeably speed up data reads and indexing.

5. Sphinx Search Cluster

A Sphinx cluster is a set of multiple servers, each with the configuration described in section four.

In this configuration each server has its own master Sphinx instance, and one of the servers is chosen as a forwarder. The forwarder’s task is to distribute requests between all servers, using distributed indexes to send each request to the master instance of each server.
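Conceptually, the forwarder sends the same query to every server and then merges the partial result sets into one ranked list. Sphinx's distributed index handles this by itself; the toy sketch below only illustrates the flow and is not Sphinx's implementation. The (weight, document ID) tuples and their values are made up.

```python
import heapq


def merge_results(per_server_matches, limit):
    """Merge partial result sets into one list of the best `limit` matches.

    Each match is a (weight, doc_id) tuple; a higher weight ranks first.
    """
    all_matches = (
        match for matches in per_server_matches for match in matches
    )
    return heapq.nlargest(limit, all_matches)


if __name__ == "__main__":
    server_a = [(2500, 11), (1200, 14)]   # matches from the first server
    server_b = [(3100, 27), (900, 22)]    # matches from the second server
    print(merge_results([server_a, server_b], limit=3))
```

The sketch also shows why document IDs have to be unique across servers: after merging, the ID is the only thing that identifies a match.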

You can also implement this without master instances on each server, but in that case the forwarder needs to know about all the instances on every server in the cluster.

When using a Sphinx cluster, consider your network speed: sometimes the network can become the bottleneck of the whole cluster.

This configuration enables fast search over very large amounts of data, but it is not simple to implement: you will need to develop a solid framework for distributing data among all the instances in the cluster.

Advantages:

High search performance
Very scalable configuration

Disadvantages:

Not easy to implement
Requires a special script to generate Sphinx configuration files
Search speed depends on the network speed (if the servers are not in the same LAN)

These are, of course, not all the possible Sphinx configurations, but if you are going to deploy the Sphinx search engine for the first time, you should definitely consider the methods above.