Sphinx: building a 1M-document index without a single real document

Hi guys

Just want to share an interesting trick for indexing something with Sphinx without having to populate a database with lots of data. Below is a full Sphinx config that lets you build a 1M-document index consisting of random 3-character words and one numeric attribute with a random value. All you need is a connection to any database (in this case ‘mysql -u root’ works).

source min
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = test
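    sql_attr_uint = attr
    # the range query returns the min and max document ids to loop over
    sql_query_range = select 1, 1000000
    # with a step of 1, every run of sql_query produces exactly one document
    sql_range_step = 1
    # $start doubles as the document id; body is 3 random hex chars, attr is a random number
    sql_query = select $start, mid(md5(rand()), 1, 3) body, floor(rand() * 100000 * $end) attr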
}
index idx
{
    path = idx
    source = min
}
searchd
{
    listen = 3306
    log = sphinx.log
    pid_file = sphinx.pid
}

As you can see, the trick is to use Sphinx’s sql_query_range and sql_range_step directives to make Sphinx loop until it has built a 1M-document collection. The drawback is that indexing is slower than actually fetching the same amount of data from a database, but come on, you’re not going to use this in production, right?
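To make the looping concrete, here is a rough Python simulation of what the indexer does with the ranged query. It is only a sketch of my understanding of the behaviour: the function names are made up for illustration, and the collection is cut down to 1,000 docs so it runs instantly.

```python
import hashlib
import random


def ranged_queries(min_id, max_id, step):
    """Yield ($start, $end) pairs, one per run of sql_query."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def fake_doc(start, end, rng):
    """Mimic one row of the sql_query: id, 3-char body, numeric attr."""
    body = hashlib.md5(str(rng.random()).encode()).hexdigest()[:3]
    attr = int(rng.random() * 100000 * end)
    return start, body, attr


rng = random.Random(42)
# sql_query_range = select 1, 1000 and sql_range_step = 1 (hypothetical smaller range)
docs = [fake_doc(s, e, rng) for s, e in ranged_queries(1, 1000, 1)]
```

With a step of 1 every pair has $start equal to $end, so each run of the query yields exactly one document whose id is $start.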

I hope you’ll find it helpful when you decide to play with Sphinx.

Correct ordering of search result set using Sphinx

Once you get results from Sphinx, it is important to maintain the way they are ordered. The typical search process used by Sphinx looks like the following:

Searching API

  • Searches the Sphinx index
  • Gets the IDs of matching records
  • Puts the IDs into a MySQL query to get the rest of the information
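The last step is where the ordering usually gets lost, because a plain WHERE id IN (...) returns rows in whatever order MySQL likes. Below is a minimal Python sketch of two common ways to keep the Sphinx order. This excerpt does not show which approach the full post uses, so treat both helpers and the "documents" table name as assumptions for illustration.

```python
def order_by_field_sql(table, ids):
    """Build a MySQL query that returns rows in the same order as `ids`."""
    if not ids:
        raise ValueError("no ids to fetch")
    # int() makes sure only numbers end up in the SQL text
    id_list = ", ".join(str(int(i)) for i in ids)
    return (
        f"SELECT * FROM {table} WHERE id IN ({id_list}) "
        f"ORDER BY FIELD(id, {id_list})"
    )


def reorder_rows(ids, rows):
    """Alternative: re-sort already fetched rows in the application."""
    by_id = {row["id"]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# ids in the order Sphinx ranked them (made-up values)
sphinx_ids = [7, 3, 9]
sql = order_by_field_sql("documents", sphinx_ids)
```

FIELD() is a MySQL function that returns the position of its first argument in the list that follows, which is why sorting by it reproduces the original ranking.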

Read the rest of this entry »

Sphinx in Action: Top related queries

This is a post from a series of posts called “Sphinx in Action”. See the whole series by clicking here.

Many of our customers want to improve their search engine by showing users top or related queries from previous days or weeks, so that users can refine their searches. When users click one of these links, they are taken to the corresponding search results page.

This works by storing all user queries in your database and then indexing the queries from the last N days with Sphinx. Then you simply run a query against the index to find the most popular related queries that match (or are similar to) the current query.
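As a rough illustration of the idea, here is a small Python sketch that does the same thing over an in-memory query log. It is a stand-in, not Sphinx: "related" here simply means sharing at least one word with the current query, and the function name, log format and defaults are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def related_queries(log, current, now, days=7, limit=5):
    """Rank logged queries from the last `days` days that share a word with `current`."""
    cutoff = now - timedelta(days=days)
    current_norm = current.lower().strip()
    words = set(current_norm.split())
    counts = Counter()
    for query, when in log:
        q = query.lower().strip()
        # skip old entries and the current query itself
        if when < cutoff or q == current_norm:
            continue
        if words & set(q.split()):
            counts[q] += 1
    return [q for q, _ in counts.most_common(limit)]


now = datetime(2024, 1, 10)
log = [
    ("red shoes", now - timedelta(days=1)),
    ("red shoes", now - timedelta(days=2)),
    ("red hat", now - timedelta(days=1)),
    ("blue car", now - timedelta(days=1)),
    ("red dress", now - timedelta(days=30)),
]
top = related_queries(log, "red", now)
```

In a real setup the word matching and the popularity ranking would be done by the Sphinx index rather than by a loop like this.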
Here are a few tricks to add these features in the most efficient way:
Read the rest of this entry »

Sphinx in Action: Fuzzy matching and 2nd pass query

This is a post from a series of posts called “Sphinx in Action”. See the whole series by clicking here.

Many customers we have helped integrate search into their applications wanted their search to be more intelligent than just strictly matching a query against documents.
There are many ways to do this. Sphinx makes it very easy, as fuzzy matching is included out of the box. It consists of two main components:
Read the rest of this entry »

Sphinx in Action: How Sphinx handles text during indexing

This is a post from a series of posts called “Sphinx in Action”. See the whole series by clicking here.

When integrating full-text search into your application, it’s important to think about tokenizing your text, that is, how it gets split up into words. This is the first process your data goes through once you’ve built the config and begun indexing.
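To show what tokenizing means in practice, here is a tiny Python sketch of a tokenizer. It is not how Sphinx does it internally; the two parameters are only loosely modeled on Sphinx options such as min_word_len and charset_table, and their names and defaults here are illustrative.

```python
import re


def tokenize(text, min_word_len=1, extra_chars=""):
    """Lowercase, split on anything outside the allowed charset, drop short words."""
    # letters, digits and underscore are always word characters;
    # extra_chars lets the caller keep characters such as "&" inside words
    allowed = "a-z0-9_" + re.escape(extra_chars)
    tokens = re.split(f"[^{allowed}]+", text.lower())
    return [t for t in tokens if t and len(t) >= min_word_len]


default_tokens = tokenize("Hello, World! AT&T")
custom_tokens = tokenize("Hello, World! AT&T", extra_chars="&")
```

The same text produces different words depending on which characters count as part of a word, and that is exactly why these settings matter for what your users can later find.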

Sphinx includes a variety of settings that give you full control of the way your text gets tokenized. They are:

Read the rest of this entry »
