This is a just a quick note I’d like to share since this might be confusing how one should do “group by” mva attribute in SphinxQL. Starting v 2.0.1-beta there’s a compat_sphinxql_magics directive which makes SphinxQL similar to standard SQL although there’re still few things that differ especially when it comes to MVA which is a bit unusual thing for standard SQL. Anyway back to the problem: if you have MVA attribute ‘tags’ with some integers and want to group by ‘tags’ your first intuitively written command will be probably like this:
mysql> select tags, count(*) c from idx where match('word') group by tags order by c desc limit 10;
+--------------+------+
| tags | c |
+--------------+------+
| 210,348 | 366 |
| 204 | 116 |
| 206 | 73 |
| 132,348 | 71 |
| 210,348 | 40 |
| 29 | 36 |
| 25,29,270 | 30 |
| 208 | 28 |
| 180 | 24 |
| 25,348 | 23 |
+--------------+------+
10 rows in set (0.00 sec)
But as you can see the first column contains few values instead of one that you want. To see the needed value used by Sphinx for grouping you need to use special @groupby word:
mysql> select @groupby, tags, count(*) c from idx where match('word') group by tags order by c desc limit 10;
+----------+--------------+------+
| @groupby | tags | c |
+----------+--------------+------+
| 348 | 210,348 | 366 |
| 204 | 204 | 116 |
| 206 | 206 | 73 |
| 132 | 132,348 | 71 |
| 210 | 210,348 | 40 |
| 29 | 29 | 36 |
| 270 | 25,29,270 | 30 |
| 208 | 208 | 28 |
| 180 | 180 | 24 |
| 25 | 25,348 | 23 |
+----------+--------------+------+
10 rows in set (0.00 sec)
Now it’s clear that the ‘tags’ was not changed during grouping at all, this is just what you have in one record our of the group while the value which the dataset was grouped by can be seen in the @groupby column.
Hope this will help someone to save some time when he faces this situation.
January 31st, 2012 in
Sphinx search engine,
Tips | tags:
GROUP BY,
MVA,
SphinxQL,
Tips Author: Sergey Nikolaev|
No Comments
Ivinco is a virtual company – we have no office, all our employees work from their homes. Being virtual allows us to keep our company costs low and it also allows us to hire talents all around the world. However, it’s very important for people who work together to meet in person regularly. So every year we organize company meetings in cool places where our staff can meet and spend some time together to work and have fun.
This year we went to Turkey, to a beautiful 5 star all-inclusive hotel on the cost of Mediterranean Sea. When we arrived Turkey met us with a near-hurricane weather, which was a bit dissapointing (but allowed us to concentrate on our work plans better
), but after a few days it was all nice and sunny and we were able to relax on the beach and took one day-off our schedule for a boat trip near the old Alanya city with breathtaking nature.

Read the rest of this entry »
December 9th, 2011 in
News | tags:
meeting Author: Sergey Nikolaev|
No Comments
If you are using the Sphinx server with many indexes and you decide to move to real-time indexes then this article is for you. I will describe how to simply convert a large number of original indexes to real-time using indexer and the new Sphinx command ATTACH INDEX.
Before Sphinx version 2.0.2-beta was available, the only way to update real-time (RT) indexes was to use the SphinxQL commands (INSERT/DELETE/REPLACE) which needed to be executed through the MySQL protocol to update RT indexes. At the same time, the original indexes used the indexer tool, which made updating indexes very simple. However, this kind of tool doesn’t exist for real-time indexes.
The only solution was to write a custom script.
Why isn’t a custom script good?
If we compare a custom script to the indexer: indexer was written in C++, it was very well optimized, and tested by many users on different amounts of data. If you want to write a custom script, do so using the language you know best, probably PHP/Ruby/Python, you will need to take care of:
- long term stability
- memory leaks
- data preparation and escaping
- performance optimization, nobody would wait a month to index a terabyte of data
Furthermore, this kind of script will load searchd which causes RT indexes to use a massive amount of memory. For example, 300 indexes with 100mb rt_mem_limit could require up to 30GB of free RAM.
Good news! Sphinx version 2.0.2-beta has added a new command which allows you to convert original indexes into real-time indexes. This command looks like: ATTACH INDEX diskindex TO RTINDEX rtindex.
After a successful ATTACH, the data originally stored in the source disk index becomes a part of the target RT index, and the source disk index becomes unavailable (until the next rebuild). ATTACH does not result in any index data changes. Basically, it just renames the files (making the source index a new disk chunk of the target RT index), and updates the metadata. So it is a generally quick operation which might (frequently) complete as fast as under a second.
So, now using the MySQL protocol you can convert the original index into a real-time one. In order to achieve this, you could write custom script which will execute the attach command for each index using MySQL client, but I found a better way.
I suggest solving this task in a more traditional way using the indexer tool.
To keep the example simple, I will show you how to convert just one index, you can use the code below to extend your sphinx.conf.
Read the rest of this entry »
December 2nd, 2011 in
Sphinx search engine | tags:
Sphinx Search Author: Yaroslav Vorozhko|
No Comments
Hi. Here’s just a simple trick which can be useful if you run some long-lasting mysql/sphinx query in a screen or whatever and want to get informed as soon as it’s finished. This works on mysql client level and applicable to MySQL and SphinxQL:
The trick is to use “pager” directive to redirect the output to a mail program:
mysql> pager mail -s "subject" yourname@yourdomain.com
PAGER set to 'mail -s "subject" yourname@yourdomain.com'
The above will redirect the output to yourname@yourdomain.com and the subject will be “subject”.
Then just start your long-lasting query to MySQL:
mysql> select count(*) from feed where ext_key like 'a%';
1 row in set (10 min 35.88 sec)
or SphinxQL:
mysql> select * from huge_index;
20 rows in set (1 min 14.00 sec)
and be informed via email.
November 22nd, 2011 in
Tips | tags:
Tips Author: Sergey Nikolaev|
No Comments
The Sphinx team recently published new article about Sphinx memory consumption.
It provides the formula to estimate the memory consumption for RT indexes:
For RT-index you can estimate memory consumption by calculating the size of all on-disk chunks (minus .spd & .spp sizes as noted above) plus RAM-chunk size (rt_mem_limit)
The formula looks good, but notice that they also added size of RAM-file into it.
What is that RAM-file?
The RAM-file is used to store all necessary data for Sphinx RT-indexes.
Sphinx keeps the RAM-file in memory to support real time updates for RT-indexes.
Before the RAM-file data is copied to a general chunk file, the RAM-file should reach rt_mem_limit in size.
Each RT-index has a personal RAM file.
So, what’s wrong with the RAM-file?
Imagine we have 30 indexes each 3Gb in size.
If we want to keep number of index-chunks low, i.e. less than 5, we need to set rt_mem_limit to 1Gb.
In this case we will have 3 chunks for each index.
Now let’s estimate how much memory we need to support this configuration.
For 30 indexes we will have 30 RAM-files multiplied by 1Gb (rt_mem_limit) which will make 30Gb.
30Gb of free memory is required to support our configuration and that’s without counting .spa and .spi files.
Of course in a real system with random data distribution Sphinx will consume much less memory, probably 1.5 times less.
Ok, lets decrease rt_mem_limit 10 times to 100Mb.
In this case we will get 3Gb memory required, but the number of chunks for each index will grow from
3 to 30 (overall it is 900 chunks for 30 indexes).
Now let’s imagine how fast Sphinx will query each of 30 or more chunks.
It will be very slow, because of many disk I/O operations, especially if we query more than one index at a time!
Conclusion
Low rt_mem_limit is good for memory, but hurts Sphinx performance.
High rt_mem_limit could gives good performance, but requires a lot of free memory.
The Sphinx team definitely needs to optimize this feature, i.e. add a new option to control RAM-file size (i.e. rt_RAM_limit).
rt_RAM_limit in conjunction with rt_mem_limit could give more flexibility to better setup the massive number of RT-indexes
on an average server with 16Gb of memory on board.
Another option is to exclude .spp and .spd data from RAM-file, so it keeps only those things in memory that are supposed to be there – .spa and .spi.