<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ivinco</title>
	<atom:link href="http://www.ivinco.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ivinco.com</link>
	<description>Advanced Web Development Services</description>
	<lastBuildDate>Fri, 03 May 2013 22:32:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Sphinx: building 1M docs index having no one real doc</title>
		<link>http://www.ivinco.com/blog/sphinx-building-1m-docs-index-having-no-one-real-doc/</link>
		<comments>http://www.ivinco.com/blog/sphinx-building-1m-docs-index-having-no-one-real-doc/#comments</comments>
		<pubDate>Tue, 12 Feb 2013 07:06:47 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Configuration]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Sphinx]]></category>
		<category><![CDATA[Sphinx Search]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1635</guid>
		<description><![CDATA[Hi guys Just want to share an interesting trick for easily indexing something with Sphinx without populating a database with a lot of data or anything like that. Below is a full Sphinx config that lets you build a 1M-document index consisting of random 3-char words and one numeric [...]]]></description>
			<content:encoded><![CDATA[<p>Hi guys</p>
<p>Just want to share an interesting trick for easily indexing something with Sphinx without populating a database with a lot of data or anything like that. Below is a full Sphinx config that lets you build a 1M-document index consisting of random 3-char words and one numeric attribute with a random value. All you need is a connection to any db (in this case &#8216;mysql -u root&#8217; works).</p>
<pre class="brush:bash">
source min
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = test
    sql_attr_uint = attr
    sql_query_range = select 1, 1000000
    sql_range_step = 1
    sql_query = select $start, mid(md5(rand()), 1, 3) body, floor(rand() * 100000 * $end) attr
}
index idx
{
    path = idx
    source = min
}
searchd
{
    listen = 3306
    log = sphinx.log
    pid_file = sphinx.pid
}
</pre>
<p>As you can see, the tricky part is to use Sphinx&#8217;s sql_query_range and sql_range_step directives to make Sphinx loop until it has built a 1M-document collection. The drawback is that indexing is slower than actually fetching the same amount of data from a db, but come on, you&#8217;re not going to use this in production, right?</p>
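<p>To make it clearer what the ranged query produces, here is a rough Python sketch of the same loop. It is only a model of the config above, not something Sphinx runs: the function name generate_docs and the seed parameter are made up for illustration, and it assumes that with sql_range_step = 1 each step covers exactly one id, so $end equals $start.</p>

```python
import hashlib
import random


def generate_docs(first_id, last_id, seed=None):
    """Yield (id, body, attr) tuples, one per range step, like the config above."""
    rng = random.Random(seed)
    for doc_id in range(first_id, last_id + 1):
        # mid(md5(rand()), 1, 3): first three hex characters of an MD5 digest
        body = hashlib.md5(str(rng.random()).encode()).hexdigest()[:3]
        # floor(rand() * 100000 * $end); assumed: $end equals $start for a step of 1
        attr = int(rng.random() * 100000 * doc_id)
        yield doc_id, body, attr


# A tiny range instead of 1..1000000, just to look at the shape of the data.
docs = list(generate_docs(1, 5, seed=42))
for doc in docs:
    print(doc)
```

<p>Each tuple corresponds to one document: the id, a random 3-char word for the full-text field and a random value for the numeric attribute.</p>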
<p>I hope you&#8217;ll find it helpful when you decide to play with Sphinx.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-building-1m-docs-index-having-no-one-real-doc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Correct ordering of search result set using Sphinx</title>
		<link>http://www.ivinco.com/blog/correct-ordering-of-search-result-set-using-sphinx/</link>
		<comments>http://www.ivinco.com/blog/correct-ordering-of-search-result-set-using-sphinx/#comments</comments>
		<pubDate>Fri, 01 Jun 2012 11:14:53 +0000</pubDate>
		<dc:creator>Nikita</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[results ordering]]></category>
		<category><![CDATA[Sphinx]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1485</guid>
		<description><![CDATA[Once you get results from Sphinx, it is important to maintain the way they are ordered. The typical search process used by Sphinx looks like the following: Searching API Searches in Sphinx index gets id’s of matching records puts ids into MySQL query to get the rest of the information. In order to demonstrate this, [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Once you get results from Sphinx, it is important to preserve the order they came in. A typical search flow with Sphinx looks like this:</strong></p>
<p>The search API:</p>
<ul>
<li>searches the Sphinx index</li>
<li>gets the ids of the matching records</li>
<li>puts the ids into a MySQL query to get the rest of the information.</li>
</ul>
<p><span id="more-1485"></span><br />
<strong id="internal-source-marker_0.4705976468976587"><br />
To demonstrate this, I used the following table in my development environment:</strong></p>
<pre class="brush:bash">
CREATE TABLE `some_table` (
`id` int(10) unsigned NOT NULL auto_increment,
`some_text` text,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
</pre>
<p>The table was filled with the following rows:</p>
<pre class="brush:bash">
mysql&gt; select * from some_table;
+----+----------------+
| id | some_text      |
+----+----------------+
|  2 | test text      |
|  4 | text test      |
|  6 | something else |
|  8 | new row        |
| 10 | new text       |
| 12 | empty field    |
| 14 | old text       |
| 16 | result row     |
| 18 | another row    |
+----+----------------+
</pre>
<p>The Sphinx config file looks like this:</p>
<pre class="brush:bash">
source min
{
type = mysql
sql_host = localhost
sql_user =
sql_pass =
sql_db =
sql_query = select * from some_table;
}

index idx_min
{
path = idx/idx
source = min
}

searchd
{
listen = 39306:mysql41
log = logs/sphinx.log
pid_file = sphinx.pid
}
</pre>
<p>So, we get the results by using SphinxQL:</p>
<pre class="brush:bash">
mysql&gt; select * from idx_min where match('text | new');
+------+--------+
| id   | weight |
+------+--------+
|   10 |   1588 |
|    8 |   1568 |
|    2 |   1520 |
|    4 |   1520 |
|   14 |   1520 |
+------+--------+
5 rows in set (0.00 sec)
</pre>
<p>If those ids are put directly into a MySQL query, the results won’t keep the order Sphinx returned them in (in this case, by relevance):</p>
<pre class="brush:bash">
mysql&gt; select * from some_table where id in (10,8,2,4,14);
+----+-----------+
| id | some_text |
+----+-----------+
|  2 | test text |
|  4 | text test |
|  8 | new row   |
| 10 | new text  |
| 14 | old text  |
+----+-----------+
</pre>
<p>In this situation, the MySQL ‘ORDER BY FIELD()’ clause can be helpful:</p>
<pre class="brush:bash">
mysql&gt; select * from some_table where id in (10,8,2,4,14) ORDER BY FIELD(id, 10,8,2,4,14);
+----+-----------+
| id | some_text |
+----+-----------+
| 10 | new text  |
|  8 | new row   |
|  2 | test text |
|  4 | text test |
| 14 | old text  |
+----+-----------+
</pre>
<p>By using the method outlined in this brief tutorial, you can keep the order of the results.</p>
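<p>In an application you would normally build that query from the id list Sphinx returned. Below is a minimal Python sketch of how that might look. The helper name build_ordered_query is made up for this example, and the %s placeholders assume a MySQL driver that uses the "format" parameter style, so adjust them to whatever your driver expects.</p>

```python
def build_ordered_query(ids, table="some_table"):
    """Return (sql, params) selecting rows by id in the order given by `ids`."""
    if not ids:
        raise ValueError("ids must not be empty")
    placeholders = ", ".join(["%s"] * len(ids))
    sql = (
        f"SELECT * FROM {table} WHERE id IN ({placeholders}) "
        f"ORDER BY FIELD(id, {placeholders})"
    )
    # The id list is bound twice: once for IN (...), once for FIELD(...).
    return sql, list(ids) + list(ids)


# Ids in the order Sphinx returned them (by relevance).
sql, params = build_ordered_query([10, 8, 2, 4, 14])
print(sql)
print(params)
```

<p>The table name is interpolated directly, so it should come from your own code and never from user input; only the ids are passed as bound parameters.</p>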
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/correct-ordering-of-search-result-set-using-sphinx/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: Top related queries</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-top-related-queries/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-top-related-queries/#comments</comments>
		<pubDate>Fri, 13 Apr 2012 06:36:06 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Real Time indexes]]></category>
		<category><![CDATA[RT indexes]]></category>
		<category><![CDATA[Sphinx in action]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1084</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. Many of our customers want to improve their search engine by allowing users to refine their searches by showing them top or related queries from previous days or weeks. When they click on these links, they [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>Many of our customers want to improve their search engine by letting users refine their searches with top or related queries from previous days or weeks. When users click on these links, they are taken to the corresponding search results page.</p>
<p>This works by storing all user queries in your database and then indexing the last N days of them with Sphinx. Then you simply query the index to find the most popular related queries that match (or are similar to) the current query.<br />
Here are a few tricks to add these features in the most efficient way:<br />
<span id="more-1084"></span></p>
<ol>
<li>Use the hash of the query text as the id. This lets you avoid grouping and therefore decreases the response time: Sphinx will ignore duplicate ids during the indexing stage or when you insert (if you are using a real time index).</li>
<li>Some similar queries might differ slightly, but not significantly (e.g. &#8216;MYSQL innodb&#8217;, &#8216;mysql innodb&#8217; and &#8216;mysql, innodb&#8217; mean the same). You probably want all of them to be merged into one. To do this, you need to normalize the texts. This is where you can use the &#8216;call keywords()&#8217; command (named BuildKeywords() in the Sphinx API). It tokenizes the query exactly the same way Sphinx tokenizes text while indexing, so all you need to do is concatenate the returned keywords and you will have the query cleaned of everything insignificant.</li>
</ol>
<p>So your index structure would be like this:</p>
<pre class="brush:bash">
id: hash (normalized query)
field: normalized query
attr: count
attr: id in db
</pre>
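<p>The normalization and hashing steps above can be sketched in Python. Note that this is only a stand-in: the real normalization should come from &#8216;call keywords()&#8217; so that it matches your index settings, while normalize_query below simply lowercases the text and keeps letters and digits. Both function names and the choice of CRC32 are assumptions made for the example.</p>

```python
import re
import zlib


def normalize_query(text):
    """Lowercase the query and keep only letter/digit tokens, joined by spaces.

    A simplified stand-in for the keywords Sphinx itself would return.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(tokens)


def query_doc_id(text):
    """Document id: CRC32 of the normalized query (unsigned 32-bit integer)."""
    return zlib.crc32(normalize_query(text).encode("utf-8"))


variants = ["MYSQL innodb", "mysql innodb", "mysql, innodb"]
ids = {query_doc_id(v) for v in variants}
print(normalize_query(variants[2]), ids)
```

<p>All three variants normalize to the same text and therefore get the same id, which is what lets the duplicates collapse into one document.</p>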
<ol>
<li>You can use the main+delta scheme to make indexing faster.</li>
<li>You can use UpdateAttributes() (or the corresponding SphinxQL &#8216;UPDATE&#8217; query) to increment the query hit count and reindex less frequently. This is especially useful if you don&#8217;t have many new queries coming in and mainly need to update the existing ones. Keep one thing in mind in this case: if you have a lot of hits per second you might run into a race condition, since Sphinx cannot increment an attribute in place; it can only set a new value. You therefore have to increment on the application side, and if several processes do this simultaneously you could end up with fewer increments than expected.</li>
<li>If you have a lot of memory and performance matters more to you than memory consumption, you can put the query in its original state (before the normalization) into a string attribute. This way you won&#8217;t need an additional query to the db to find the query text, and you won&#8217;t need to store the query id in the index either. The structure in this case is:</li>
</ol>
<pre class="brush:bash">
id: hash (normalized query)
field: normalized query
attr: count
attr: original query
</pre>
<p>Since the queries index is usually not very large, you can use a real time index, and then you won&#8217;t need to reindex it at all. If you do this, you might also want to periodically delete older queries. A good way to support this is to base the id not on hash(query) alone but on hash(query + date), and to store hash(query) as an additional attribute along with &#8216;date&#8217;.</p>
<p>So your index structure will be:</p>
<pre class="brush:bash">
id: hash (normalized query concatenated with date)
field: normalized query
attr: count
attr: hash(normalized query)
attr: date
attr: id in db / original query
</pre>
<p>Then you need to do two more things in the Sphinx query:</p>
<ol>
<li>Filter out older queries</li>
<li>Group by the hash (normalized query) and sort by sum(count)</li>
</ol>
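<p>To illustrate what these two steps compute, here is a small Python model of the filter-then-group logic. In practice Sphinx does this for you in the query; the function top_queries and the row layout below are invented for the example and only mirror the index structure described above.</p>

```python
from collections import defaultdict
from datetime import date


def top_queries(rows, oldest_allowed, limit=10):
    """Rank queries by summed count, ignoring rows older than `oldest_allowed`.

    rows: iterable of dicts with keys query_hash, query, count, date.
    """
    totals = defaultdict(int)
    texts = {}
    for row in rows:
        if row["date"] >= oldest_allowed:  # step 1: filter out older queries
            totals[row["query_hash"]] += row["count"]  # step 2: group and sum
            texts[row["query_hash"]] = row["query"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(texts[h], total) for h, total in ranked[:limit]]


rows = [
    {"query_hash": 1, "query": "mysql innodb", "count": 5, "date": date(2012, 4, 10)},
    {"query_hash": 1, "query": "mysql innodb", "count": 7, "date": date(2012, 4, 11)},
    {"query_hash": 2, "query": "mysql myisam", "count": 9, "date": date(2012, 4, 11)},
    {"query_hash": 2, "query": "mysql myisam", "count": 50, "date": date(2012, 3, 1)},
]
result = top_queries(rows, oldest_allowed=date(2012, 4, 5))
print(result)
```

<p>The row from March is dropped by the date filter, so its large count does not push an outdated query to the top.</p>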
<p>One more thing to mention about related queries is that it makes sense to include only queries that actually produce results. People don&#8217;t like to click and be taken to a &#8220;nothing is found&#8221; page. The problem here is that a query that previously returned results may no longer find them, for example because you deleted the matching documents from your main search index. If you&#8217;re worried about this, it makes sense to do two things:</p>
<ol>
<li>Add one more attribute &#8216;last_total_found&#8217; to the queries index and update it along with updating the &#8216;count&#8217; attribute. Thus the &#8220;nothing is found&#8221; issue can happen only for the first user who clicks on the link.</li>
<li>Periodically recheck all your queries to be sure they&#8217;re still producing results, even though this might take additional resources.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-top-related-queries/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: Fuzzy matching and 2nd pass query</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-fuzzy-matching-and-2nd-pass-query/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-fuzzy-matching-and-2nd-pass-query/#comments</comments>
		<pubDate>Fri, 13 Apr 2012 06:17:41 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Optimization]]></category>
		<category><![CDATA[Sphinx in action]]></category>
		<category><![CDATA[Tuning]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1070</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. Many customers that we have helped with integrating search into their applications wanted their search to be more intelligent than just strictly matching a query with documents. There are many ways to do this. Sphinx makes [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>Many customers that we have helped with integrating search into their applications wanted their search to be more intelligent than just strictly matching a query with documents.<br />
There are many ways to do this. Sphinx makes it very easy, as fuzzy matching is included out of the box. It consists of two main components:<br />
<span id="more-1070"></span></p>
<ol>
<li>Quorum operator:
<pre class="brush:bash">
"computing and technology news"/2
</pre>
<p>This means that at least two words of the phrase should match, i.e. this query would find texts containing &#8220;computing news&#8221; as well as texts containing &#8220;technology news&#8221;.</p>
</li>
<li>Proximity search operator:
<pre class="brush:bash">
"computing news"~3
</pre>
<p>This means that there can be fewer than N non-matching words between the words from the query. Here are a few examples:<br />
If the text is &#8220;a b c d e f g h&#8221;:</p>
<ul>
<li>&#8220;a h&#8221;~7 would find it, since &#8220;b c d e f g&#8221; are 6 non-matching words and this is less than 7;<br />
however, &#8220;a h&#8221;~6 wouldn&#8217;t find the text</li>
<li>&#8220;a d h&#8221;~6 would find it (5 non-matching words)</li>
<li>&#8220;a d h&#8221;~5 wouldn&#8217;t</li>
</ul>
</li>
</ol>
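<p>If the two rules are easier to follow as code, here is a small Python model of them. It only mirrors the rules as described above (whitespace tokenization, no stemming, no stopwords) and is not how Sphinx implements the operators; quorum_match and proximity_match are names made up for this sketch.</p>

```python
def quorum_match(text, phrase, threshold):
    """True if at least `threshold` distinct words of `phrase` occur in `text`."""
    text_words = set(text.lower().split())
    phrase_words = set(phrase.lower().split())
    return len(text_words & phrase_words) >= threshold


def proximity_match(text, phrase, distance):
    """True if all phrase words fit in a window with fewer than `distance`
    other words in it."""
    words = text.lower().split()
    wanted = set(phrase.lower().split())
    best = None  # smallest number of non-matching words found so far
    for start in range(len(words)):
        seen = set()
        for end in range(start, len(words)):
            if words[end] in wanted:
                seen.add(words[end])
            if seen == wanted:
                gap = (end - start + 1) - len(wanted)
                best = gap if best is None else min(best, gap)
                break
    return best is not None and distance > best


text = "a b c d e f g h"
print(proximity_match(text, "a h", 7), proximity_match(text, "a h", 6))
print(proximity_match(text, "a d h", 6), proximity_match(text, "a d h", 5))
print(quorum_match("latest technology news", "computing and technology news", 2))
```

<p>The proximity calls reproduce the four examples from the list above, and the quorum call matches because two of the four query words are present in the text.</p>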
<p>Many of our customers like the second pass logic, where the first query to Sphinx is strict and, if nothing is found or not enough results are returned, a second, less strict query is issued. There may also be more advanced logic with 3rd and 4th passes. It just depends on your requirements and on whether you want your users to find at least something in any case or, on the contrary, to find only what exactly matches their query.</p>
<p>Sometimes it makes sense to do it the other way around and make the query stricter. For example, if your default matching strategy is &#8216;any word should match&#8217; and your application doesn’t have any extended syntax to allow users to specify the best query themselves, it makes sense to first try the &#8216;all words should match&#8217; strategy or even the &#8216;phrase should match&#8217;. This might significantly increase the quality of the search.</p>
<p>In some applications it makes sense to parallelize the 1st/2nd pass queries and so on. This can be easily done using Sphinx multi-queries. The idea is to run the 2nd pass query in advance: if nothing is found by the 1st pass query, the 2nd pass results are already there, and since the queries were done simultaneously this improves performance. However, you should be careful, as this can sometimes reduce performance instead. It depends on several things:</p>
<ul>
<li>What hardware are you using? If it&#8217;s not powerful enough to handle two queries at once, or the response time is close to double that of a single query, it makes little sense to use this technique.</li>
<li>What is your load? If your Sphinx is already heavily loaded, you will get a worse response time.</li>
<li>What are your statistics? If the 1st pass query returns results for 99% of queries, there&#8217;s little point in running the 2nd pass query along with the 1st one; this will just waste resources.</li>
</ul>
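<p>The sequential version of the multi-pass logic can be sketched in a few lines of Python. This is a hypothetical helper, not part of any Sphinx client: search_with_fallback takes any callable that runs a query and returns a list of results, and fake_index below is just a stand-in for a real Sphinx call.</p>

```python
def search_with_fallback(run_query, passes, min_results=1):
    """Try each query in `passes` in order and stop at the first one that
    returns at least `min_results` results. Returns (query_used, results)."""
    used = None
    results = []
    for query in passes:
        used = query
        results = run_query(query)
        if len(results) >= min_results:
            break
    return used, results


# Stand-in for a real Sphinx call: maps a query string to matching ids.
fake_index = {
    '"computing news"': [],
    '"computing news"~3': [],
    '"computing news"/1': [7, 12],
}
used, results = search_with_fallback(
    lambda q: fake_index.get(q, []),
    ['"computing news"', '"computing news"~3', '"computing news"/1'],
)
print(used, results)
```

<p>Ordering the passes from strictest to loosest gives the behavior described above; reversing the list gives the "make the query stricter" variant.</p>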
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-fuzzy-matching-and-2nd-pass-query/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: How Sphinx handles text during indexing</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-how-sphinx-handles-text-during-indexing/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-how-sphinx-handles-text-during-indexing/#comments</comments>
		<pubDate>Fri, 13 Apr 2012 06:05:09 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Configuration]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Sphinx in action]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1066</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. When integrating full-text search into your application it&#8217;s important to think about tokenizing your text, or how it gets split up into words. This is the first process your data goes through once you&#8217;ve built [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>When integrating full-text search into your application it&#8217;s important to think about <em>tokenizing</em> your text, or how it gets split up into words. This is the first process your data goes through once you&#8217;ve built the config and begun indexing.</p>
<p>Sphinx includes a variety of settings that give you full control of the way your text gets tokenized. They are:</p>
<p><span id="more-1066"></span></p>
<ul>
<li><strong>charset_table;</strong> lets you define the characters that will be treated as normal characters; everything else will be a separator.<br />
You can also use:</p>
<ul>
<li>Ranges: a..z</li>
<li>Char mapping: A-&gt;a</li>
<li>Range mapping: A..Z-&gt;a..z</li>
<li>Char codes: U+410..U+42F</li>
</ul>
</li>
<li><strong>ngram_chars;</strong> If you have to deal with Chinese/Japanese/Korean text, whose structure differs significantly from other languages, this can be useful: it lets you insert a separator after each character, so instead of one long word you will have each character indexed as a separate token. The same will happen with the query, allowing your users to find what they&#8217;re looking for.<br />
Example:<br />
ngram_chars = U+3000..U+2FA1F</li>
<li><strong>ignore_chars;</strong> lets you totally ignore some characters.</li>
<li><strong>blend_chars;</strong> allows you to make some characters act as both separators and normal characters. For example, if you have Twitter-related things like @username in your data and want to let users search for exact Twitter nicks, you can easily do so by adding &#8216;@&#8217; to blend_chars. Then the query &#8216;@username&#8217; won&#8217;t find the text &#8216;username&#8217;, while the query &#8216;username&#8217; will still find them both.<br />
Example:<br />
blend_chars = @</li>
</ul>
<p><strong>Exceptions, wordforms and stopwords</strong></p>
<p>That concludes the character level commands. The next process the text goes through while getting indexed is the word handling level controlled by stopwords, word length directives, exceptions and wordforms.</p>
<ul>
<li>By using stopwords, you can make one or more lists of words that will not be indexed at all. These are typically very frequently used words such as &#8216;a&#8217;, &#8216;the&#8217;, &#8216;and&#8217;, etc. There are two goals here. Firstly, it improves the search quality: once these words are excluded from both the index and the query, matching becomes more flexible. For example, if you have &#8216;on&#8217;, &#8216;a&#8217;, &#8216;the&#8217; and &#8216;my&#8217; in the stopwords, the query &#8216;search on my site&#8217; will also find such phrases as &#8216;search on a site&#8217;, &#8216;search on the site&#8217;, &#8216;search my site&#8217; etc. Secondly, it decreases the size of your index, because indexing words that occur in almost all documents takes more resources and thus slows down performance.</li>
<li>The related directive used to configure stopword behavior is stopword_step. If it is set to one (which is the default value), the query “search on my site” will match any of the following phrases: “search on a site”, “search on the site”, or in general &#8220;search + ANY_OTHER_STOPWORD + ANY_OTHER_STOPWORD + site&#8221;. However, if it is set to zero, it will also match “search my site” or “search site”. In other words, all stopwords will just be ignored.</li>
<li><strong>min_word_len;</strong> This specifies the minimum word length to index. Be careful to set the related directive overshort_step correctly. If it&#8217;s set to 1 and your min_word_len = 2, the query &#8220;search on site&#8221; will not match &#8220;search on a site&#8221;; if it&#8217;s set to 0, it will match that phrase.</li>
<li><strong>prefix/infix directives</strong> help you filter out things that shouldn&#8217;t be indexed and, conversely, index substrings that should be.<br />
Indexing whole words is important, but sometimes it also makes sense to be able to search by close variants, e.g. if you type in &#8220;dogs&#8221; you might also want to find &#8220;dog&#8221;, or you want &#8220;hero&#8221; to find &#8220;superhero&#8221;. This is enabled by a variety of directives: min_prefix_len, min_infix_len, prefix_fields, infix_fields, enable_star, expand_keywords. Using these directives, you can configure exactly how your words should be split into substrings. You can specify how many characters can be trimmed from the end or from both ends of the word, which full-text fields this should be applied to, and whether to support wildcard syntax (e.g. dog*) or treat all query words as substrings. Be aware that using prefixes and infixes increases your index size and might affect performance.</li>
<li><strong>Exceptions and wordforms</strong><br />
Another thing Sphinx allows you to do is define a list of words that should be mapped to each other. For example, you can tell Sphinx to treat USA, U.S.A, US, U.S, America, United States and United States of America as one and the same word. To do this, use the &#8216;exceptions&#8217; directive. It works at a very low level, even before the text is tokenized. Using it, you can map all the related variants to one specific word, for example USA, and the same will happen with the search query. This might dramatically increase your search quality, especially if you have to deal with products that have different names and abbreviations which all mean the same thing (PlayStation, Play Station, PS, Sony PlayStation etc.). Another reason to use exceptions is to index something that contains a stopword or a very short word which would otherwise be dropped. For example, &#8220;The Matrix&#8221; would otherwise be converted to &#8220;Matrix&#8221;, and &#8220;vitamin a&#8221; would become &#8220;vitamin&#8221; when min_word_len = 2 or greater.<br />
*Note: Since exceptions work before tokenizing they have to be case sensitive (at least this is how it works now).</p>
<p>Example:<br />
U.S.A. =&gt; USA<br />
U.S. =&gt; USA<br />
US =&gt; USA<br />
us =&gt; USA</p>
<p>Sometimes it makes sense to map a word to itself in the exceptions:<br />
AT&amp;T =&gt; AT&amp;T</p>
<p>This lets the user search for &#8216;AT&amp;T&#8217; and find exactly &#8216;AT&amp;T&#8217;, not the separate words &#8216;AT&#8217; and &#8216;T&#8217; which the tokenizer would return if &#8216;&amp;&#8217; is a separator.</p>
<p>The &#8216;wordforms&#8217; directive is similar to the exceptions, with one difference: it&#8217;s applied <em>after</em> tokenizing. This makes wordforms case insensitive, which is good, but it also means you cannot use them to handle cases like &#8216;AT&amp;T&#8217;, which is bad. However, wordforms work much faster, as they were designed to handle millions of different word forms, and they can be especially useful when combined with stemming.</li>
<li><strong>Stemming</strong><br />
Using stemming you can improve your search quality even more. For instance if you enable English stemmer &#8216;walking&#8217;, &#8216;walks&#8217;, &#8216;walked&#8217; will be all converted to &#8216;walk&#8217;. The same will happen with the query and you will be able to find &#8216;he was walking on the street&#8217; by searching for &#8216;walked&#8217;.<br />
Sphinx supports English, Russian and Czech stemming out of the box, and if it was built with --with-libstemmer it supports other languages through the <a href="http://snowball.tartarus.org/download.php">Snowball libstemmer library</a>.</p>
<p>These morphology processors are not perfect, but as I said, wordforms can be especially useful when combined with stemming, because once a word is found in the wordforms it won&#8217;t be processed later by the stemmer. Therefore, you can override something which doesn&#8217;t work perfectly. For example, &#8216;does&#8217; gets converted to &#8216;doe&#8217; by the English stemmer, but you can override this using the wordforms like this:</p>
<p>does &gt; do<br />
Likewise, it will also match &#8216;do&#8217; when &#8216;does&#8217; is typed in.</li>
<li><strong>HTML stripping</strong><br />
Another nice feature is HTML stripping. Sphinx is often used to search web pages, which are HTML documents, and with the &#8216;html_strip&#8217; directive you can let Sphinx do the HTML parsing job.<br />
This is important when you still need the markup in your data source and don&#8217;t want to store both raw and stripped versions of the text.</li>
</ul>
<p>To figure out exactly how your current index settings work, you can use the &#8216;call keywords()&#8217; function in SphinxQL:</p>
<pre class="brush: bash">mysql&gt; call keywords('abc a b c the AT&amp;T A&amp;BULL', 'idx');
+-----------+------------+
| tokenized | normalized |
+-----------+------------+
| abc       | abc        |
| b         | b          |
| c         | c          |
| AT&amp;T      | AT&amp;T       |
| bull      | bull       |
+-----------+------------+
5 rows in set (0.00 sec)</pre>
<p>You can see that the text &#8216;abc a b c the AT&amp;T A&amp;BULL&#8217; was split into the words &#8216;abc&#8217;, &#8216;b&#8217;, &#8216;c&#8217;, &#8216;AT&amp;T&#8217; and &#8216;bull&#8217;. &#8216;a&#8217; and &#8216;the&#8217; were skipped because they&#8217;re stopwords. AT&amp;T is an exception, which is why it was not split into &#8216;AT&#8217; and &#8216;T&#8217;. &#8216;A&amp;BULL&#8217;, by contrast, was split into &#8216;A&#8217; and &#8216;BULL&#8217;, and then &#8216;A&#8217; was not indexed because it&#8217;s a stopword.<br />
In the SphinxAPI this function is called &#8216;BuildKeywords()&#8217;.</p>
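<p>To see how these stages fit together, here is a deliberately simplified Python model that reproduces the example above. It is not Sphinx&#8217;s tokenizer: it assumes whitespace-separated input, treats everything except ASCII letters and digits as a separator, and skips wordforms, stemming and blend_chars entirely. The tokenize function and its arguments are made up for illustration.</p>

```python
import re


def tokenize(text, exceptions, stopwords):
    """Return the tokens a simplified Sphinx-like pipeline would index."""
    tokens = []
    for chunk in text.split():
        # Exceptions are case sensitive and applied before tokenizing.
        if chunk in exceptions:
            tokens.append(exceptions[chunk])
            continue
        # Everything that is not a letter or digit acts as a separator.
        for word in re.findall(r"[a-z0-9]+", chunk.lower()):
            if word not in stopwords:  # stopwords are not indexed
                tokens.append(word)
    return tokens


tokens = tokenize(
    "abc a b c the AT&T A&BULL",
    exceptions={"AT&T": "AT&T"},
    stopwords={"a", "the"},
)
print(tokens)
```

<p>With these settings the model yields the same five tokens as the &#8216;call keywords()&#8217; output: abc, b, c, AT&amp;T and bull.</p>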
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-how-sphinx-handles-text-during-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: Autocomplete</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-autocomplete/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-autocomplete/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 05:36:31 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Real Time indexes]]></category>
		<category><![CDATA[RT indexes]]></category>
		<category><![CDATA[Sphinx in action]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1074</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. Autocomplete provides suggestions as the user is typing in the search field. Everyone knows how it works, this is the first thing we see when using Google: Many people like this and want it in their [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>Autocomplete provides suggestions as the user types in the search field. Everyone knows how it works; it is the first thing we see when using Google:</p>
<p><img src="http://www.ivinco.com/wp-content/uploads/2012/04/autocomplete_img.jpg" alt="" title="autocomplete_img" width="594" height="135" class="alignnone size-full wp-image-1165" /></p>
<p>Many people like this and want it in their applications, and Sphinx can be used to implement this functionality.<br />
<span id="more-1074"></span></p>
<p>First, decide what things should be shown as suggestions. These can be:</p>
<ul>
<li>Blog post titles or anything else that makes sense in your application</li>
<li>Previous/top user queries</li>
<li>Separate words from your dataset prepared with &#8216;indexer --buildstops&#8217;, or by other means</li>
<li>Anything else that works</li>
</ul>
<p>We just need to gather the data we want to use for autocomplete somewhere it can be indexed by Sphinx, or we can use a Sphinx real time index to populate the autocomplete index (<a href="#rt">see below</a>).</p>
<p>Once we have our suggestions data, we need to build the index. Many people find this difficult, since Sphinx requires a unique id for each document and sometimes the documents don’t have ids, or their ids can overlap if you want to merge several sources (e.g. recent user queries and product names). We can handle this by using hashes instead of the ids. This can be <a href="http://www.divinekonection.info/articles/CRC-Error-Detection--How-CRC-Works-a13.html">crc32()</a> if your autocomplete index is not very large, or something else like <a href="http://stackoverflow.com/questions/4940316/sphinx-and-guids?answertab=active#tab-top">half of md5()</a> if your index is large enough to cause <a href="http://www.quora.com/Hashing/What-is-the-probability-of-having-CRC32-collision-on-simple-strings">crc32() collisions</a>.</p>
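<p>Here is a quick Python sketch of the two id options, in case you compute the ids on the application side rather than in SQL. The function names are made up, and "half of md5()" is interpreted here as the first 8 bytes (16 hex digits) of the digest, which is an assumption; whatever you choose, the same function must be used everywhere the id is computed.</p>

```python
import hashlib
import zlib


def crc32_id(text):
    """32-bit id; fine for small indexes, collisions get likelier as they grow."""
    return zlib.crc32(text.encode("utf-8"))


def half_md5_id(text):
    """64-bit id built from the first 8 bytes (16 hex digits) of the MD5 digest."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest()[:16], 16)


for name in ("bee", "bat", "bear"):
    print(name, crc32_id(name), half_md5_id(name))
```

<p>The 64-bit variant only helps if your Sphinx build and index accept ids that wide, so check that before switching to it.</p>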
<p>The other thing to decide is how to sort the suggestions. To do this, we can add a few attributes. We often use the suggestion length and the suggestion word count (which is easily calculated using the &#8216;sql_attr_str2wordcount&#8217; directive). Here&#8217;s an example:</p>
<pre class="brush:bash">
source keyword {
        sql_query = select crc32(name), name keyword, length(name) length, name keyword_wc from animals
        sql_attr_uint           = length
        sql_field_string        = keyword
        sql_attr_str2wordcount  = keyword_wc
...
}

index keyword {
        source                  = keyword
        path                    = keyword
        docinfo                 = extern
        min_prefix_len          = 1
...
}
</pre>
<p>The problem with using a hash as the id is that it is hard to look up the original text by hash in the db, but we don&#8217;t need to if all the strings are stored right in the index as Sphinx string attributes (sql_field_string). There&#8217;s another reason to do so: users hit this index more often than the others because each typed character generates a query against Sphinx, so it has to be optimized for the best performance. Storing the strings in memory removes the round trip to the db, which makes the lookup very fast.</p>
<p>It is important to make sure the server has enough memory for the index and that the index stays in RAM, so Sphinx doesn&#8217;t do unwanted reads from the disk. The &#8216;mlock&#8217; directive asks searchd to lock the index data in memory:</p>
<pre class="brush:bash">
index keyword {
...
        mlock                   = 1
}
</pre>
<p>Now, all we need to do is query the index like this (or use other attributes to sort the results):</p>
<pre class="brush:bash">
mysql> select keyword from keyword where match('^b*') order by keyword_wc asc, length asc limit 10 option max_matches=10, ranker=none;
+-----------+
| keyword   |
+-----------+
| bee       |
| bat       |
| bear      |
| boar      |
| bison     |
| beaver    |
| buffalo   |
| bush baby |
+-----------+
</pre>
<p>If we don&#8217;t need full-text ranking (i.e. sorting by attributes is enough), we should set the &#8216;ranker=none&#8217; option. This improves performance.</p>
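<p>Since every keystroke produces a query, the application has to turn whatever the user has typed so far into a match expression like the one above. The following Python sketch shows one way to do that. The helper names are hypothetical, the index and attribute names are taken from the example config, and stripping every non-word character is just one simple (and rather aggressive) way to keep user input from injecting query operators:</p>
<pre class="brush:python">
import re


def build_match(typed):
    """Turn the text typed so far into a prefix match expression such as ^b*."""
    # Replace anything that is not a letter, digit, underscore or whitespace,
    # so user input cannot inject quotes or query operators.
    words = re.sub(r"[^\w\s]", " ", typed).split()
    if not words:
        return None
    # Anchor the first word to the field start and expand only the last word.
    return "^" + " ".join(words) + "*"


def build_query(typed, limit=10):
    """Build the SphinxQL statement for one keystroke, or None for empty input."""
    match = build_match(typed)
    if match is None:
        return None
    return (
        "select keyword from keyword where match('{0}') "
        "order by keyword_wc asc, length asc limit {1} "
        "option max_matches={1}, ranker=none"
    ).format(match, int(limit))


print(build_query("b"))
</pre>
<p>The resulting string can be sent to searchd through any MySQL client library, because SphinxQL speaks the MySQL protocol.</p>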
<p><a name="rt"></a><br />
A Sphinx real time index is a good fit for autocomplete suggestions, because:</p>
<ul>
<li>We won&#8217;t need to rebuild the index if it is real time. We just need to insert into it from the same code that inserts into the db.</li>
<li>When we use recent user queries, we can insert them into the real time index <em>only</em> (unless you need them in the db as well).</li>
</ul>
<p>Although Sphinx real time indexes have a <a href="http://www.ivinco.com/blog/sphinx-in-action-good-and-bad-in-sphinx-real-time-indexes/">few disadvantages</a>, they&#8217;re not critical for autocomplete. Since the index usually isn’t very big and we&#8217;re going to store the strings in memory anyway, the difference in resource consumption compared to a non-real time index will be even less significant.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-autocomplete/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: It all starts with indexing</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-it-all-starts-with-indexing/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-it-all-starts-with-indexing/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 05:06:15 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Configuration]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Main+delta]]></category>
		<category><![CDATA[Sphinx in action]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1058</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. Sphinx usage in any project usually starts with building the Sphinx configuration and indexing your data. In this article I&#8217;m going to point out some cool things that Sphinx provides regarding the configuration and how your [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>Sphinx usage in any project usually starts with building the Sphinx configuration and indexing your data. In this article I&#8217;m going to point out some cool things that Sphinx provides regarding the configuration and how your data is indexed:</p>
<ul>
<li>First, I have to say that it supports inheritance. This is very useful because you can define things (like the connection to your database) only once and they will be reused by all the other sources. Here&#8217;s an example:<br />
<span id="more-1058"></span></p>
<pre class="brush:bash">source text1
{
        type            = mysql
        sql_host        = localhost
        sql_user        = b
        sql_pass        = u
        sql_db          = b
        sql_port        = 3306
        sql_query       = select id, body, published, lat, long, category from table
        sql_attr_timestamp      = published
        sql_attr_float  = lat
        sql_attr_float  = long
        sql_attr_uint   = category
}

source text2 : text1
{
        sql_query       = select id, user_name, inserted from table2
        sql_attr_timestamp      = inserted
        sql_attr_float  =
        sql_attr_uint   =
}</pre>
</li>
<li>If you&#8217;re a real hater of copy-paste, you&#8217;ll be happy to know that the Sphinx config supports a shebang line, which lets you generate your Sphinx config with your favorite scripting language, for example:
<pre class="brush:bash">#!/usr/bin/php
&lt;?php
$m = new mysqli('maindb', 'user', 'password', 'main');
$res = $m-&gt;query("select site_map.id, ip from site_map left join server on site_map.master_id = server.id");
while ($row=$res-&gt;fetch_assoc()) {
        $n = $row['id'];
        $host = $row['ip'];
        echo "
source chunk{$n} {
    type = mysql
    sql_host = {$host}
    sql_user = user
    sql_pass = pass
    sql_db = c{$n}
    sql_query = select id, {$n} chunk_id, body from a{$n} where id&gt;=\$start AND id&lt;=\$end and crawled=0
    sql_query_range     = SELECT MIN(id),MAX(id) FROM a$n
    sql_range_step = 100000
}
";
}
...</pre>
</li>
</ul>
<p>This opens up many possibilities for making the Sphinx config really dynamic. It is especially important in large applications that have a lot of indexes and multiple servers. You can set up your config just once, and as you scale your project, it will rebuild itself automatically. One thing you need to take care of is sending a signal to the running searchd process so it starts using the new config. You can make life even easier by incorporating the signal sending into the config itself, so that once you reindex and the config has changed, the searchd process is informed and switches to the new config.</p>
<p>Another cool thing you can do with data indexing is using a so-called main + delta scheme. Here’s what it does:</p>
<ul>
<li>It reduces how often the main part of the index needs to be rebuilt.</li>
<li>When you update a field in your data source and need to reflect this in your Sphinx index, you only need to rebuild the delta. When the delta grows big enough that rebuilding it takes significant time, you can move all its data into the main index. This empties the delta, so it rebuilds quickly again.</li>
<li>It makes your updated data appear in the index much faster.</li>
<li>By fetching less data from the database, it reduces the load on your server.</li>
</ul>
<p>There are two main approaches when it comes to the main+delta:</p>
<ul>
<li>Split data by &#8216;id&#8217;:
<pre class="brush:bash">source main {
        sql_query_range = select min(id), max(id) from dogs
        sql_query       = select id, name from dogs where id &gt;= $start and id &lt;= $end
        sql_range_step = 1000
        sql_query_post_index = replace into sphinx_helper (type, const) values ('dogs', $maxid)
...
}

source delta : main {
        sql_query_range = select const, (select max(id) from dogs) from sphinx_helper where type = 'dogs'
        sql_query_post_index = replace into sphinx_helper (type, const) values ('dogs', $maxid)
...
}</pre>
<p>This is the more traditional method; however, the drawback is that if an older document gets updated and its &#8216;id&#8217; is unchanged, it won&#8217;t be reindexed until the main part of the index is rebuilt.
</li>
<li>Split data by &#8216;updated&#8217;:
<pre class="brush:bash">source main {
        sql_query_pre   = REPLACE INTO sphinx_helper set type = 'cats', tmp = (select max(updated) from cats)
        sql_query_range = select unix_timestamp(min(updated)), (select tmp from sphinx_helper where type = 'cats') from cats
        sql_query               = select id, name from cats where updated &gt;= $start and updated &lt;= $end
        sql_query_post_index = update sphinx_helper set const = tmp where type = 'cats'
        sql_range_step  = 3600
...
}
source delta : main {
        sql_query_range = select const, unix_timestamp() from sphinx_helper where type = 'cats'
        sql_query_killlist = select id from cats u  where updated &gt; (select const from sphinx_helper where type = 'cats')
        sql_query_post_index =
...
}</pre>
<p>This aims to avoid the drawback of the split-by-id approach: once an old document gets updated, its &#8216;updated&#8217; field changes as well, and the document is indexed as soon as the delta index is rebuilt. Technically the updated document is now in both the delta and the main index parts. Using sql_query_killlist lets us explicitly tell Sphinx to use the one in the delta.<br />
You should be careful to pick a good sql_range_step value, which defines the time window used to fetch docs from the db in one query. 3600 means one hour (3600 seconds). If the value is too low, there will be too many queries that return no results, wasting resources and indexing time. On the other hand, if the window is too wide, a single query may fetch too much data from the db and overload it.
</li>
</ul>
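<p>To see why the kill list matters in the split by &#8216;updated&#8217; scheme, here is a toy Python model of searching main + delta. It is only an illustration of the idea, not how searchd is implemented: the indexes are plain dictionaries, the data is made up, and &#8220;matching&#8221; is a simple substring test:</p>
<pre class="brush:python">
def search(main, delta, kill_list, word):
    """Return {doc_id: name} for documents whose name contains the word.

    main and delta map document ids to names; kill_list holds the ids
    whose copy in main is stale and must be suppressed.
    """
    hits = {}
    for doc_id, name in main.items():
        if doc_id not in kill_list and word in name:
            hits[doc_id] = name
    for doc_id, name in delta.items():
        if word in name:
            hits[doc_id] = name
    return hits


# Main was built earlier; document 2 was renamed afterwards, so it is
# re-indexed into the delta and listed in the kill list.
main = {1: "tabby cat", 2: "black cat", 3: "siamese cat"}
delta = {2: "black panther", 4: "bengal cat"}
kill_list = {2}

print(search(main, delta, kill_list, "cat"))  # stale copy of doc 2 is hidden
print(search(main, delta, set(), "cat"))      # without it, doc 2 still matches
</pre>
<p>Without the kill list, the outdated copy of document 2 in the main part keeps matching queries that the updated document should no longer match.</p>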
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-it-all-starts-with-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: Did you mean &#8230; ?</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-did-you-mean/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-did-you-mean/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 04:47:31 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Tips]]></category>
		<category><![CDATA[Sphinx]]></category>
		<category><![CDATA[Sphinx in action]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1078</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. Many of our customers want to incorporate &#8220;did you mean &#8230; ?&#8221; functionality into their applications. How it works is that when a typo is made in the query, a corrected version of the typo is [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>Many of our customers want to incorporate &#8220;did you mean &#8230; ?&#8221; functionality into their applications. The idea is that when the query contains a typo, a corrected version of the query is suggested.</p>
<p>This can be done using a special technique with Sphinx that has been successfully used in many different projects.</p>
<p>There&#8217;s a demo of this technique in the misc/suggest/ directory of the Sphinx source and an <a href="http://habrahabr.ru/post/61807/">article</a> (in Russian) where Andrew Aksyonoff describes the main idea behind it. The following is based on his original idea:<span id="more-1078"></span></p>
<p>The technique is based on comparing the words in the current query with words from a dictionary. It works as follows:</p>
<ol>
<li>Decide which dictionary of proven words is most suitable. It can be a real dictionary or the titles of your products, or you can even generate a dictionary from your current index using the &#8216;indexer --buildstops … --buildfreqs&#8217; command.</li>
<li>Expand each word in the dictionary by splitting it into characters and groups of characters: bigrams, trigrams and so on. For example, you would convert <em>&#8216;mysql&#8217;</em> to <em>&#8216;m y s q l _m my ys sq ql l_ __m _my mys ysq sql ql_ l__&#8217;</em>. The underscore is used to distinguish a letter at the beginning or end of the word from a letter in the middle.</li>
<li>Index your dictionary together with these expansions. You can also index attributes such as word length, or anything else that helps sort the results, for example word frequency (if you use the --buildfreqs indexer option) or a product popularity rating.</li>
<li>Do the same with the queried keywords. For example, <em>&#8216;mysslq&#8217;</em> becomes <em>&#8216;m y s s l q _m my ys ss sl lq q_ __m _my mys yss ssl slq lq_ q__&#8217;</em>. You can put letters, bigrams and trigrams into separate fields to improve the quality.</li>
<li>Finally, search for the resulting string in the index using the &#8220;…&#8221;/1 syntax (i.e. the <a href="http://sphinxsearch.com/docs/current.html#extended-syntax">quorum operator</a>).</li>
</ol>
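<p>The expansion from steps 2 and 4 can be sketched in a few lines of Python. This is an illustration rather than the code from misc/suggest/; the function names are made up, and the field names (trigrams, bigrams, onegrams) are assumed to match the example query further down:</p>
<pre class="brush:python">
def ngrams(word, n):
    """Return the n-grams of word, padded with n - 1 underscores on each side."""
    pad = "_" * (n - 1)
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def expand(word):
    """Letters, bigrams and trigrams joined into one string, as in step 2."""
    parts = ngrams(word, 1) + ngrams(word, 2) + ngrams(word, 3)
    return " ".join(parts)


def quorum_query(word):
    """Build a per-field quorum query; the field names are assumptions."""
    fields = (("trigrams", 3), ("bigrams", 2), ("onegrams", 1))
    return " ".join(
        '@{0} "{1}"/1'.format(name, " ".join(ngrams(word, n)))
        for name, n in fields
    )


print(expand("mysql"))
print(quorum_query("mysslq"))
</pre>
<p>The same expand() function can be used both when preparing the dictionary for indexing and when processing the user&#8217;s query, which keeps the two sides consistent.</p>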
<p>The idea is that many parts of the mistyped query will still match the corresponding parts of the best suggestions.</p>
<p>To improve quality you can use ranker=wordcount (SPH_RANK_WORDCOUNT in the API) or ranker=proximity (SPH_RANK_PROXIMITY in the API) to calculate the weight based on the number of matching words or on proximity, respectively, rather than on the statistics-based BM25 algorithm.</p>
<p>You can sort not only by full-text weight but also by how much the suggestion differs from the mistyped query, for example in length (the smaller the difference, the better).</p>
<p>You can also filter by length to avoid suggestions that are too short or too long. The length of a mistyped query rarely differs much from the length of the correct word (usually by no more than 2-3 letters).</p>
<p>Here is an example:</p>
<pre class="brush:bash">
	mysql> select keyword, len, freq, @weight + 2 - abs(7 - len) final from suggest where match('@trigrams "__m _ms mse sea eag age ge_ e__ "/1 @bigrams "_m ms se ea ag ge e_ "/1 @onegrams "m s e a g e "/1') and len >= 5 and len <= 9 order by final desc, freq desc limit 10 option ranker=wordcount;
	+-------+--------+------+-----+----------+-------+
	| id    | weight | freq | len | keyword  | final |
	+-------+--------+------+-----+----------+-------+
	| 3425  |     17 | 1560 |   7 | message  |    19 |
	| 17492 |     17 |  288 |   7 | mileage  |    19 |
	| 28521 |     16 |  163 |   7 | massage  |    18 |
	| 10566 |     15 |  504 |   8 | messages |    16 |
	| 38476 |     14 |  114 |   7 | teenage  |    16 |
	| 53885 |     14 |   74 |   7 | baggage  |    16 |
	| 7198  |     13 |  755 |   7 | average  |    15 |
	| 14641 |     14 |  350 |   8 | marriage |    15 |
	| 16844 |     14 |  301 |   6 | manage   |    15 |
	| 20092 |     13 |  246 |   7 | disease  |    15 |
	+-------+--------+------+-----+----------+-------+
	10 rows in set (0.05 sec)
</pre>
<p>To improve the quality even more, it makes sense to run additional calculations in the application after you&#8217;ve fetched the docs from Sphinx. The simplest thing you can do is to calculate the Levenshtein distance between the mistyped word and each of the first ten words suggested by Sphinx.</p>
<p>You can also play with the weights of the fields (@onegrams, @bigrams, and @trigrams), the thresholds, and other attributes to find which settings will work best for your data.</p>
<p>Of course, the suggestions can’t be correct 100% of the time, if only because there are cases where two or more suggestions are equally valid. Still, the technique produces good results in most cases. For example, each of the following typos of the word 'message' gets successfully converted to 'message' or 'messages':</p>
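<p>A minimal Python sketch of this post-processing step is shown below. It assumes the application already holds the candidate words returned by Sphinx, in Sphinx&#8217;s order; the function names are made up, and the candidate list is simply the top of the example result above:</p>
<pre class="brush:python">
def levenshtein(a, b):
    """Edit distance: single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def rerank(typed, candidates):
    """Order candidates by edit distance; ties keep the order Sphinx returned."""
    return sorted(candidates, key=lambda word: levenshtein(typed, word))


top = ["message", "mileage", "massage", "messages", "teenage"]
print(rerank("mesage", top))
</pre>
<p>Because Python&#8217;s sort is stable, candidates with the same distance stay in the order Sphinx ranked them, so the full-text weight still acts as a tie-breaker.</p>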
<p>essage mssage mesage mesage messge messae messag mmessage meessage messsage messsage messaage messagge messagee emssage msesage mesasge messgae messaeg nessage jessage kessage mwssage m3ssage m4ssage mrssage mfssage mdssage msssage measage mewsage meesage medsage mexsage mezsage mesaage meswage meseage mesdage mesxage meszage messqge messwge messsge messxge messzge messafe messate messaye messahe messabe messave messagw messag3 messag4 messagr messagf messagd messags nmessage mnessage jmessage mjessage kmessage mkessage mwessage mewssage m3essage me3ssage m4essage me4ssage mressage merssage mfessage mefssage mdessage medssage msessage messsage meassage mesasage mewssage meswsage meessage mesesage medssage mesdsage mexssage mesxsage mezssage meszsage mesasage messaage meswsage messwage mesesage messeage mesdsage messdage mesxsage messxage meszsage messzage messqage messaqge messwage messawge messsage messasge messxage messaxge messzage messazge messafge messagfe messatge messagte messayge messagye messahge messaghe messabge messagbe messavge messagve messagwe messagew messag3e message3 messag4e message4 messagre messager messagfe messagef messagde messaged messagse messages</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-did-you-mean/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Sphinx in Action: Good and bad in Sphinx real time indexes</title>
		<link>http://www.ivinco.com/blog/sphinx-in-action-good-and-bad-in-sphinx-real-time-indexes/</link>
		<comments>http://www.ivinco.com/blog/sphinx-in-action-good-and-bad-in-sphinx-real-time-indexes/#comments</comments>
		<pubDate>Tue, 10 Apr 2012 12:32:20 +0000</pubDate>
		<dc:creator>Sergey Nikolaev</dc:creator>
				<category><![CDATA[Sphinx search engine]]></category>
		<category><![CDATA[Indexing]]></category>
		<category><![CDATA[Real Time indexes]]></category>
		<category><![CDATA[RT indexes]]></category>
		<category><![CDATA[Sphinx in action]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1087</guid>
		<description><![CDATA[This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by clicking here. Sphinx has supported real time indexes since version 1.10 was released. Ever since, they have been getting more stable and robust, and now it is ready for use in production. Many people like this because [...]]]></description>
			<content:encoded><![CDATA[<p>This is a post from a series of posts called &#8220;Sphinx in Action&#8221;. See the whole series by <a title="Sphinx in Action" href="http://www.ivinco.com/blog/tag/sphinx-in-action/">clicking here</a>.</p>
<p>Sphinx has supported real time indexes since version 1.10. Since then they have become more stable and robust, and they are now ready for production use. Many people like them because they are really simple to understand (i.e. no indexing, cron tasks, main + delta schemes, and so on), but anyone who wants to use them should also be aware of their drawbacks compared to traditional monolithic indexes.</p>
<p><span id="more-1087"></span></p>
<p>A real time index consists of 2 parts: one is stored in memory and the other on disk. When a new insert/update query is sent to a real time index, it updates the in-memory part. This works very fast, but memory isn&#8217;t unlimited. Once the in-memory part exceeds rt_mem_limit, it gets converted into a disk chunk, a new in-memory part is started, and so on. If you have a 10GB index and rt_mem_limit is set to 1GB, you will end up with 10 disk chunks. If you set rt_mem_limit to 10GB, you will have only 1 disk chunk. Here&#8217;s the dilemma: the more disk chunks you have, the worse your search performance will be, but less memory is needed to support the real time index. Conversely, if you set rt_mem_limit to a high value, you will have good search performance because there are fewer disk chunks, but the index will take up more memory on your server. Unfortunately, a Sphinx real time index requires much more memory than a traditional index. This is because a traditional index keeps only attributes and wordlists in memory, while a real time index keeps everything else there as well until it&#8217;s converted into a disk chunk.</p>
<p>Here&#8217;s what it looks like for the same data (1M docs) when rt_mem_limit is high enough:</p>
<p>Traditional index:<br />
<code>[root@SE01 snikolaev]# ls -sh idx.sp*<br />
<strong>12M idx.spa</strong><br />
186M idx.spd<br />
<strong>8.0K idx.sph </strong><br />
<strong>11M idx.spi</strong><br />
4.0K idx.spk<br />
4.0K idx.spl<br />
4.0K idx.spm<br />
103M idx.spp<br />
8.0K idx.sps<br />
</code></p>
<p>Real time index:<br />
<code>[root@SE01 snikolaev]# ls -sh idx_rt.*<br />
8.0K idx_rt.kill<br />
4.0K idx_rt.lock<br />
8.0K idx_rt.meta<br />
<strong>442M idx_rt.ram</strong><br />
</code></p>
<p>The bolded lines are the files that are held in memory. As you can see, the traditional index requires about 23MB while the real time index needs more than 400MB of memory.</p>
<p>In practice, we usually recommend real time indexes in 2 cases:</p>
<ol>
<li>When the data volume is really small and is not going to grow quickly. It doesn&#8217;t really matter whether you spend 5MB or 100MB if you have a few gigabytes of memory, so real time indexes make sense: you avoid having an indexing routine and can keep your database and your search index in sync at the data insert level.</li>
<li>When the data volume is large and growing, but you want to reduce indexing latency (i.e. you want your data to appear in the index in real time). And here comes the best part of real time indexes: they can be combined with traditional indexes using a Sphinx distributed index. You can still use the main + delta scheme but make the delta real time; once the main part is rebuilt, you clean the real time delta index. Since the delta is real time, new data becomes searchable immediately, and because it&#8217;s only the delta, it doesn&#8217;t require a lot of memory. The only routine left is to flush it periodically to the main part. To empty a real time index, the latest Sphinx builds provide the &#8220;TRUNCATE RTINDEX&#8221; SphinxQL command.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/sphinx-in-action-good-and-bad-in-sphinx-real-time-indexes/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Meet Ivinco at Sphinx Search Day 2012!</title>
		<link>http://www.ivinco.com/blog/meet-ivinco-at-sphinx-search-day-2012/</link>
		<comments>http://www.ivinco.com/blog/meet-ivinco-at-sphinx-search-day-2012/#comments</comments>
		<pubDate>Thu, 23 Feb 2012 14:48:15 +0000</pubDate>
		<dc:creator>Mindaugas</dc:creator>
				<category><![CDATA[News]]></category>

		<guid isPermaLink="false">http://www.ivinco.com/?p=1048</guid>
		<description><![CDATA[The Sphinx team has just announced The Sphinx Search Day, which will take place in Santa Clara, California on April 13 -  just after the MySQL Conference And Expo. This is first Sphinx event in USA and a great opportunity for Sphinx users in US to get together and meet each other and the Sphinx team. [...]]]></description>
			<content:encoded><![CDATA[<p><a rel="attachment wp-att-1051" href="http://www.ivinco.com/blog/meet-ivinco-at-sphinx-search-day-2012/sphinx/"><img class="size-full wp-image-1051 alignright" title="sphinx" src="http://www.ivinco.com/wp-content/uploads/2012/02/sphinx.jpg" alt="" width="200" height="51" /></a>The Sphinx team has just announced <a href="http://sphinxsearch.com/conference2012/">The Sphinx Search Day</a>, which will take place in Santa Clara, California on April 13 -  just after the <a href="http://www.percona.com/live/mysql-conference-2012/">MySQL Conference And Expo</a>.</p>
<p>This is the first Sphinx event in the USA and a great opportunity for Sphinx users in the US to get together and meet each other and the Sphinx team. As the organizers say:</p>
<blockquote><p>The aim of the Sphinx Search Day 2012 is to provide a technical forum to educate those who are new to Sphinx and drive innovation for long-time users. The majority of the talks will be technical in nature and more importantly delivered by real-world users and Sphinx community members.</p></blockquote>
<p>I am also very glad that Ivinco was invited to this event. In our talk we will share our experience in building Sphinx systems from small projects to multi-terabyte search engines.</p>
<p>This is a free event. If you are around Silicon Valley in April, go ahead and <a href="http://sphinxsearch.com/conference2012/">register</a>. We look forward to meeting you at the Sphinx Search Day!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivinco.com/blog/meet-ivinco-at-sphinx-search-day-2012/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
