Using Sphinx search engine with Chinese, Japanese, and Korean language documents

This article explains, step by step, how to implement full-text search over a set of documents written in Chinese, Japanese and Korean (CJK).

About CJK languages

The CJK languages use more than 40,000 characters, most of them Chinese. You may also come across the acronym CJKV, where “V” stands for the Vietnamese language.

CJK characters include:

  1. For the Chinese language: hànzì – traditional Chinese characters; Bopomofo – the Chinese phonetic alphabet; Pinyin – romanization of the Chinese language (a concept close to transliteration).
  2. For the Japanese language: Hiragana – Japanese syllabary; Katakana – Japanese syllabary; Arabic numerals.
  3. For the Korean language: Hangul – the Korean alphabet.

In addition, each language has a set of hieroglyphic keys (radicals), which act either as grouping elements for looking characters up in a dictionary or as semantic elements that define the meaning of the characters following the key.

To represent text in CJK languages you can use the following encodings: Big5, EUC-JP, EUC-KR, ISO 2022-JP, KS C 5861, Shift-JIS, Unicode, etc. When implementing full-text search for CJK text with Sphinx it is best to use Unicode (UTF-8 encoding). The CJK scripts are covered by the following Unicode blocks:

Range | Block | Comments
1100 .. 11FF | Hangul Jamo | A single letter (jamo) of the Korean Hangul alphabet; jamo are combined to form Hangul syllables
2E80 .. 2EFF | CJK Radicals Supplement | Key (radical): an element of the hieroglyphic script that groups characters or acts as a semantic element defining the meaning of the following characters
2F00 .. 2FDF | Kangxi Radicals | The Kangxi list of keys adopted in Japan, Korea and Taiwan; traditionally includes 214 characters
3000 .. 303F | CJK Symbols and Punctuation | Ideographic symbols and punctuation
3040 .. 309F | Hiragana | Japanese syllabary
30A0 .. 30FF | Katakana | Japanese syllabary
3100 .. 312F | Bopomofo | Chinese phonetic alphabet
3130 .. 318F | Hangul Compatibility Jamo |
3190 .. 319F | Kanbun | Kanbun (kambun), one of the written languages of medieval Japan
31A0 .. 31BF | Bopomofo Extended |
31C0 .. 31EF | CJK Strokes | Simple strokes (elements) of characters
31F0 .. 31FF | Katakana Phonetic Extensions |
3200 .. 32FF | Enclosed CJK Letters and Months | CJK letters and months in circles
3300 .. 33FF | CJK Compatibility |
3400 .. 4DBF | CJK Unified Ideographs Extension A | CJK ideographs
4DC0 .. 4DFF | Yijing Hexagram Symbols |
4E00 .. 9FFF | CJK Unified Ideographs | Ideograph: a written sign or conventional picture that stands for a whole word rather than a speech sound
A000 .. A48F | Yi Syllables | Yi language, spoken in the south of Sichuan province
A490 .. A4CF | Yi Radicals |
AC00 .. D7AF | Hangul Syllables | Hangul syllables
D7B0 .. D7FF | Hangul Jamo Extended-B |
20000 .. 2A6DF | CJK Unified Ideographs Extension B |
2A700 .. 2B73F | CJK Unified Ideographs Extension C |
2F800 .. 2FA1F | CJK Compatibility Ideographs Supplement |

Note that the Arabic numerals used in CJK texts are often fullwidth characters with their own codes, U+FF10 .. U+FF19 (see section FF00 .. FFEF; Halfwidth and Fullwidth Forms).
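
As a rough illustration of how the table above can be used, the following Python sketch looks up the block of each character in a string. `CJK_BLOCKS` is a hand-copied subset of the table, and `block_of` and `blocks_in` are hypothetical helper names; none of this is part of Sphinx.

```python
# Subset of the block table above: (first code point, last code point, block name).
CJK_BLOCKS = [
    (0x1100, 0x11FF, "Hangul Jamo"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]


def block_of(ch):
    """Return the block name for a single character, or None if it is not listed."""
    cp = ord(ch)
    for first, last, name in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


def blocks_in(text):
    """Return the sorted names of the listed blocks that occur in text."""
    return sorted({b for b in map(block_of, text) if b is not None})


if __name__ == "__main__":
    # Print the code point and block of a few sample characters.
    for ch in "为ひカ한a":
        print("U+%04X" % ord(ch), block_of(ch))
```

Running `blocks_in` over a sample of your documents is one quick way to find out which blocks your configuration has to cover.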

You can see here how certain characters look.

How to tell Sphinx that your document has CJK characters?

For the indexer to index CJK documents properly, you have to set these parameters in the index configuration file:

  1. charset_type – determines the encoding of the documents that will be indexed. It can be “sbcs” (Single Byte Character Set, the default) or “utf-8”.
  2. charset_table – the main parameter describing the characters. It contains the table of accepted characters and the rules for case folding.
  3. ngram_chars – the list of characters needed to split CJK text into words using the N-gram model.
  4. ngram_len – set it to 1. (We will describe the meaning of this in further posts. Apart from 0, which disables N-gram indexing, 1 is currently the only value this setting can take.)

Points 1 – 4 go into the ‘index name {…}’ section of the configuration file. Any character that is not included in the charset_table list is treated as a delimiter (like a space character) by the Sphinx indexer. Within one index, the same character set is used for indexing, query parsing, searching and building excerpts.
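
To make the four settings concrete, here is a minimal Python sketch that prints an index section containing them. The index name `cjk_docs`, the source name `cjk_src` and the path are hypothetical placeholders. The charset_table value covers only digits and Latin letters, and the single broad ngram_chars range is the one a reader reports using in the comments below, not a recommendation.

```python
def render_index(name, settings):
    """Render a Sphinx-style 'index name { ... }' section as text."""
    width = max(len(key) for key in settings)
    lines = ["index %s" % name, "{"]
    for key, value in settings.items():
        lines.append("    %s = %s" % (key.ljust(width), value))
    lines.append("}")
    return "\n".join(lines)


# All names and values below are illustrative placeholders.
CJK_INDEX = {
    "source": "cjk_src",
    "path": "/var/data/cjk_docs",
    "charset_type": "utf-8",                       # point 1
    "charset_table": "0..9, A..Z->a..z, _, a..z",  # point 2 (Latin only here)
    "ngram_chars": "U+3000..U+2FA1F",              # point 3
    "ngram_len": "1",                              # point 4
}

if __name__ == "__main__":
    print(render_index("cjk_docs", CJK_INDEX))
```

Generating the section from a dictionary is only a convenience for this article; writing the same lines by hand in the configuration file has the same effect.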

How to create descriptions for the parameters charset_table and ngram_chars

In other words, how do you tell Sphinx which UTF-8 character codes belong to the family of CJK languages?

You can use ready-made character sets for the language blocks, or you can use the data in the table above together with the configuration syntax rules to write your own descriptions of the options (see 1-4 above) for CJK characters and letters. Be careful and double-check that every block of character ranges you need is included in the character description of the Sphinx index in the configuration file. For example, if you take ready-made range descriptions and use them to index documents in the Lisu or Vai languages, search will not work properly.
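
One way to keep the range descriptions honest is to generate them from the block table and then check sample text against them. The Python sketch below is an illustration only: `NGRAM_BLOCKS` is a hand-picked subset of the table above, `format_ranges` and `uncovered` are hypothetical helper names, and the `U+XXXX..U+YYYY` output form follows the ngram_chars values quoted in the comments below.

```python
# Hand-picked (first, last) code point pairs from the block table above.
NGRAM_BLOCKS = [
    (0x3040, 0x309F),   # Hiragana
    (0x30A0, 0x30FF),   # Katakana
    (0x3400, 0x4DBF),   # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0xAC00, 0xD7AF),   # Hangul Syllables
]


def format_ranges(ranges):
    """Format code point ranges as 'U+XXXX..U+YYYY, ...' in ascending order."""
    return ", ".join("U+%04X..U+%04X" % (first, last) for first, last in sorted(ranges))


def uncovered(text, ranges):
    """Return the non-ASCII, non-space characters of text that fall outside all ranges."""
    missing = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80 or ch.isspace():
            continue  # assumption: ASCII is handled by charset_table
        if not any(first <= cp <= last for first, last in ranges):
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    print("ngram_chars =", format_ranges(NGRAM_BLOCKS))
    # U+A4D0 is a Lisu letter; it lies outside every range listed above.
    print(["U+%04X" % ord(c) for c in uncovered("为什么 \ua4d0", NGRAM_BLOCKS)])
```

Any character reported by `uncovered` belongs to a block that still has to be added to the configuration before indexing.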

Pay special attention to setting the ngram_chars parameter correctly. If it is wrong, Sphinx will not match the affected characters at search time. (This can be very painful after you have spent a few hours indexing the documents.)
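
To see why ngram_chars matters, here is a simplified Python model of indexing with ngram_len = 1: every character listed in ngram_chars becomes a keyword of its own, while other accepted characters are grouped into ordinary words. This is a sketch of the idea, not Sphinx's actual tokenizer. `NGRAM_RANGES`, `is_ngram_char`, `is_word_char` and `tokenize` are hypothetical names, and the word-character test is a stand-in for charset_table.

```python
# Stand-in for ngram_chars: the single broad range quoted in the comments below.
NGRAM_RANGES = [(0x3000, 0x2FA1F)]


def is_ngram_char(ch):
    """True if ch falls inside one of the n-gram ranges."""
    return any(first <= ord(ch) <= last for first, last in NGRAM_RANGES)


def is_word_char(ch):
    """Stand-in for charset_table: ASCII letters, digits and underscore only."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def tokenize(text):
    """Split text into keywords, emitting each n-gram character separately."""
    tokens, word = [], []
    for ch in text:
        if is_ngram_char(ch):
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)            # ngram_len = 1: one keyword per character
        elif is_word_char(ch):
            word.append(ch.lower())      # case folding, as in A..Z->a..z
        elif word:                       # anything else acts as a separator
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


if __name__ == "__main__":
    print(tokenize("Sphinx 为什么"))
```

In this model, if `NGRAM_RANGES` is empty or does not include the characters of your documents, those characters act as separators and produce no keywords at all, so nothing can match them later.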


In real life the charset_table description would be huge (compared with the size of this article).
The next step is to index (or re-index) the whole document set with the new charset_table value.

What else can you do?

If you are still alive after reading all of the above, it should not be difficult for you to build a Sphinx index for CJK documents (or for documents in any other language). You just need to choose the right settings, as described above. We recommend the following steps:

  1. Find out which Unicode blocks are used by the language or dialect you want to search.
  2. Set the appropriate options in the configuration file that describes the index.
  3. Index (or re-index) data.
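
Before step 3 it can be worth checking the two character descriptions against each other, since the comments below report that newer Sphinx versions refuse to build an index when a code point appears in both charset_table and ngram_chars. The Python sketch below parses only the simple forms that appear in this article (`a..z`, `A..Z->a..z`, `U+410..U+42F`, single characters). `parse_point`, `parse_ranges` and `first_overlap` are hypothetical helpers and do not reproduce Sphinx's own parser.

```python
def parse_point(token):
    """Parse 'U+4E00' or a single literal character into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError("unsupported token: %r" % token)


def parse_ranges(spec):
    """Return (first, last) pairs for the source side of each comma-separated entry."""
    ranges = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        source = entry.split("->")[0]          # ignore the case-folding target
        first, _, last = source.partition("..")
        lo = parse_point(first)
        hi = parse_point(last) if last else lo
        ranges.append((lo, hi))
    return ranges


def first_overlap(charset_table, ngram_chars):
    """Return the lowest code point present in both specs, or None."""
    hits = [
        max(a_lo, b_lo)
        for a_lo, a_hi in parse_ranges(charset_table)
        for b_lo, b_hi in parse_ranges(ngram_chars)
        if max(a_lo, b_lo) <= min(a_hi, b_hi)
    ]
    return min(hits) if hits else None


if __name__ == "__main__":
    table = "0..9, A..Z->a..z, _, a..z, U+1100..U+11FF"
    ngrams = "U+1100..U+11FF, U+4E00..U+9FBF"
    cp = first_overlap(table, ngrams)
    print("no overlap" if cp is None else "overlap at U+%04X" % cp)
```

With the sample values in the script, the Hangul Jamo range is listed on both sides, so the check reports U+1100, the same code point that appears in the error message quoted in the comments.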

blat, January 8th, 2013 at 12:04 pm

thanks.. Now i can fulltext usernames on my forum, like 趙升巍

EZ93, June 18th, 2014 at 2:53 pm

You have no idea the pain that we have gone through trying to get to the bottom of wtf was going on with Sphinx and Chinese; it all seemed to work with Spanish to a degree. Seriously good article, thanks for knocking months off our development time. The answer for us was:

charset_type = utf-8
charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
ngram_len = 1
ngram_chars = U+3000..U+2FA1F

richard, July 7th, 2015 at 3:15 am

Hello, sorry my English very poor.
ngram_len = 1
ngram_chars = U+3000..U+2FA1F
have a bug
For example, I search for “为什么”. Some records do not contain “为什么” but do contain “什么”, and they are returned anyway.
[words] => Array
    [为] => Array
        [docs] => 210
        [hits] => 214

    [什] => Array
        [docs] => 299
        [hits] => 302

    [么] => Array
        [docs] => 460
        [hits] => 465

So Sphinx builds the index from every single character.

Sergey Nikolaev, July 7th, 2015 at 3:22 am

Hi Richard

Are you sure your charset_table also includes these characters’ codes?

richard, July 7th, 2015 at 3:24 am

yes I’m sure
source = ys_www_zj_content
path = /service/sphinx/var/data/ys_www_zj_content
docinfo = extern
dict = keywords
min_word_len = 1

charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, 
U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF
ngram_len = 1
mlock = 0
ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF

richard, July 7th, 2015 at 3:26 am

The keyword is 为什么.
If I set
ngram_len = 1

I also get records that contain only 什么.
If I set ngram_len = 0,
I only get records like “为什么……”; records like ‘xxxx为什么xxxx’ are not found.

richard, July 7th, 2015 at 3:27 am

I use PHP:
$sphinx->SetSortMode(SPH_MATCH_EXTENDED, "id DESC");
$sphinx->SetLimits( ($page-1)*$pageSize , $pageSize , 320);
$res = $sphinx->query('@title '."(为什么)", $index);

Sergey Nikolaev, July 7th, 2015 at 12:43 pm

Hi Richard

I’ve tested this in my sandbox. It’s not actually related to the presence of the char in charset_table. In the latest version you can’t even have the same char in both charset_table and ngram_chars. But anyway, I can’t reproduce your problem:
[snikolaev@dev01 ~]$ cat sphinx_ngram.conf
source min
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = test
    sql_query = select 1, '什么' body
}

index idx_min
{
    path = idx
    source = min
    docinfo = extern
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
}

searchd
{
    query_log = query_min.log
    listen = 9307:mysql41
    log = sphinx.log
    pid_file =
    binlog_path =
}

and everything works fine:

mysql> select * from idx_min where match('为什么');
Empty set (0.01 sec)

mysql> select * from idx_min where match('什么');
+------+
| id   |
+------+
|    1 |
+------+
1 row in set (0.00 sec)

Sergey Nikolaev, July 7th, 2015 at 1:44 pm

I’ve tried your charset_table/ngram_chars. The only problem I see is that a newer Sphinx version will refuse to build the index and fail with the following error:
FATAL: index 'idx_min': 'ngram_chars': ngram characters must not be referenced anywhere else (code=U+1100)

Otherwise it works properly: I can’t find the document “什么” by the keyword “为什么”.

richard, July 7th, 2015 at 1:49 pm

Hi Sergey Nikolaev:
Thanks for your answer.
Do you mean I should install SphinxSE?
