Improve CJK tokenizer support

Moved from GitHub dgraph/2801

Posted by srfrog:

The current CJK tokenizer in v1.0.10 is the one included in Bleve. It has limited support and can yield extra tokens that aren’t needed. We need to use a package/library specifically designed for CJK support that can handle these languages better.

For example, the term “first name” (“名字”) is tokenized as “名”, “字”. But split this way, “名” means “name” and “字” means “word/character”, so we have lost “first name” as a single token. A fulltext/term lookup for “名字” won’t return the expected results, because the index only contains the individual characters. The expected index should contain the term “名字”.
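One common way dedicated CJK analyzers (e.g. Lucene’s CJKBigramFilter) handle this is to emit overlapping two-character “bigram” tokens instead of single characters, so a multi-character word like “名字” survives as an indexable token. A minimal sketch of that idea in Go (the function name `cjkBigrams` is hypothetical, not an existing Bleve or Dgraph API):

```go
package main

import (
	"fmt"
	"unicode"
)

// cjkBigrams emits overlapping two-character tokens from runs of
// Han characters, so a query for "名字" can match the indexed
// bigram "名字" rather than the separate unigrams "名" and "字".
// An isolated Han character with no Han neighbor is emitted alone.
func cjkBigrams(s string) []string {
	var tokens []string
	runes := []rune(s)
	for i := 0; i < len(runes); i++ {
		if !unicode.Is(unicode.Han, runes[i]) {
			continue
		}
		if i+1 < len(runes) && unicode.Is(unicode.Han, runes[i+1]) {
			tokens = append(tokens, string(runes[i:i+2]))
		} else if i == 0 || !unicode.Is(unicode.Han, runes[i-1]) {
			tokens = append(tokens, string(runes[i]))
		}
	}
	return tokens
}

func main() {
	fmt.Println(cjkBigrams("我的名字")) // [我的 的名 名字]
}
```

Bigrams over-generate somewhat (e.g. “的名” above is not a real word), but unlike unigrams they keep every two-character word intact, which is why dictionary-based segmenters or bigram filters are the usual choices for CJK search.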

Some CJK packages considered are:

Refers #1421

ls84 commented:

Yes, this issue causes unexpected search results. It turns anyofterms(text, "名字") into anyofterms(text, "名 字"), but the two characters “名” and “字” together should be seen as one word.

This makes search very unpredictable for Chinese. You can only use allofterms() for now, but it still treats “名字” as two separate words, which returns many extra search results.