Anyofterms doesn’t work as expected with Chinese characters

Soultrans · August 12, 2020, 2:00am

Hi, I had the same problem: I found that anyofterms can’t be used normally with Chinese. I did the same thing and used @zh, but it didn’t work very well.

http://discuss.dgraph.io/t/anyofterms-doesnt-work-as-expected-with-chinese-characters/3581/5?u=soultrans

Neeraj · August 12, 2020, 2:10am

Hey @Soultrans,

Can you provide the dgraph version that you’re using and an example that’s not working for you?

Soultrans · August 12, 2020, 2:46am

I’m using version 20.03.

An example is as follows:

{
  all(func: eq(class, "比赛日程")) {
    class
    related @filter(has(game_type) and anyofterms(game_class@zh, "<第一轮第二轮>")) {
      uid
      game_class
      content
    }
  }
}

I want the “第一轮” or “第二轮” information. However, the result also returns values such as “第”, “第一”, “第二”, “轮”…

Naman · August 12, 2020, 8:40am

Hi, can you please provide a small dataset and schema as well, to help us reproduce this easily? Also, have you tried running it on the latest release, “Shuri”?

chewxy · August 19, 2020, 1:16pm

General Reason Why It Doesn’t Work

This seems to be a word boundary issue. In English, words are bounded by whitespace, so “hello world” is two words. In Chinese, however, finding word boundaries is a Hard Problem (as in, you need proper machine learning to solve it).

“第一轮” translates to “round one” in English. In English it’s two words, "round" and "one". In Chinese, it’s three characters, each of which may or may not be a word.

So when you write anyofterms(game_class@zh, "第一轮第二轮"), it’s the equivalent of writing in English anyofterms(game_class, "round one round two"), which will be uniquified to anyofterms(game_class, "round one two"). This is fine for English. It’s not fine for Chinese.
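To make the uniquifying step concrete, here is a minimal sketch in Go. It is not Dgraph's tokenizer: `splitTerms` and `uniq` are hypothetical helpers that assume whitespace separates terms and that every Han character becomes a term of its own.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitTerms is a hypothetical, simplified stand-in for a term tokenizer:
// whitespace separates terms, and every Han character is its own term.
func splitTerms(s string) []string {
	var terms []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			terms = append(terms, string(cur))
			cur = cur[:0]
		}
	}
	for _, r := range strings.ToLower(s) {
		switch {
		case unicode.Is(unicode.Han, r):
			flush()
			terms = append(terms, string(r))
		case unicode.IsSpace(r):
			flush()
		default:
			cur = append(cur, r)
		}
	}
	flush()
	return terms
}

// uniq drops repeated terms while keeping first-seen order.
func uniq(terms []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, t := range terms {
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(uniq(splitTerms("round one round two"))) // [round one two]
	fmt.Println(uniq(splitTerms("第一轮第二轮")))              // [第 一 轮 二]
}
```

Under these assumptions the English query keeps three meaningful words, while the Chinese query collapses into four single characters that no longer say anything about "round one" or "round two".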

Specific Reason Why It Doesn’t Work Right Now

Specifically, the Bleve tokenizer that Dgraph uses does not do language-based segmentation. So after passing the query through the tokenizer and then uniquifying the results, we get the equivalent of running anyofterms(game_class@zh, "第一轮二").
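The effect on matching can be sketched as follows. This is an illustration only: `hanTerms` and `anyOfTerms` are hypothetical helpers that model one term per character, and the stored values in `main` (including "决赛") are made-up samples rather than data from the reporter's graph.

```go
package main

import "fmt"

// hanTerms is a hypothetical simplification: one term per character,
// collected into a set so that duplicates disappear.
func hanTerms(s string) map[rune]bool {
	set := make(map[rune]bool)
	for _, r := range s {
		set[r] = true
	}
	return set
}

// anyOfTerms reports whether value and query share at least one term.
func anyOfTerms(value, query string) bool {
	q := hanTerms(query)
	for r := range hanTerms(value) {
		if q[r] {
			return true
		}
	}
	return false
}

func main() {
	query := "第一轮第二轮"
	// Assumed sample values; only "决赛" shares no character with the query.
	for _, v := range []string{"第一轮", "第二轮", "第", "第一", "轮", "决赛"} {
		fmt.Printf("%s -> %v\n", v, anyOfTerms(v, query))
	}
}
```

With per-character terms, any value that contains 第, 一, 二 or 轮 satisfies the filter, which is consistent with the extra results reported above.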

A (Temporary) Solution

Here I would like to make an observation. You seem to be searching for a particular pattern that matches a Chinese phrase - 第.+轮. So maybe try func: regexp(name@zh, /^第.+轮*$/) and see if that helps?
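A quick way to check what such a pattern accepts is Go's regexp package. The sketch below compares the pattern as posted with a stricter variant that requires a trailing 轮; the variable names and sample values are assumptions made for illustration, and it says nothing about how Dgraph indexes or evaluates regexp.

```go
package main

import (
	"fmt"
	"regexp"
)

// strict requires the value to start with 第 and end with 轮.
var strict = regexp.MustCompile(`^第.+轮$`)

// loose is the pattern as posted; 轮* also allows zero occurrences of 轮.
var loose = regexp.MustCompile(`^第.+轮*$`)

func main() {
	// Assumed sample values, not taken from a real dataset.
	for _, v := range []string{"第一轮", "第二轮", "第一", "第", "轮"} {
		fmt.Printf("%s strict=%v loose=%v\n", v, strict.MatchString(v), loose.MatchString(v))
	}
}
```

Because 轮* is optional, the posted pattern also accepts a value like "第一", so the stricter form may be closer to the intent if only full "第…轮" phrases are wanted.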

dmai · August 19, 2020, 2:12pm

I figured Dgraph is doing language-specific term indexing? Correct me if I’m wrong, but this is set up in bleve.go:

https://github.com/dgraph-io/dgraph/blob/67a221a87c24bb701d6b81bb5e2f0b6bf5d05306/tok/bleve.go#L37-L58

// setupBleve creates bleve filters and analyzers that we use for term and fulltext tokenizers.
func setupBleve() {
	// unicode normalizer filter - simplifies unicode words using Normalization Form KC (NFKC)
	// See: http://unicode.org/reports/tr15/#Norm_Forms
	_, err := bleveCache.DefineTokenFilter(unicodenormName,
		map[string]interface{}{
			"type": unicodenorm.Name,
			"form": unicodenorm.NFKC,
		})
	x.Check(err)

	// term analyzer - splits on word boundaries, lowercase and normalize tokens.
	termAnalyzer, err = bleveCache.DefineAnalyzer("term",
		map[string]interface{}{
			"type":      custom.Name,
			"tokenizer": unicode.Name,
			"token_filters": []string{
				lowercase.Name,
				unicodenormName,
			},

This file has been truncated.

chewxy · August 19, 2020, 11:53pm

Not really. The code there has no reference to language specific analysis. According to their docs, what we’re doing in DefineTokenFilter and DefineAnalyzer is just to first do lowercase then unicode normalization.

Language specific tokenization is done in other analyzers in Dgraph, but not for the things that anyofterms or allofterms hits.

mrjn · August 20, 2020, 10:29am

Can you look into this a bit deeper and see if we should simplify the way we do tokenization @chewxy?

chewxy · January 13, 2021, 1:57am

2 posts were split to a new topic: Lang@ not indexing correctly

chewxy · January 13, 2021, 4:20am

Closing this issue as there was a fix made (though the fix may have caused a second set of issues - see the split topic.)
