Anyofterms doesn’t work as expected with Chinese characters

Hi, I had the same problem: I found that anyofterms doesn’t work properly with Chinese. I did the same thing and used @zh, but it didn’t work very well.

https://discuss.dgraph.io/t/anyofterms-doesnt-work-as-expected-with-chinese-characters/3581/5?u=soultrans

Hey @Soultrans,

Can you provide the dgraph version that you’re using and an example that’s not working for you?

I’m using version 20.03.

Examples are as follows:

{
  all(func: eq(class, "比赛日程")) {
    class
    related @filter(has(game_type) and anyofterms(game_class@zh, "<第一轮 第二轮>")) {
      uid
      game_class
      content
    }
  }
}

I want the “第一轮” or “第二轮” information. However, the results also include matches for “第”, “第一”, “第二”, “轮”…

Hi, can you please provide a small dataset and schema as well, to help us easily reproduce this? Also, have you tried running it on the latest release, “Shuri”?

General Reason Why It Doesn’t Work

This seems to be a word boundary issue. In English, words are bounded by whitespace, so “hello world” is two words. In Chinese, however, finding word boundaries is a Hard Problem (as in, you need proper machine learning to solve it).

“第一轮” translates to “round one” in English. In English it’s two words, "round" and "one". In Chinese, it’s three characters, each of which may or may not be a word.

So when you write anyofterms(game_class@zh, "第一轮 第二轮"), it’s the equivalent of writing in English anyofterms(game_class, "round one round two"), which will be uniquified to anyofterms(game_class, "round one two"). This is fine for English. It’s not fine for Chinese.
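To make the "uniquified" step concrete, here is a minimal Python sketch of whitespace-based term matching. It is only a model of the behaviour described above, not Dgraph's actual code; `unique_terms` and `any_of_terms` are hypothetical helper names.

```python
def unique_terms(text):
    """Split on whitespace, lowercase, and drop duplicates (first-seen order)."""
    seen = []
    for term in text.lower().split():
        if term not in seen:
            seen.append(term)
    return seen


def any_of_terms(value, query):
    """True if the stored value shares at least one term with the query."""
    return bool(set(unique_terms(value)) & set(unique_terms(query)))


print(unique_terms("round one round two"))  # ['round', 'one', 'two']
print(any_of_terms("round three", "round one round two"))  # True
```

Note that even in English, "round three" matches, because it shares the term "round" with the query. With whitespace-delimited words that is usually what you want.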

Specific Reason Why It Doesn’t Work Right Now

Specifically, the Bleve tokenizer that Dgraph uses does not do language-based segmentation. So after passing the query through the tokenizer and then uniquifying the results, we get the equivalent of running anyofterms(game_class@zh, "第 一 轮 二").
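A rough way to picture this, assuming (as a simplification) that every Chinese character ends up as its own term. This is a toy model in Python, not the Bleve tokenizer itself, and `cjk_terms` and `any_of_terms_cjk` are made-up names.

```python
def cjk_terms(text):
    """Toy model: every non-space character becomes its own term, deduplicated."""
    seen = []
    for ch in text:
        if ch.isspace():
            continue
        if ch not in seen:
            seen.append(ch)
    return seen


def any_of_terms_cjk(value, query):
    """True if the value shares at least one single-character term with the query."""
    return bool(set(cjk_terms(value)) & set(cjk_terms(query)))


print(cjk_terms("第一轮 第二轮"))  # ['第', '一', '轮', '二']
print(any_of_terms_cjk("第三轮", "第一轮 第二轮"))  # True
```

Under this model, "第三轮" (round three) matches a query for round one or round two, because it shares "第" and "轮" with the query. That is consistent with the unexpected results reported above.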

A (Temporary) Solution

Here I would like to make an observation. You seem to be searching for a particular pattern that matches a Chinese phrase - 第.+轮. So maybe try func: regexp(game_class@zh, /^第.+轮$/) and see if that helps?
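A quick way to sanity-check the pattern before putting it in a query. This sketch uses Python's `re` module purely for illustration; Dgraph evaluates regexp with Go's regular expression syntax, which should behave the same for a pattern this simple. As far as I know, regexp in Dgraph also needs a trigram index on the predicate, so check your schema.

```python
import re

candidates = ["第一轮", "第二轮", "第", "第一", "轮", "第轮"]

# Anchored pattern: starts with 第, ends with 轮, at least one character between.
strict = re.compile(r"^第.+轮$")
# Variant with a trailing *: 轮 becomes optional, so the pattern is looser.
loose = re.compile(r"^第.+轮*$")

strict_matches = [s for s in candidates if strict.search(s)]
loose_matches = [s for s in candidates if loose.search(s)]

print(strict_matches)  # ['第一轮', '第二轮']
print(loose_matches)   # ['第一轮', '第二轮', '第一', '第轮']
```

The `*` after 轮 makes that character optional, so values like "第一" slip through. Leaving it off keeps the match to full "第…轮" phrases.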


I figured Dgraph is doing language-specific term indexing? Correct me if I’m wrong, but this is set up in bleve.go:

Not really. The code there has no reference to language specific analysis. According to their docs, what we’re doing in DefineTokenFilter and DefineAnalyzer is just to first do lowercase then unicode normalization.

Language specific tokenization is done in other analyzers in Dgraph, but not for the things that anyofterms or allofterms hits.
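For reference, "lowercase then unicode normalization" amounts to something like the Python sketch below. The NFKC form is an assumption on my part (check the analyzer definition for the form actually configured), and `normalize_term` is a hypothetical name.

```python
import unicodedata


def normalize_term(term):
    # Lowercase first, then normalize. NFKC is assumed here, not confirmed.
    return unicodedata.normalize("NFKC", term.lower())


print(normalize_term("Ｒound"))  # full-width 'Ｒ' folds to ASCII: round
print(normalize_term("第一轮"))   # Chinese characters pass through unchanged
```

Neither step knows anything about where a Chinese word starts or ends, which is why this pipeline alone cannot keep "第一轮" together as one term.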

Can you look into this a bit deeper and see if we should simplify the way we do tokenization @chewxy?

2 posts were split to a new topic: Lang@ not indexing correctly

Closing this issue as there was a fix made (though the fix may have caused a second set of issues - see the split topic.)