Anyofterms doesn't work as expected with Chinese characters

Hi there,
As I understand it, anyofterms(name, “Name1 Name2”) lists all nodes whose name contains “Name1” or “Name2”.
If that’s correct, anyofterms(name, “名字1 名字2”) should list all nodes with “名字1” or “名字2”, but it lists the nodes with “名” and “字” instead. Is this the designed behavior? I'd appreciate any response, thanks.
BTW, it makes no difference whether I change name to name@zh or not.

Hmm, curious. Perhaps this is because Dgraph treats each logogram as a distinct word. The grammar of a writing system like Chinese is very different from Western ones. We would need to check whether that is indeed the case.

@gus can you check this?

OK, I’ll wait for the updates. Thanks

The term index tokenizes and normalizes the terms added. So for “名字1 名字2”, the resulting tokens are “1, 2, 名, 字”. The reason is that we are using Unicode segmentation. We are considering changing this for Chinese and Japanese. I’ll let you know.
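
To illustrate the behavior described above, here is a rough Python sketch of why per-character segmentation makes the query match “名” and “字”. It is only an approximation under stated assumptions, not Dgraph's actual tokenizer: `is_han`, `term_tokens`, and `any_of_terms` are hypothetical helper names, and only the basic CJK Unified Ideographs range is treated as Han.

```python
def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Han.
    return "\u4e00" <= ch <= "\u9fff"


def term_tokens(text):
    """Approximate term tokenization: each Han character becomes its own
    token, while runs of other letters/digits form a single lowercased token."""
    tokens = set()
    run = []

    def flush():
        if run:
            tokens.add("".join(run).lower())
            run.clear()

    for ch in text:
        if is_han(ch):
            flush()
            tokens.add(ch)       # every ideograph is a separate "word"
        elif ch.isalnum():
            run.append(ch)       # keep building a Latin/digit run
        else:
            flush()              # whitespace or punctuation ends the run
    flush()
    return sorted(tokens)


def any_of_terms(stored_value, query):
    """True if the stored value shares at least one token with the query."""
    return bool(set(term_tokens(stored_value)) & set(term_tokens(query)))


print(term_tokens("名字1 名字2"))
print(any_of_terms("名", "名字1 名字2"))
print(any_of_terms("Name1", "Name1 Name2"))
```

Under this sketch a node whose name is just “名” shares a token with the query, which is consistent with the results reported in the original question, whereas “Name1” stays one token and matches only as a whole word.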

The new release of Dgraph 1.0.11 has improved support for language tokenization. Unfortunately, the handling of these words is not optimal. I’m opening an issue to work on that more.

But in 1.0.11 you can do:

# schema
name: string @index(term) @lang .

# mutate
{set{
   _:x1 name "名字1"@zh .
   _:x2 name "名字2"@zh .
}}

# query
{
   q(func: anyofterms(name@., "<名字1 名字2>")) {
     name@.
   }
}

Try that and see if it works for you.

Thanks for the update. I tried it in v1.0.11-rc4 and it works.
BTW, to make it clear for new readers: the language tag “@zh” must be specified when doing the mutation, as in _:x1 name “名字1”@zh .