Anyofterms doesn't work as expected with Chinese characters


#1

Hi there,
As I understand it, anyofterms(name, “Name1 Name2”) lists all nodes whose name contains “Name1” or “Name2”.
If that’s correct, anyofterms(name, “名字1 名字2”) should list all nodes whose name contains “名字1” or “名字2”, but instead it lists the nodes containing “名” or “字”. Is this the designed behavior? I’d appreciate a response, thanks.
BTW, it makes no difference whether I change name to name@zh or not.


(Michel Conrado (Support Engineer)) #2

Hmm, curious. Perhaps this is because Dgraph treats each logogram as a distinct word. The grammar of a writing system like Chinese is very different from Western ones. We would need to check whether that is indeed the case.

@gus can you check this?


#3

OK, I’ll wait for the updates. Thanks


(Gus) #4

The term index tokenizes and normalizes the terms that are added. So “名字1 名字2” is tokenized into “1, 2, 名, 字”. The reason is that we are using Unicode segmentation. We are considering changing this for Chinese and Japanese. I’ll let you know.
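
To illustrate the splitting described above, here is a minimal Python sketch. It is not Dgraph's tokenizer; it only approximates the effect of Unicode word segmentation, under the assumption that each Han ideograph becomes its own token while runs of other letters and digits stay together. The names `is_han` and `sketch_term_tokens` are hypothetical.

```python
def is_han(ch):
    # Rough check: basic CJK Unified Ideographs block only (an assumption
    # made for this sketch; real segmentation covers more ranges).
    return "\u4e00" <= ch <= "\u9fff"


def sketch_term_tokens(text):
    """Approximate term tokens: lowercase, split on non-alphanumerics,
    and emit every Han ideograph as a separate token."""
    tokens = set()
    run = []  # current run of non-Han letters/digits

    def flush():
        if run:
            tokens.add("".join(run))
            run.clear()

    for ch in text.lower():
        if is_han(ch):
            flush()          # an ideograph ends any pending run
            tokens.add(ch)   # and stands alone as a token
        elif ch.isalnum():
            run.append(ch)
        else:
            flush()          # whitespace or punctuation ends the run
    flush()
    return sorted(tokens)


print(sketch_term_tokens("名字1 名字2"))   # ['1', '2', '名', '字']
print(sketch_term_tokens("Name1 Name2"))  # ['name1', 'name2']
```

Under this assumption, a query for “名字1 名字2” is really a query for any of the tokens 1, 2, 名, 字, which is why nodes containing only “名” or “字” match.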


(Gus) #5

The new release of Dgraph 1.0.11 has improved support for language tokenization. Unfortunately, the handling of these words is not optimal. I’m opening an issue to work on that more.

But in 1.0.11 you can do:

# schema
name: string @index(term) @lang .

# mutate
{set{
   _:x1 name "名字1"@zh .
   _:x2 name "名字2"@zh .
}}

# query
{
   q(func: anyofterms(name@., "<名字1 名字2>")) {
     name@.
   }
}

Try that and see if it works for you.


#6

Thanks for the update. I tried it in v1.0.11-rc4 and it works.
BTW, to make it clear for new readers: the language tag “@zh” must be specified in the mutation, e.g. _:x1 name “名字1”@zh .