Anyofterms doesn't work as expected with Chinese characters

peterZ · November 12, 2018, 7:23am

Hi there,
As I understand it, anyofterms(name, "Name1 Name2") lists all nodes whose name contains "Name1" or "Name2".
If that's correct, anyofterms(name, "名字1 名字2") should list all nodes whose name contains "名字1" or "名字2", but it lists the nodes containing "名" or "字". Is this the designed behavior? I'd appreciate any response, thanks.
BTW, it makes no difference whether I change name to name@zh or not.

MichelDiz · November 12, 2018, 4:09pm

Hmm, curious. Perhaps this is because Dgraph treats each logogram as a distinct word. The grammar of a writing system like Chinese is very different from Western ones. We would need to check whether that is indeed the case.

@gus can you check this?

peterZ · November 15, 2018, 1:48am

OK, I’ll wait for the updates. Thanks

gus · November 15, 2018, 2:45am

The term index tokenizes and normalizes the terms added. So “名字1 名字2” is tokenized into “1, 2, 名, 字”, because we are using Unicode segmentation. We are considering changing this for Chinese and Japanese. I’ll let you know.
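
To make the effect concrete, here is a rough Python sketch of that behavior. It is not Dgraph's tokenizer: `approx_term_tokens` and `any_of_terms` are hypothetical helpers that only approximate Unicode word segmentation by splitting every Han ideograph (checked against the basic CJK Unified Ideographs block only) into its own token, while keeping runs of letters and digits together.

```python
def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def approx_term_tokens(text):
    """Approximate term tokens: one token per Han character,
    one token per run of other letters/digits, lowercased and deduplicated."""
    tokens = set()
    current = []

    def flush():
        if current:
            tokens.add("".join(current))
            current.clear()

    for ch in text.lower():
        if is_han(ch):
            flush()          # a Han character ends any pending run
            tokens.add(ch)   # and becomes a token on its own
        elif ch.isalnum():
            current.append(ch)
        else:
            flush()          # whitespace or punctuation separates tokens
    flush()
    return sorted(tokens)


def any_of_terms(value, query):
    """True if the value shares at least one token with the query."""
    return bool(set(approx_term_tokens(value)) & set(approx_term_tokens(query)))


print(approx_term_tokens("Name1 Name2"))
print(approx_term_tokens("名字1 名字2"))
print(any_of_terms("名单", "名字1 名字2"))
```

Under these assumptions, "Name1 Name2" stays as two whole terms, whereas "名字1 名字2" falls apart into single characters and digits, so any value containing "名" or "字" shares a token with the query.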

gus · November 29, 2018, 11:09pm

The new release of Dgraph 1.0.11 has improved support for language tokenization. Unfortunately, the handling of these words is not optimal. I’m opening an issue to work on that more.

But in 1.0.11 you can do:

# schema
name: string @index(term) @lang .

# mutate
{set{
   _:x1 name "名字1"@zh .
   _:x2 name "名字2"@zh .
}}

# query
{
   q(func: anyofterms(name@., "<名字1 名字2>")) {
     name@.
   }
}

Try that and see if it works for you.

peterZ · December 3, 2018, 2:57am

Thanks for the update. I tried it in v1.0.11-rc4 and it works.
BTW, to make it clear for new readers: the language tag “@zh” must be specified when doing the mutation, e.g. _:x1 name "名字1"@zh .
