@lang not indexing correctly, breaking `anyofterms` and `allofterms`

Hi, I saw that language support for term tokenization was added in the Dgraph v20.11.0 release, so I started Dgraph with Docker and ran some tests. I found that term queries in v20.11.0 handle Chinese poorly: they return no results.

The docker-compose file is as follows:


```yaml
version: "2.3"

services:
  zero:
    image: dgraph/dgraph:v20.11.0
    volumes:
      - ./dgraph_data:/dgraph
    ports:
      - 5280:5080
      - 6280:6080
    restart: on-failure
    command: dgraph zero --my=zero:5080

  server:
    image: dgraph/dgraph:v20.11.0
    volumes:
      - ./dgraph_data:/dgraph
    ports:
      - 8280:8080
      - 9280:9080
    restart: on-failure
    command: dgraph alpha --whitelist 0.0.0.0:255.255.255.255 --my=server:7080 --lru_mb=20480 --zero=zero:5080 --postings out/0/p

  ratel:
    image: dgraph/dgraph:v20.11.0
    volumes:
      - ./dgraph_data:/dgraph
    ports:
      - 8200:8000
    command: dgraph-ratel
```

After startup, import the test data:


```shell
curl -H "Content-Type: application/rdf" "192.168.31.131:8280/mutate?commitNow=true" -XPOST -d $'
{
  set {
   _:diyijie <title@zh> "第一届三中全会议纪要" .
   _:diyijie <dgraph.type> "Paper" .
   _:dierjie <title@zh> "第二届三中全会议纪要" .
   _:dierjie <dgraph.type> "Paper" .
   _:disanjie <title@zh> "第三届三中全会议纪要" .
   _:disanjie <dgraph.type> "Paper" .
   _:diyierjie <title@zh> "第一届第二届三中全会议纪要" .
   _:diyierjie <dgraph.type> "Paper" .
   _:yimadangxian <title@zh> "一马当先的由来" .
   _:yimadangxian <dgraph.type> "Paper" .
   _:huanjie <title@zh> "换届的影响" .
   _:huanjie <dgraph.type> "Paper" .
  }
}
' | python -m json.tool | less
```

Create a new schema:


```shell
curl "192.168.31.131:8280/alter" -XPOST -d $'
  title: string @index(hash,term,trigram,fulltext) @lang .
  type Paper {
    title
  }
' | python -m json.tool | less
```

DQL query:


```
{
  me(func: anyofterms(title@zh, "第一届")) {
    title@zh
    uid
    dgraph.type
  }
}
```

Query results:


```
# v20.11.0
{
  "data": {
    "me": []
  }
}

# v20.07.0
{
  "data": {
    "me": [
      {
        "title@zh": "一马当先的由来",
        "uid": "0x2711",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "换届的影响",
        "uid": "0x2712",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第一届三中全会议纪要",
        "uid": "0x2713",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第二届三中全会议纪要",
        "uid": "0x2714",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第三届三中全会议纪要",
        "uid": "0x2715",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第一届第二届三中全会议纪要",
        "uid": "0x2716",
        "dgraph.type": [
          "Paper"
        ]
      }
    ]
  }
}
```

As you can see, v20.11.0 returns no results at all. Although v20.07.0's results reflect a different design, it at least returns relevant matches.

In fact, I want to use the term index to implement fuzzy search for Chinese, but v20.11.0 currently can't find anything.

Did I do something wrong?

One second. I’m trying to replicate this.

I have confirmed that this does not work in T’challa.

Shuri (the results are wrong):

T’challa (no results, even worse!):

I further checked with different languages: I added book titles in several other languages and ran the same queries.

List of Experiments

| Data | Predicate | Query snippet | Language | OK? |
|---|---|---|---|---|
| 第一届三中全会议纪要 | title@zh | anyofterms(title@zh, "第一届") | Zh | No |
| 第一届三中全会议纪要 | title@zh | allofterms(title@zh, "第一届三中全会议纪要") | Zh | No |
| 第一届三中全会议纪要 | title@zh | anyofterms(title@zh, "第一届三中全会议纪要") | Zh | No |
| Der Tod in Venedig | title@de | anyofterms(title@de, "Der") | De | Yes |
| 海辺のカフカ | title@ja | allofterms(title@ja, "海辺のカフカ") | Ja | No |
| 海辺のカフカ | title@ja | anyofterms(title@ja, "海辺") | Ja | No |
| ฟ้าใหม่ | title@th | anyofterms(title@th, " ฟ้าใหม่") | Th | No |
| 엄마를 부탁해1 | title@ko | func:anyofterms(title@ko, "엄마를") | Ko | Yes |
| 엄마를부탁해 | title@ko | func:anyofterms(title@ko, "엄마를") | Ko | Yes (it's not supposed to have results) |

Where Is The Source of the Problem

Having tested these languages, it appears that the tokenization process is broken for languages that do not use whitespace as a word delimiter.

The prime candidates are:

  1. Commit 4b2d50b
    Relevant source lines: https://github.com/dgraph-io/dgraph/blob/master/tok/tok.go#L292-L295
  2. Commit bdea6d
    This is the one where I upgraded our tokenizer, Bleve.

Both of these commits are mine, so I shall be looking into this.

What Happens Next

I’ll run a git bisect to find out what broke. Once that’s done, I’ll put the fix up in a PR and then compile a new build for you to test.

ETA is roughly 1.5 days.

What Will Be Done to Prevent This in the Future

The main issue is that both commits pass the tests, which indicates that our tests are not good enough. Some end-to-end testing with language tags is probably required. More on this later.


Thank you very much; I hope to hear good news from you soon.

A quick update. I have figured out what’s happening. When indexing a value, space is used as the delimiter even for languages that do not use spaces between words. Thus “第一届三中全会议纪要” is treated as a single term, whereas a Chinese speaker would probably tokenize it as:

  • 第一届
  • 中全会议
  • 纪要

The proper fix for this is larger than I expected. Give me a couple of weeks.

Hello, I think whether it is possible to change the way of thinking, if it is in other languages (not with spaces as separated words, such as Chinese) we also require the use of spaces to separate when querying in term, and then directly in the graph database Inquire. Back-end query logic, you don’t need to separate the related predicate values in the database according to the word first, we can directly match the word according to the input need to match the search directly, in this way, there is no need to segment the related predicate value, and the word segmentation of different languages There are still some gaps in logic. Chinese word segmentation is also more difficult, and the workload involved is relatively large, and query efficiency may also decrease as a result; what do you think?

In the meantime, there is a temporary solution: use fulltext indexing for the fields, and search using anyoftext or alloftext.
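For example, the schema change and query might look like this (a sketch reusing the `title` predicate from above; whether the fulltext tokenizer actually segments Chinese correctly depends on the same underlying analyzer):

```
title: string @index(fulltext) @lang .

{
  me(func: alloftext(title@zh, "第一届")) {
    title@zh
    uid
  }
}
```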

I tried it, but it didn’t work.