@lang not indexing correctly, breaking `anyofterms` and `allofterms`

Hi, I saw that language support for term tokenization was added in the Dgraph v20.11.0 release, so I started Dgraph with Docker and ran some tests. I found that term queries in v20.11.0 handle Chinese poorly: they return no results.

The docker-compose file is as follows:


```yaml
version: "2.3"

services:
  zero:
    image: dgraph/dgraph:v20.11.0
    volumes:
      - ./dgraph_data:/dgraph
    ports:
      - 5280:5080
      - 6280:6080
    restart: on-failure
    command: dgraph zero --my=zero:5080

  server:
    image: dgraph/dgraph:v20.11.0
    volumes:
      - ./dgraph_data:/dgraph
    ports:
      - 8280:8080
      - 9280:9080
    restart: on-failure
    command: dgraph alpha --whitelist 0.0.0.0:255.255.255.255 --my=server:7080 --lru_mb=20480 --zero=zero:5080 --postings out/0/p

  ratel:
    image: dgraph/dgraph:v20.11.0
    volumes:
      - ./dgraph_data:/dgraph
    ports:
      - 8200:8000
    command: dgraph-ratel
```

After startup, import the test data:


```shell
curl -H "Content-Type: application/rdf" "192.168.31.131:8280/mutate?commitNow=true" -XPOST -d $'
{
  set {
   _:diyijie <title@zh> "第一届三中全会议纪要" .
   _:diyijie <dgraph.type> "Paper" .
   _:dierjie <title@zh> "第二届三中全会议纪要" .
   _:dierjie <dgraph.type> "Paper" .
   _:disanjie <title@zh> "第三届三中全会议纪要" .
   _:disanjie <dgraph.type> "Paper" .
   _:diyierjie <title@zh> "第一届第二届三中全会议纪要" .
   _:diyierjie <dgraph.type> "Paper" .
   _:yimadangxian <title@zh> "一马当先的由来" .
   _:yimadangxian <dgraph.type> "Paper" .
   _:huanjie <title@zh> "换届的影响" .
   _:huanjie <dgraph.type> "Paper" .
  }
}
' | python -m json.tool | less
```

Create a new schema:


```shell
curl "192.168.31.131:8280/alter" -XPOST -d $'
  title: string @index(hash,term,trigram,fulltext) @lang .
  type Paper {
    title
  }
' | python -m json.tool | less
```

DQL query:


```
{
  me(func: anyofterms(title@zh, "第一届")) {
    title@zh
    uid
    dgraph.type
  }
}
```

Query results:


```
# v20.11.0
{
  "data": {
    "me": []
  }
}

# v20.07.0
{
  "data": {
    "me": [
      {
        "title@zh": "一马当先的由来",
        "uid": "0x2711",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "换届的影响",
        "uid": "0x2712",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第一届三中全会议纪要",
        "uid": "0x2713",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第二届三中全会议纪要",
        "uid": "0x2714",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第三届三中全会议纪要",
        "uid": "0x2715",
        "dgraph.type": [
          "Paper"
        ]
      },
      {
        "title@zh": "第一届第二届三中全会议纪要",
        "uid": "0x2716",
        "dgraph.type": [
          "Paper"
        ]
      }
    ]
  }
}
```

As you can see, v20.11.0 returns no results at all. Although v20.07.0's results reflect a different design, it at least returns relevant matches.

In fact, I want to use the term index to implement fuzzy search for Chinese, but v20.11.0 currently can't find anything.

Did I do something wrong?

One second. I’m trying to replicate this.

I have confirmed that this does not work in T’challa.

Shuri (the results are wrong):

T’challa (no results, even worse!):

I further checked with different languages: I added book titles in several other languages and ran the same queries.

List of Experiments

| Data | Predicate | Query snippet | Language | OK? |
|---|---|---|---|---|
| 第一届三中全会议纪要 | title@zh | anyofterms(title@zh, "第一届") | Zh | No |
| 第一届三中全会议纪要 | title@zh | allofterms(title@zh, "第一届三中全会议纪要") | Zh | No |
| 第一届三中全会议纪要 | title@zh | anyofterms(title@zh, "第一届三中全会议纪要") | Zh | No |
| Der Tod in Venedig | title@de | anyofterms(title@de, "Der") | De | Yes |
| 海辺のカフカ | title@ja | allofterms(title@ja, "海辺のカフカ") | Ja | No |
| 海辺のカフカ | title@ja | anyofterms(title@ja, "海辺") | Ja | No |
| ฟ้าใหม่ | title@th | anyofterms(title@th, " ฟ้าใหม่") | Th | No |
| 엄마를 부탁해1 | title@ko | func:anyofterms(title@ko, "엄마를") | Ko | Yes |
| 엄마를부탁해 | title@ko | func:anyofterms(title@ko, "엄마를") | Ko | Yes (it's not supposed to have results) |

Where Is The Source of the Problem

Having tested these languages, it appears that the tokenization process is broken for languages that do not use whitespace as a word delimiter.

The prime candidates are:

  1. Commit 4b2d50b
    Relevant source lines: https://github.com/dgraph-io/dgraph/blob/master/tok/tok.go#L292-L295
  2. Commit bdea6d
    This is the one where I upgraded our tokenizer, Bleve.

Both of these commits are mine, so I shall be looking into this.

What Happens Next

I’ll run a git bisect to find out what broke. Once that’s done, I’ll put the fix up in a PR and then compile a new build for you to test.

ETA is roughly 1.5 days.

What Will Be Done to Prevent This in the Future

The main issue is that both commits pass the tests, which indicates that our tests are not good enough. Some end-to-end testing with language tags is probably required. More on this later.


Thank you very much; I hope to hear good news from you soon.

A quick update. I have figured out what’s happening. When indexing a value, space is used as the delimiter even for languages that do not use spaces between words. Thus “第一届三中全会议纪要” is treated as a single term, whereas a Chinese speaker would probably tokenize it as:

  • 第一届
  • 中全会议
  • 纪要

The proper fix for this is larger than I expected. Give me a couple of weeks.

Hello, I think whether it is possible to change the way of thinking, if it is in other languages (not with spaces as separated words, such as Chinese) we also require the use of spaces to separate when querying in term, and then directly in the graph database Inquire. Back-end query logic, you don’t need to separate the related predicate values in the database according to the word first, we can directly match the word according to the input need to match the search directly, in this way, there is no need to segment the related predicate value, and the word segmentation of different languages There are still some gaps in logic. Chinese word segmentation is also more difficult, and the workload involved is relatively large, and query efficiency may also decrease as a result; what do you think?

In the meantime, there is a temporary solution: use fulltext indexing for the fields, and search using anyoftext or alloftext.
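For example, the schema change and query might look like this (a sketch reusing the `title` predicate from above; whether the fulltext tokenizer actually segments Chinese correctly depends on the same underlying analyzer):

```
title: string @index(fulltext) @lang .

{
  me(func: alloftext(title@zh, "第一届")) {
    title@zh
    uid
  }
}
```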

I tried it, but it didn’t work.