The allofterms fuzzy search produces incorrect results

lee.chen · March 12, 2021, 7:01am

Dgraph: v20.11.2

The allofterms fuzzy search produces incorrect results.

The scenario is as follows:
Schema:

		apple_movietrailer_id: string .
		fandango_id: string .
		initial_release_date: datetime @index(year) .
		metacritic_id: string .
		name: string @index(hash, term, trigram, fulltext) @lang .
		netflix_id: string .
		prequel: [uid] .
		rottentomatoes_id: string .
		actor.dubbing_performances: [uid] .
		rating: [uid] @reverse .
		country: [uid] @reverse .
		rated: [uid] @reverse .

		type Film {
			apple_movietrailer_id: string
			fandango_id: string
			initial_release_date: dateTime
			metacritic_id: string
			name: string
			netflix_id: string
			prequel: [Film]
			rottentomatoes_id: string
		}
		
		type Actor {
			name: string
			actor.dubbing_performances: [Film]
		}

When I load the following data:

	{
		set{
			_:a <name> "Jackie Chan"@en .
			_:b <name> "Jet Li"@en .
			_:c <name> "Bruce Lee"@en .
			_:a <name> "成龙"@cn .
			_:b <name> "李连杰"@cn .
			_:c <name> "李小龙"@cn .

			_:a <dgraph.type> "Actor" .
			_:b <dgraph.type> "Actor" .
			_:c <dgraph.type> "Actor" .
	
		}
	}

The results were correct:

// query
	{
		var(func: allofterms(<name>@., "成龙")) @filter(allofterms(<name>@., "成龙") and type(<Actor>)) {
			uid0 as uid
		}
		statistics(func: uid(uid0)) { count(uid) }
		q(func: uid(uid0), first: 40, offset: 0) {
			dgraphType: dgraph.type
			expand(_all_)
		}
	}

// results
{"statistics":[{"count":1}],"q":[{"dgraphType":["Actor"],"name@en":"Jackie Chan","name@cn":"成龙"}]}

But when I load the following data instead (identical except that the language tag is zh rather than cn):

	{
		set{
			_:a <name> "Jackie Chan"@en .
			_:b <name> "Jet Li"@en .
			_:c <name> "Bruce Lee"@en .
			_:a <name> "成龙"@zh .
			_:b <name> "李连杰"@zh .
			_:c <name> "李小龙"@zh .

			_:a <dgraph.type> "Actor" .
			_:b <dgraph.type> "Actor" .
			_:c <dgraph.type> "Actor" .
	
		}
	}

The results were incorrect:

// query
	{
		var(func: allofterms(<name>@., "成龙")) @filter(allofterms(<name>@., "成龙") and type(<Actor>)) {
			uid0 as uid
		}
		statistics(func: uid(uid0)) { count(uid) }
		q(func: uid(uid0), first: 40, offset: 0) {
			dgraphType: dgraph.type
			expand(_all_)
		}
	}

// results
{"statistics":[{"count":0}],"q":[]}

Is this a bug? The only change I made was replacing cn with zh.

chewxy · March 15, 2021, 2:10am

This is a known issue and work has been done on it. The solution, it turned out, was to use a custom tokenizer for CJK languages. It hasn’t yet been merged into the mainline. I’ll get it done this week.

Also, yes, you should use zh instead of cn.

lee.chen · March 15, 2021, 2:20am

Thank you very much.

03B037 · March 19, 2021, 1:54am

@chewxy How is the job going and how can I receive the latest news? I’m concerned about that, too.

lee.chen · April 6, 2021, 2:28am

Hello, does this problem still exist in Dgraph v20.11.3?

chewxy · April 6, 2021, 2:35am

Yes, it still exists. I built a new tokenizer, but I have not yet incorporated it into the main repo.

lee.chen · July 6, 2021, 6:51am

Excuse me, have you made any progress on this?
