How to use regexp in @filter

I had about 50million nodes,if I use ‘regexp’ like

“{
resources(func:regexp(name, /abc/i) )@filter(eq(workspace_key, "def")) {
id
name
resource_key
}
}”,
it will be fast with in 1 second,however,
if I use it like "{
resources(func: eq(workspace_key, "def"))@filter(regexp(name, /abc/i)) {
id
name
resource_key
}
}",

it will be useless and overtime. Most of the time I have to spell out a complex logical combination, so it’s impossible to put a regexp in “func” but filter.
So I really wanna know whether the index is invalid when I use ‘regexp’ in filter?
And if somebody has a good idea to solve this problem?

Hey @qwe1593575564,

Could you post your schema (or a subset thereof that illustrates the directives you’re applying to name and workspace_key)?

(post deleted by author)

I set “name” with “trigram” and “term”, “workspace_key” with “exact”
I hope that I can filter data from 50million nodes,however, only 1 million nodes,I can’t use search dql.

  1. If I use dql like ‘{
    resources(func: eq(workspace_key, “_xGWV7”))@filter(regexp(name, /EOR/i)) {
    id
    name
    resource_key
    }
    }’

its return is ‘{
“data”: {
“resources”: [
{
“id”: “3IawkShKWcAib”,
“name”: “EORandCPUandLow”,
“resource_key”: “scene”
},
{
“id”: “rCdURqC0RXdjF”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
},
{
“id”: “Sk9QT03MaVW4b”,
“name”: “EORCPU”,
“resource_key”: “scene”
},
{
“id”: “LMrPKhCkSRomv”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
}
]
},
“extensions”: {
“server_latency”: {
“parsing_ns”: 72705,
“processing_ns”: 10924025523,
“encoding_ns”: 103204,
“assign_timestamp_ns”: 650475,
“total_ns”: 10924935738
},
“txn”: {
“start_ts”: 47708734
},
“metrics”: {
“num_uids”: {
“_total”: 1021224,
“id”: 4,
“name”: 1021216,
“resource_key”: 4,
“workspace_key”: 0
}
}
}
}’
,the search time more than 15s, sometime it will be overtime

2.If I use dql like ‘{
resources(func: regexp(name, /EOR/i))@filter(eq(workspace_key, “_xGWV7”)) {
id
name
resource_key
}
}’,

its return is '{
“data”: {
“resources”: [
{
“id”: “3IawkShKWcAib”,
“name”: “EORandCPUandLow”,
“resource_key”: “scene”
},
{
“id”: “rCdURqC0RXdjF”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
},
{
“id”: “Sk9QT03MaVW4b”,
“name”: “EORCPU”,
“resource_key”: “scene”
},
{
“id”: “LMrPKhCkSRomv”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
}
]
},
‘{
“data”: {
“resources”: [
{
“id”: “3IawkShKWcAib”,
“name”: “EORandCPUandLow”,
“resource_key”: “scene”
},
{
“id”: “rCdURqC0RXdjF”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
},
{
“id”: “Sk9QT03MaVW4b”,
“name”: “EORCPU”,
“resource_key”: “scene”
},
{
“id”: “LMrPKhCkSRomv”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
}
]
},
“extensions”: {
“server_latency”: {
“parsing_ns”: 98835,
“processing_ns”: 65669294,
“encoding_ns”: 110217,
“assign_timestamp_ns”: 828831,
“total_ns”: 66795915
},
“txn”: {
“start_ts”: 47823833
},
“metrics”: {
“num_uids”: {
“_total”: 200,
“id”: 4,
“name”: 4,
“resource_key”: 4,
“workspace_key”: 188
}
}
}
}’
And search time is only 100ms!

So I hope that if there is any relevant solution, please reply, thank you!

You might have some nodes without a name attribute, thus making it faster filtering name first.
What result do you get for the following query?

{
    nameCount(func: has(name)){
        count(uid)
    }
    
    keyCount(func: has(workspace_key)){
        count(uid)
    }
}

A good tactic to use in these situations is to use multiple query blocks with vars. See: Multiple Query Blocks with DQL - Query language

Basically the idea would be to first create a uid set of for your regexp results, then in a final block, filter the var results by workspace_key. That might be a solution, hard to know without actual data.

I tried this, but I just gave a simple example.
Firstly, if I give a not regexp, this method will be difficult again.
Secondly,if I match too many uids with a regexp,I can’t use this var results in finnal block.
Thirdly,as I said above, I just gave a simple demo. In fact, multiple regular expressions, equals, or other comparisons will be used. Maybe I will write the dql is :

{
m (func: eq(resource_key, "requirement"))@filter(eq(workspace_key,"_xGWV7")  AND( (( eq(is_for_commercial,true) OR eq(is_for_commercial,true) )AND gt(total_autocase_number,10)  AND ( regexp(requirement_purpose,/use/i)  AND ge(total_testcase_number,50) AND lt(total_testcase_number,1001)) )  OR ((eq(requirement_delivery_type,"1") OR eq(requirement_delivery_status,"6"))AND regexp(qa,/wANL/i) AND NOT regexp(tse,/zeng/i) AND regexp(requirement_id,/R72/i) AND regexp(market_requirement_id,/xGW7/i) and regexp(requirement_delivery_sm,/zhang/i) AND NOT eq(tse,"chase")) OR (regexp(testcase_url,/https:\/\/www.baidu.com.\/rdcloud\/workspaces\/xGW7\/apps\/test\/testlib\/xGW7/i) AND regexp(qa,/wanl/i)) AND NOT eq(total_autocase_number,0) AND NOT eq(total_testcase_number,0) AND eq(requirement_delivery_status,"6") )
  OR( regexp(name,/SMF/i) AND (regexp(tse,/xx/i) OR regexp(tse,/bb/i))AND (regexp(requirement_delivery_sm,/cxb/i) OR regexp(requirement_delivery_sm,/wuxj/i))AND regexp(requirement_id,/R7.2/i)) 
  ) @recurse(depth:20){
  id
  name
  resourceKey : resource_key
  comment
  total_autocase_number
  total_testcase_number
  is_for_commercial
  market_requirement_id
  requirement_delivery_sm
  requirement_delivery_status
  requirement_delivery_type
  requirement_id
  requirement_purpose
  target_market
  tse
  delivery_report_url
  qa
  solution_specification_url
  test_design_url
  testcase_url

	parent :~parent_to_child @facets
  }
}

And here is All schema

 `<_basic_info>: string @index(term, trigram) .
<attachment>: [uid] .
<chafen>: string .
<comment>: string @index(term, trigram) .
<create_by>: string @index(term, trigram) .
<create_time>: datetime @index(hour) .
<creator>: string .
<data>: uid .
<delivery_report_url>: string @index(term, trigram) .
<description>: string @index(term, trigram) .
<dgraph.drop.op>: string .
<dgraph.graphql.p_query>: string @index(sha256) .
<dgraph.graphql.schema>: string .
<dgraph.graphql.xid>: string @index(exact) @upsert .
<full_key>: string .
<gongnengshuoming>: string .
<id>: string @index(term) .
<is_for_commercial>: bool @index(bool) .
<last_update_by>: string @index(term, trigram) .
<last_update_time>: datetime @index(hour) .
<library_id>: string @index(term) .
<market_requirement_id>: string @index(term, trigram) .
<modifier>: string .
<name>: string @index(term, trigram) .
<op>: string .
<parent_to_child>: [uid] @reverse .
<peizhi>: string .
<qa>: string @index(term, trigram) .
<rdc>: string .
<rdc_areapath>: string .
<rdc_belong_product>: string .
<rdc_released_version>: string .
<rdc_requirement_priority>: string .
<rdc_workitem_type>: string @index(term, trigram) .
<relation>: [uid] @reverse .
<requirement_apply_location>: string @index(term, trigram) .
<requirement_delivery_sm>: string @index(term, trigram) .
<requirement_delivery_status>: string @index(term, trigram) .
<requirement_delivery_type>: string @index(term, trigram) .
<requirement_id>: string @index(term, trigram) .
<requirement_purpose>: string @index(term, trigram) .
<requirement_specification_url>: string @index(term, trigram) .
<requirement_time>: string @index(term, trigram) .
<requirement_title>: string @index(term, trigram) .
<requirement_version>: string @index(term, trigram) .
<resource_key>: string .
<rich_text>: string @index(term, trigram) .
<scene_factor_keyword>: string @index(term, trigram) .
<scene_mml>: string @index(term, trigram) .
<scene_test_check_point>: string @index(term, trigram) .
<scene_test_step>: string @index(term, trigram) .
<secondType>: string .
<shangyong>: string .
<shengxiaojiancha>: string .
<signaling_stream>: string @index(term, trigram) .
<solution_specification_url>: string @index(term, trigram) .
<target_market>: string @index(term, trigram) .
<test_design_url>: string @index(term, trigram) .
<testcase_url>: string @index(term, trigram) .
<third_party_id>: string .
<third_party_key>: string .
<third_party_url>: string @index(term, trigram) .
<title>: string @index(term, trigram) .
<total_autocase_number>: int @index(int) .
<total_testcase_number>: int @index(int) .
<tse>: string @index(term, trigram) .
<type>: string .
<version>: string .
<workspace_key>: string @index(exact) .
type <dgraph.graphql> {
	dgraph.graphql.schema
	dgraph.graphql.xid
}
type <dgraph.graphql.persisted_query> {
	dgraph.graphql.p_query
}`

In my database, every nodes have name and workspace_key.

Thanks, que, for adding this as a bug also: [How to use regexp in @filter]: <regexp use> · Issue #8996 · dgraph-io/dgraph · GitHub . We need to fix this to make things easy to use.

Meantime, I suggest using some rewriter as a workaround. How do you currently build the DQL, and is there an intermediate form of the query?

Parsing the DQL is harder but possible (simply find regexp(…) in the filter and swap it with func:somefunc(…) in the root query in text. If you post that script to the community it would be helpful.

If you have a query layer above with a list of conditions or an AND/OR tree structure or similar, it would be easier to programmatically ensure the longest regexp is used as func:regexp() at root.

Thanks for your reply. We are currently using the regexp result as the intermediate form, but if the intermediate variable produces too many results, it will still be invalid in the final query.
So when will this bug be fixed?