How to use regexp in @filter

qwe1593575564 · September 9, 2023, 4:37pm

I had about 50million nodes，if I use ‘regexp’ like

“{
resources(func:regexp(name, /abc/i) )@filter(eq(workspace_key, "def")) {
id
name
resource_key
}
}”，
it will be fast with in 1 second,however,
if I use it like "{
resources(func: eq(workspace_key, "def"))@filter(regexp(name, /abc/i)) {
id
name
resource_key
}
}",

it will be useless and overtime. Most of the time I have to spell out a complex logical combination, so it’s impossible to put a regexp in “func” but filter.
So I really wanna know whether the index is invalid when I use ‘regexp’ in filter？
And if somebody has a good idea to solve this problem?

matthewmcneely · September 11, 2023, 8:42pm

Hey @qwe1593575564,

Could you post your schema (or a subset thereof that illustrates the directives you’re applying to name and workspace_key)?

qwe1593575564 · September 14, 2023, 12:43am

(post deleted by author)

qwe1593575564 · September 14, 2023, 11:03am

I set “name” with “trigram” and “term”, “workspace_key” with “exact”
I hope that I can filter data from 50million nodes,however, only 1 million nodes,I can’t use search dql.

If I use dql like ‘{
resources(func: eq(workspace_key, “_xGWV7”))@filter(regexp(name, /EOR/i)) {
id
name
resource_key
}
}’

its return is ‘{
“data”: {
“resources”: [
{
“id”: “3IawkShKWcAib”,
“name”: “EORandCPUandLow”,
“resource_key”: “scene”
},
{
“id”: “rCdURqC0RXdjF”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
},
{
“id”: “Sk9QT03MaVW4b”,
“name”: “EORCPU”,
“resource_key”: “scene”
},
{
“id”: “LMrPKhCkSRomv”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
}
]
},
“extensions”: {
“server_latency”: {
“parsing_ns”: 72705,
“processing_ns”: 10924025523,
“encoding_ns”: 103204,
“assign_timestamp_ns”: 650475,
“total_ns”: 10924935738
},
“txn”: {
“start_ts”: 47708734
},
“metrics”: {
“num_uids”: {
“_total”: 1021224,
“id”: 4,
“name”: 1021216,
“resource_key”: 4,
“workspace_key”: 0
}
}
}
}’
,the search time more than 15s, sometime it will be overtime

2.If I use dql like ‘{
resources(func: regexp(name, /EOR/i))@filter(eq(workspace_key, “_xGWV7”)) {
id
name
resource_key
}
}’,

its return is '{
“data”: {
“resources”: [
{
“id”: “3IawkShKWcAib”,
“name”: “EORandCPUandLow”,
“resource_key”: “scene”
},
{
“id”: “rCdURqC0RXdjF”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
},
{
“id”: “Sk9QT03MaVW4b”,
“name”: “EORCPU”,
“resource_key”: “scene”
},
{
“id”: “LMrPKhCkSRomv”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
}
]
},
‘{
“data”: {
“resources”: [
{
“id”: “3IawkShKWcAib”,
“name”: “EORandCPUandLow”,
“resource_key”: “scene”
},
{
“id”: “rCdURqC0RXdjF”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
},
{
“id”: “Sk9QT03MaVW4b”,
“name”: “EORCPU”,
“resource_key”: “scene”
},
{
“id”: “LMrPKhCkSRomv”,
“name”: “EORandCPUandHigh”,
“resource_key”: “scene”
}
]
},
“extensions”: {
“server_latency”: {
“parsing_ns”: 98835,
“processing_ns”: 65669294,
“encoding_ns”: 110217,
“assign_timestamp_ns”: 828831,
“total_ns”: 66795915
},
“txn”: {
“start_ts”: 47823833
},
“metrics”: {
“num_uids”: {
“_total”: 200,
“id”: 4,
“name”: 4,
“resource_key”: 4,
“workspace_key”: 188
}
}
}
}’
And search time is only 100ms!

So I hope that if there is any relevant solution, please reply, thank you!

vnium · September 14, 2023, 4:29pm

You might have some nodes without a name attribute, thus making it faster filtering name first.
What result do you get for the following query?

{
    nameCount(func: has(name)){
        count(uid)
    }
    
    keyCount(func: has(workspace_key)){
        count(uid)
    }
}

matthewmcneely · September 14, 2023, 7:45pm

A good tactic to use in these situations is to use multiple query blocks with vars. See: Multiple Query Blocks with DQL - Query language

Basically the idea would be to first create a uid set of for your regexp results, then in a final block, filter the var results by workspace_key. That might be a solution, hard to know without actual data.

qwe1593575564 · September 15, 2023, 12:47am

I tried this, but I just gave a simple example.
Firstly, if I give a not regexp, this method will be difficult again.
Secondly,if I match too many uids with a regexp,I can’t use this var results in finnal block.
Thirdly,as I said above, I just gave a simple demo. In fact, multiple regular expressions, equals, or other comparisons will be used. Maybe I will write the dql is :

{
m (func: eq(resource_key, "requirement"))@filter(eq(workspace_key,"_xGWV7")  AND( (( eq(is_for_commercial,true) OR eq(is_for_commercial,true) )AND gt(total_autocase_number,10)  AND ( regexp(requirement_purpose,/use/i)  AND ge(total_testcase_number,50) AND lt(total_testcase_number,1001)) )  OR ((eq(requirement_delivery_type,"1") OR eq(requirement_delivery_status,"6"))AND regexp(qa,/wANL/i) AND NOT regexp(tse,/zeng/i) AND regexp(requirement_id,/R72/i) AND regexp(market_requirement_id,/xGW7/i) and regexp(requirement_delivery_sm,/zhang/i) AND NOT eq(tse,"chase")) OR (regexp(testcase_url,/https:\/\/www.baidu.com.\/rdcloud\/workspaces\/xGW7\/apps\/test\/testlib\/xGW7/i) AND regexp(qa,/wanl/i)) AND NOT eq(total_autocase_number,0) AND NOT eq(total_testcase_number,0) AND eq(requirement_delivery_status,"6") )
  OR( regexp(name,/SMF/i) AND (regexp(tse,/xx/i) OR regexp(tse,/bb/i))AND (regexp(requirement_delivery_sm,/cxb/i) OR regexp(requirement_delivery_sm,/wuxj/i))AND regexp(requirement_id,/R7.2/i)) 
  ) @recurse(depth:20){
  id
  name
  resourceKey : resource_key
  comment
  total_autocase_number
  total_testcase_number
  is_for_commercial
  market_requirement_id
  requirement_delivery_sm
  requirement_delivery_status
  requirement_delivery_type
  requirement_id
  requirement_purpose
  target_market
  tse
  delivery_report_url
  qa
  solution_specification_url
  test_design_url
  testcase_url

	parent :~parent_to_child @facets
  }
}

qwe1593575564 · September 15, 2023, 12:48am

And here is All schema

 `<_basic_info>: string @index(term, trigram) .
<attachment>: [uid] .
<chafen>: string .
<comment>: string @index(term, trigram) .
<create_by>: string @index(term, trigram) .
<create_time>: datetime @index(hour) .
<creator>: string .
<data>: uid .
<delivery_report_url>: string @index(term, trigram) .
<description>: string @index(term, trigram) .
<dgraph.drop.op>: string .
<dgraph.graphql.p_query>: string @index(sha256) .
<dgraph.graphql.schema>: string .
<dgraph.graphql.xid>: string @index(exact) @upsert .
<full_key>: string .
<gongnengshuoming>: string .
<id>: string @index(term) .
<is_for_commercial>: bool @index(bool) .
<last_update_by>: string @index(term, trigram) .
<last_update_time>: datetime @index(hour) .
<library_id>: string @index(term) .
<market_requirement_id>: string @index(term, trigram) .
<modifier>: string .
<name>: string @index(term, trigram) .
<op>: string .
<parent_to_child>: [uid] @reverse .
<peizhi>: string .
<qa>: string @index(term, trigram) .
<rdc>: string .
<rdc_areapath>: string .
<rdc_belong_product>: string .
<rdc_released_version>: string .
<rdc_requirement_priority>: string .
<rdc_workitem_type>: string @index(term, trigram) .
<relation>: [uid] @reverse .
<requirement_apply_location>: string @index(term, trigram) .
<requirement_delivery_sm>: string @index(term, trigram) .
<requirement_delivery_status>: string @index(term, trigram) .
<requirement_delivery_type>: string @index(term, trigram) .
<requirement_id>: string @index(term, trigram) .
<requirement_purpose>: string @index(term, trigram) .
<requirement_specification_url>: string @index(term, trigram) .
<requirement_time>: string @index(term, trigram) .
<requirement_title>: string @index(term, trigram) .
<requirement_version>: string @index(term, trigram) .
<resource_key>: string .
<rich_text>: string @index(term, trigram) .
<scene_factor_keyword>: string @index(term, trigram) .
<scene_mml>: string @index(term, trigram) .
<scene_test_check_point>: string @index(term, trigram) .
<scene_test_step>: string @index(term, trigram) .
<secondType>: string .
<shangyong>: string .
<shengxiaojiancha>: string .
<signaling_stream>: string @index(term, trigram) .
<solution_specification_url>: string @index(term, trigram) .
<target_market>: string @index(term, trigram) .
<test_design_url>: string @index(term, trigram) .
<testcase_url>: string @index(term, trigram) .
<third_party_id>: string .
<third_party_key>: string .
<third_party_url>: string @index(term, trigram) .
<title>: string @index(term, trigram) .
<total_autocase_number>: int @index(int) .
<total_testcase_number>: int @index(int) .
<tse>: string @index(term, trigram) .
<type>: string .
<version>: string .
<workspace_key>: string @index(exact) .
type <dgraph.graphql> {
	dgraph.graphql.schema
	dgraph.graphql.xid
}
type <dgraph.graphql.persisted_query> {
	dgraph.graphql.p_query
}`

qwe1593575564 · September 15, 2023, 9:05am

In my database, every nodes have name and workspace_key.

Damon · September 15, 2023, 3:00pm

Thanks, que, for adding this as a bug also: [How to use regexp in @filter]: <regexp use> · Issue #8996 · dgraph-io/dgraph · GitHub . We need to fix this to make things easy to use.

Meantime, I suggest using some rewriter as a workaround. How do you currently build the DQL, and is there an intermediate form of the query?

Parsing the DQL is harder but possible (simply find regexp(…) in the filter and swap it with func:somefunc(…) in the root query in text. If you post that script to the community it would be helpful.

If you have a query layer above with a list of conditions or an AND/OR tree structure or similar, it would be easier to programmatically ensure the longest regexp is used as func:regexp() at root.

qwe1593575564 · September 20, 2023, 12:47am

Thanks for your reply. We are currently using the regexp result as the intermediate form, but if the intermediate variable produces too many results, it will still be invalid in the final query.
So when will this bug be fixed?

Topic		Replies	Views
Regexp search in graphql is not using index, and is doing a full scan, dql regexp is working properly GraphQL kind:bug	0	361	June 16, 2021
Regex to match @filter(...) Dgraph	4	544	August 15, 2020
Cannot filter by regexp GraphQL	2	790	November 16, 2020
Not to use trigrams when filtering by regexp Dgraph	8	923	January 18, 2019
How to filter accurately? Dgraph	0	271	September 15, 2023

How to use regexp in @filter

Related Topics