Don't require a trigram index when filtering by regexp


(Nikita Zaletov) #1

It’s strange that using a regexp expression in the filter section still requires creating a trigram index.
A filter only works on already fetched data; it is not a search by some condition. When a regexp is used in the filter section, it makes sense to just apply the regular expression to each string value and filter out the values that don’t match.

As of now, a trigram index is required, which takes additional space to store. More importantly, it is not possible to use a regular expression like "^f.*" to keep only the values starting with the letter “f”, even though all the nodes have already been found via outgoing edges from some other node.

I found one workaround: add two junk letters to the start of every string value I need to filter with a “starts with” condition, so the "^f.*" condition becomes "^AAf.*". It works, but it looks really ugly.
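
To illustrate the three-character minimum, here is a rough sketch that assumes a trigram index is built from overlapping three-character substrings. The `trigrams` helper is hypothetical and is not Dgraph's code; the real extraction from a regular expression is more involved.

```go
package main

import "fmt"

// trigrams returns every overlapping three-character substring of s.
// Hypothetical helper for illustration only, not Dgraph's implementation.
func trigrams(s string) []string {
	r := []rune(s)
	var out []string
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, string(r[i:i+3]))
	}
	return out
}

func main() {
	// The literal part of "^f.*" is just "f": too short for any trigram.
	fmt.Println(trigrams("f")) // []
	// The literal part of "^AAf.*" is "AAf": exactly one trigram.
	fmt.Println(trigrams("AAf")) // [AAf]
}
```

Under that assumption, the unpadded pattern gives the index nothing to look up, while the padded one gives it a single trigram.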

So my question is: why not use Go's regular “regexp” package when processing a regular expression in the “filter” section, and use the trigram index only when it appears in the “func” one?
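
A minimal sketch of what is meant by plain post-filtering, using only Go's standard `regexp` package. The `filterByRegexp` function and the sample values are made up for illustration and say nothing about how Dgraph is structured internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterByRegexp keeps only the values that match pattern.
// The pattern is compiled once and applied to values that were
// already fetched, so no index is consulted.
// Hypothetical helper, not part of Dgraph.
func filterByRegexp(values []string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var kept []string
	for _, v := range values {
		if re.MatchString(v) {
			kept = append(kept, v)
		}
	}
	return kept, nil
}

func main() {
	// Pretend these values came from following edges of some node.
	fetched := []string{"foo", "bar", "fizz", "baz"}
	kept, err := filterByRegexp(fetched, "^f.*")
	if err != nil {
		panic(err)
	}
	fmt.Println(kept) // [foo fizz]
}
```

The cost here is one regexp match per fetched value, so it grows with the size of the result being filtered rather than with the size of the whole dataset.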


(Michel Conrado (Support Engineer)) #2

@gus what do you think about it?


(Gus) #3

The reason you can bypass the check with junk is a bug in the 3rd party package. We could add support to match all instead of exact matching. I don’t think we need to change regexp packages yet. I’ll check.


(Nikita Zaletov) #4

Actually, my question was not about why the junk check passes, but about why we need a trigram index and a 3-character limit for regexp in the filter section, when we are just filtering already fetched predicate values.


(Nikita Zaletov) #5

mmmmmmmm… any updates here?


(Gus) #6

Sorry I couldn’t get back to you sooner. I’m reopening the original issue and will investigate. I don’t know why regexp isn’t used for filters, but I think it’s worth checking. My gut feeling is that regexps are very slow, not to mention that they can use lots of memory.

If others could weigh in on this, it would be great.

Ref: https://github.com/dgraph-io/dgraph/issues/2565


(Nikita Zaletov) #7

Good news! Since it’s filtering only, I don’t see any performance issues here. Nearly every database (Elasticsearch, for example) has the ability to post-filter returned data by regexp.