Checking capitalisation with regex

Hi Dgraph People :slight_smile:

I am trying to match string properties starting with capital letters, but I’m struggling to find a regex that fits the criteria for a trigram query.

Obvious non-trigram solutions include /^[A-Z]/, /^[A-Z].*/, and /^[A-Z].*$/, all of which fail with “Regular expression is too wide-ranging and can't be executed efficiently.” on efd1742 (the current head of master).

There are a few seemingly undocumented (though possibly obvious) limitations that make this regex impossible:

  • ^ and $ do not count towards matching runes. You can tell because /^a$/ is too wide-ranging. So we have to include two of . or [a-zA-Z] after the initial [A-Z] to match a full trigram.
  • Wildcards in the middle of a set of three characters make the query too wide-ranging: /aa.a/ and /aa.*a/ are both too wide-ranging, while /aaa/ is fine. Perhaps this is expected behaviour, but based on the documentation /aa.*a/ surely matches the trigram “aaa”, so I don’t see why it would fail.
  • Three-character regexes have to have an extremely limited range of possibilities: /aa[A-U]/ is too wide-ranging, even though it can only match 21 possibilities. So /[A-Z][A-Za-z][A-Za-z]/ and /[A-Z]../ seem totally out of the question.
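To put numbers on the last bullet, here is a rough Python sketch that counts how many distinct trigrams a three-position pattern can expand to. It is only arithmetic on character classes; the helper names are made up for illustration, and it does not model how Dgraph itself decides that a regex is too wide.

```python
from math import prod


def char_range(first, last):
    """Return every character from first to last, inclusive."""
    return [chr(code) for code in range(ord(first), ord(last) + 1)]


def trigram_count(classes):
    """Count the trigrams a pattern can match.

    classes holds one list of allowed characters per position.
    """
    return prod(len(allowed) for allowed in classes)


upper = char_range("A", "Z")
letters = upper + char_range("a", "z")

# /aa[A-U]/: two fixed characters followed by one of 21 letters.
narrow = trigram_count([["a"], ["a"], char_range("A", "U")])

# /[A-Z][A-Za-z][A-Za-z]/: 26 * 52 * 52 combinations.
wide = trigram_count([upper, letters, letters])

print(narrow)  # 21
print(wide)    # 70304
```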

Is there any hope to get this kind of query working? Should I do some kind of post processing instead?
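For reference, the post-processing option could look something like the Python sketch below: fetch the string values with whatever query does work, then filter them client-side. The sample values and the function name are hypothetical, and the check only covers ASCII capitals, matching the [A-Z] class used above.

```python
import re

# Anchored pattern: the first character must be an ASCII capital letter.
STARTS_WITH_CAPITAL = re.compile(r"^[A-Z]")


def filter_capitalised(values):
    """Return only the strings that start with an ASCII capital letter."""
    return [value for value in values if STARTS_WITH_CAPITAL.match(value)]


# Hypothetical values standing in for a string predicate fetched from Dgraph.
names = ["Alice", "bob", "Carol", "", "élan", "Zed"]
capitalised = filter_capitalised(names)

print(capitalised)  # ['Alice', 'Carol', 'Zed']
```

The cost of this approach is that every candidate value has to be transferred to the client before it can be filtered.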

The “regex is too wide” error is thrown when the number of uids that the regex matches exceeds 1 million. This limit is hardcoded right now, though we could make it configurable. So essentially, it also depends on your data set.
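As a purely illustrative Python sketch of the kind of check described in this reply (not Dgraph's actual code; the constant, exception and function names are assumptions), the limit applies to the number of matched uids, with the threshold passed in as a parameter to show what a configurable version might look like:

```python
# Figure quoted in the reply above; the constant name is hypothetical.
MAX_MATCHED_UIDS = 1_000_000


class RegexTooWideError(Exception):
    """Raised when a regex matches more uids than the limit allows."""


def check_match_count(matched_uids, limit=MAX_MATCHED_UIDS):
    """Return matched_uids unchanged, or raise if there are more than limit."""
    if len(matched_uids) > limit:
        raise RegexTooWideError(
            f"regex matched {len(matched_uids)} uids, limit is {limit}"
        )
    return matched_uids
```

Under this model the same regex can pass on a small data set and fail on a large one, which is why the error depends on the data as well as on the pattern.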

We could remove that limit altogether. The worst that would happen is a jump in memory usage, which could potentially cause an OOM, but that depends upon how much memory is allocated to Dgraph. So, it’s not something we can judge on behalf of the user.

Yeah it would definitely help to be able to configure the limit.

Also, it seems the “regex is too wide” error is thrown in another case, when the query is match-all (I assume that means .?) or match-none: https://github.com/dgraph-io/dgraph/blob/master/worker/trigram.go#L116. Is there a workaround for that? Would I have to do my own post processing?

Sure, we could allow the ability to configure the limit. I am not very sure about . but can get that checked. I would recommend filing an issue so that it can be tracked.
