Wrong filter section design

hi all.

this issue is addressed more to the developer team than to support, so if @mrjn or @dmai have time to take a look, it would be great.

we have a large number of items and need to explore relationships while applying filters to the nodes reached over outgoing edges. we are actually building a library management system for the US/Europe, but it is easier to explain our problem with a person/friends test schema.

so, for example, we have billions of persons with a structure like:

{
	set {
		_:person1 <person_id> "person1" .
		_:person1 <name> "John" .
		_:person1 <age> "17" .
		_:person1 <sex> "M" .

		_:person2 <person_id> "person2" .
		_:person2 <name> "Martin" .
		_:person2 <age> "19" .
		_:person2 <sex> "M" .

		_:person3 <person_id> "person3" .
		_:person3 <name> "Peter" .
		_:person3 <age> "22" .
		_:person3 <sex> "M" .

		_:person4 <person_id> "person4" .
		_:person4 <name> "Melissa" .
		_:person4 <age> "17" .
		_:person4 <sex> "F" .
		
		_:person1 <has_friend> _:person2 .
		_:person1 <has_friend> _:person3 .
		_:person1 <has_friend> _:person4 .
	}
}

we create an index on the person_id predicate so that we can find persons by their ids.

person_id: string @index(hash) .
age: int .

now we need to find a person with id=“person1” (John) and display all his friends. it works ok:

{
	get_friends(func: eq(person_id, "person1")) {
		has_friend {
			person_id
			name
		}
	}
}

{
    "data": {
        "get_friends": [
            {
                "has_friend": [
                    {
                        "person_id": "person2",
                        "name": "Martin"
                    },
                    {
                        "person_id": "person3",
                        "name": "Peter"
                    },
                    {
                        "person_id": "person4",
                        "name": "Melissa"
                    }
                ]
            }
        ]
    }
}

now we want to filter his friends by age, to display only friends with age >= 18. all friends have already been found via the outgoing has_friend edge, so we just need to filter the returned nodes:

{
	get_friends(func: eq(person_id, "person1")) {
		has_friend @filter(ge(age, 18)) {
			person_id
			name
		}
	}
}

{
    "errors": [
        {
            "code": "ErrorInvalidRequest",
            "message": ": Attribute age is not indexed."
        }
    ],
    "data": null
}

and we get this error. why do we need an index here? we don’t need to retrieve all age edges with value >= 18, because there are billions of them; we just need to filter the nodes that were already found. an index is completely redundant here, and i am afraid that even if i add it, all nodes with age >= 18 will be fetched (otherwise why would it be needed here?). the same error appears if i use eq or the other inequality functions.

if i want to find John’s male friends (sex “M”), i also need an index, and i am afraid that if i create it, all persons with sex “M” will be retrieved instead of just John’s friends being filtered:

{
	get_friends(func: eq(person_id, "person1")) {
		has_friend @filter(eq(sex, "M")) {
			person_id
			name
		}
	}
}

{
    "errors": [
        {
            "code": "ErrorInvalidRequest",
            "message": ": Attribute sex is not indexed."
        }
    ],
    "data": null
}

the same problem occurs when i want to find all of John’s friends whose name starts with the letter “M”. there is an additional complication here: even if i create a trigram index (which shouldn’t be needed in a filter at all), i cannot use regular expressions shorter than 3 letters, because the trigram index cannot be used in that case:

{
	get_friends(func: eq(person_id, "person1")) {
		has_friend @filter(regexp(name, /^M.*/)) {
			person_id
			name
		}
	}
}

{
    "errors": [
        {
            "code": "ErrorInvalidRequest",
            "message": ": Predicate name is not indexed"
        }
    ],
    "data": null
}

name: string @index(trigram) .

{
    "errors": [
        {
            "code": "ErrorInvalidRequest",
            "message": ": Regular expression is too wide-ranging and can't be executed efficiently."
        }
    ],
    "data": null
}

but it’s just a filter: i only need to filter the 3 nodes i already have.

my opinion is that a filter should be a filter. it shouldn’t use an index and retrieve all nodes matching the filter condition (we have billions of them).
i hope that this problem can be fixed.
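
to make the difference concrete, here is a toy python sketch of the two strategies i am contrasting. it is only a model of my concern, not Dgraph's actual implementation; the names PERSONS, FRIENDS, filter_via_index and filter_in_place are made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set

# Toy data: uid -> attributes, mirroring the four persons in the example.
PERSONS: Dict[int, dict] = {
    1: {"name": "John", "age": 17, "sex": "M"},
    2: {"name": "Martin", "age": 19, "sex": "M"},
    3: {"name": "Peter", "age": 22, "sex": "M"},
    4: {"name": "Melissa", "age": 17, "sex": "F"},
}
# John's outgoing has_friend edges.
FRIENDS: Dict[int, List[int]] = {1: [2, 3, 4]}


def filter_via_index(candidates: Iterable[int], index_matches: Set[int]) -> List[int]:
    """Strategy A: start from every uid the index returned, then intersect
    with the candidates. The cost grows with the size of index_matches."""
    return sorted(set(candidates) & index_matches)


def filter_in_place(candidates: Iterable[int], predicate: Callable[[dict], bool]) -> List[int]:
    """Strategy B: check the condition only on the already-found nodes.
    The cost grows with the number of candidates; no index is involved."""
    return [uid for uid in candidates if predicate(PERSONS[uid])]


# Stand-in for what an index lookup on age >= 18 would return:
# every matching uid in the whole data set (billions in our real case).
all_adults = {uid for uid, person in PERSONS.items() if person["age"] >= 18}

via_index = filter_via_index(FRIENDS[1], all_adults)
in_place = filter_in_place(FRIENDS[1], lambda person: person["age"] >= 18)
print(via_index, in_place)
```

both strategies return the same uids; the only difference is how much data has to be touched to get there, which is exactly what worries me at our scale.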

thank you

any updates here?
if you don’t want to change the @filter behavior, maybe it makes sense to add another clause like @noindexfilter, i don’t know. or maybe fall back to plain filtering instead of an index lookup only when the attribute in the filter condition is not indexed.
but the ability to filter already found nodes without retrieving all nodes from an index (as with filtering by sex) is a must-have feature, and it’s possible in other graph databases like JanusGraph (and other dbs supporting gremlin), Neo4j and others…

For now that proposal is not viable. Open an issue if you want to track this proposal. Today the team is focused on things of immediate need (like bugs), and there are still many features in the queue to be worked on.

Also, this filter could be done at the application level.
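
A minimal sketch of what that could look like in Python, assuming the query is extended to also return age for each friend and that age comes back as an integer. The function name filter_friends is hypothetical; the response shape follows the get_friends result shown earlier in the thread.

```python
from typing import Any, Dict, List


def filter_friends(response: Dict[str, Any], min_age: int) -> List[Dict[str, Any]]:
    """Keep only the friends whose age is at least min_age.

    Friends without an age value are dropped. An error response
    (where "data" is null) yields an empty list.
    """
    kept: List[Dict[str, Any]] = []
    data = response.get("data") or {}
    for person in data.get("get_friends", []):
        for friend in person.get("has_friend", []):
            age = friend.get("age")
            if age is not None and age >= min_age:
                kept.append(friend)
    return kept


# Response shaped like the get_friends result, with age added (an assumption).
response = {
    "data": {
        "get_friends": [
            {
                "has_friend": [
                    {"person_id": "person2", "name": "Martin", "age": 19},
                    {"person_id": "person3", "name": "Peter", "age": 22},
                    {"person_id": "person4", "name": "Melissa", "age": 17},
                ]
            }
        ]
    }
}

adult_friends = filter_friends(response, 18)
print([friend["name"] for friend in adult_friends])  # prints ['Martin', 'Peter']
```

Note that this only works on the final result: every friend still has to be sent to the client before it can be filtered.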

Cheers.

i am not sure it’s a good idea to filter such data on the client, since i can have a few thousand nodes and need only 1 of them.
and filtering at the application level in the middle of a query is simply impossible.

for me, the current filter implementation contains bugs, because it doesn’t do what it should do. looking up attributes like “sex” in an index when i only need to filter already fetched nodes is weird.

i created an issue with a proposal like this, limited to regular expressions, but you closed it and suggested creating a topic here. so it’s an infinite loop.

i still hope that someone from the core team, like @dmai, will say something on this…

The most you can do today is similar to this
https://github.com/dgraph-io/dgraph/issues/2304

it’s a completely different issue

I have the same problem, and it confuses me, because according to the documentation the root function should retrieve the nodes, and logically the subsequent filters should act only on those nodes, as in almost every other db.

I have a predicate that I use for validation, to be sure that the node is owned by a user, but the second filter asks for an index. So even if my root query returns only one node, every inner filter has to search the full index, and since the same id appears in many nodes, the root function's restriction has to be applied again for every inner filter … I'm not sure how this works, and it confuses me because there is no clear explanation anywhere.

This can be done easily on my client but it does not feel right…

Well, if Dgraph has the potential to do it, your code could do it too. That’s a point.

Please list these bugs; we are working to fix them. If you have bugs that do not need discussion to determine whether they are bugs, please open an issue directly.

Well, aren't they the same thing? Your post is being reviewed by Gus, and I believe he has given an opinion on it.

I’m sorry if this bothers you; it’s an important policy to keep the GitHub issues as clean as possible, for better productivity and focus.

It may be, but it’s what you have available for now.

Gentlemen,

(@orlandoco, makitka)

We are here to discuss whether the proposal is feasible and valid, and whether it will be built at all. Dgraph has a small team that is at full capacity. As I explained above, at the moment we must work on merit, not on wishes.

Anyway, we are open to receiving valid PRs that solve this.

Another important point worth mentioning: if you look at the Dgraph code, you will notice that functions follow a design pattern that requires the predicate to be indexed. Anything outside that pattern is separate work, and putting work into it takes time. That’s what I’m trying to explain.

Maybe there is a trick we do not know of that can do this for you. But overall, if you need to economize on indexing, you should use an application-level filtering mechanism when handling the response.

But obviously, my responding this way does not mean this will not be evaluated. We evaluate everything that is discussed here. Your feedback is very important.

Cheers.

i understand that the team is small, and of course i am not asking for this to be done right now.

i just want the team to agree that the current implementation is not correct (or to say that you think it is correct and the filter works exactly as you designed it) and that it will be fixed in the near or not so near future. that would be enough for me; in the meantime we will try to find a workaround to deal with this problem.

anyway, it looks like a time bomb, because a lot of people who use filters don’t understand how they actually work, and they work well only while the amount of data is small.

found a very weird but working workaround. it uses an edge with facets: filtering on facets doesn’t require an index and doesn’t perform redundant, slow lookups. the edge points from the node to itself (since filtering on value edge facets is not supported):

entity_key: string @index(exact) .

_:node <entity_key> "entity1" .
_:node <attrs> _:node (name="Nikita", sex="m") .

{
	get_entity(func: eq(entity_key, "entity1")) @cascade {
		uid
		attrs @facets(eq(name, "Nikita") and eq(sex, "m")) @facets(name, sex) # male
	}
}

{
    "data": {
        "get_entity": [
            {
                "uid": "0x1",
                "attrs": [
                    {
                        "attrs|name": "Nikita",
                        "attrs|sex": "m"
                    }
                ]
            }
        ]
    }
}

{
	get_entity(func: eq(entity_key, "entity1")) @cascade {
		uid
		attrs @facets(eq(name, "Nikita") and eq(sex, "f")) @facets(name, sex) # female now
	}
}

{
    "data": {
        "get_entity": []
    }
}

btw, seems like it’s not so hard to implement filtering by edge values since it’s already done for edge facet values.
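
if you want to generate such self-loop triples from code, here is a small python sketch. the helper name self_loop_facet_nquad is made up, it only handles plain string values, and it simply reproduces the line format of the mutation above; it does not talk to Dgraph.

```python
from typing import Dict


def self_loop_facet_nquad(node: str, edge: str, attrs: Dict[str, str]) -> str:
    """Build one N-Quad line for an edge from a blank node to itself,
    carrying the given string attributes as facets."""
    parts = []
    for key, value in attrs.items():
        # Escaping rules are out of scope for this sketch, so reject
        # values that would need them instead of guessing.
        if '"' in value or "\\" in value:
            raise ValueError(f"unsupported character in facet value for {key!r}")
        parts.append(f'{key}="{value}"')
    facets = ", ".join(parts)
    return f"_:{node} <{edge}> _:{node} ({facets}) ."


line = self_loop_facet_nquad("node", "attrs", {"name": "Nikita", "sex": "m"})
print(line)
```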

btw, I opened an issue to track this proposal, as you @MichelDiz suggested: “Filtering is slow on large amount of data” (Issue #2713, dgraph-io/dgraph on GitHub).
