Suggestions to Improve GroupBy

Related discussions
How to groupby date - #7 by labs20
Running total (cumulative sum) by date - #4 by MichelDiz
What is the technical reason we can't store variables of @groupby unless it uses "uid"?

The GroupBy function today only works by UID, but this term “group by” is used in other database technologies in an expressive and unrestricted way. Where we can group results by values and not only relationships. This is a very important feature. And it seems that a single step is missing for the Dgraph also support it. For Dgraph seems to be able to do this.

The query below works perfectly. Where I group by a predicate containing attributes. Taking the example “First name”. It could be “datetime (hour, day, month, year)”, “Score”, “age”, “Tag”, “status (New, updated, outdated, deleted, refurbished / status code” and so on.

Of course the “first name” example is very hypothetical to be used in real life. Consider the other possibilities.

However, the query does not allow me to create a second block. To which I would treat the results better. And get a desired result.

{
 q(func: has(group_name)) {
    group_name
    member @groupby(first_name) {
    Total_first_name : count(uid)
    }
  }
}
{
  "data": {
    "q": [
      {
        "group_name": "Some Group",
        "member": [
          {
            "@groupby": [
              {
                "first_name": "Michel",
                "Total_first_name": 6
              }
            ]
          }
        ]
      },
      {
        "group_name": "Some Group 2",
        "member": [
          {
            "@groupby": [
              {
                "first_name": "Daniel",
                "Total_first_name": 3
              },
              {
                "first_name": "Lucas",
                "Total_first_name": 3
              }
            ]
          }
        ]
      }
    ]
  }

The desired result

Update in FOREACH func in DQL (loops in Bulk Upsert) - #5 by MichelDiz

{
 var(func: has(group_name)) {
    member @groupby(first_name) {
    T as count(uid)
    }
  }

q(func: uid(T)) {
  first_name
  last_name
  Total_first_name : val(T)
}

}

The Result

{
  "data": {
    "q": [
      {
        "Michel": [ 
              {
                "Total_first_name": 6
              },
              {
                "uid": "0x1",
                "first_name": "Michel"
              },
              {
                "uid": "0x6",
                "first_name": "Michel"
              },
              {
                "uid": "0x8",
                "first_name": "Michel"
              },
              {
                "uid": "0x9",
                "first_name": "Michel"
              },
              {
                "uid": "0xf",
                "first_name": "Michel"
              },
              {
                "uid": "0xa",
                "first_name": "Michel"
              }
      ]
      },
      {
        "Daniel": [
              {
                "Total_first_name": 3
              }, 
              {
                "uid": "0x2",
                "first_name": "Daniel"
              },
              {
                "uid": "0xf22",
                "first_name": "Daniel"
              },
              {
                "uid": "0xf4",
                "first_name": "Daniel"
              }
      ]
      },
      {
        "Lucas": [ 
              {
                "Total_first_name": 3
              },
              {
                "uid": "0x3",
                "first_name": "Lucas"
              },
              {
                "uid": "0xf44",
                "first_name": "Lucas"
              },
              {
                "uid": "0xf77",
                "first_name": "Lucas"
              }
      ]
      }
    ]
  }

Schema and Mutation

<member>: [uid] @count @reverse .

{
  set {
    
    _:InGroup <group_name> "Some Group" .
    _:InGroup <member> _:u .
    _:InGroup <member> _:u2 .
    _:InGroup <member> _:u3 .
    _:InGroup <member> _:u4 .
    _:InGroup <member> _:u5 .
    _:InGroup <member> _:u6 .
    
    _:u <user> "" .
    _:u <first_name> "Michel" .
    _:u <last_name> "A" .
    
    _:u2 <user> "" .
    _:u2 <first_name> "Michel" .
    _:u2 <last_name> "B" .
    
    _:u3 <user> "" .
    _:u3 <first_name> "Michel" .
    _:u3 <last_name> "C" .
    
    _:u4 <user> "" .
    _:u4 <first_name> "Michel" .
    _:u4 <last_name> "D" .
    
    _:u5 <user> "" .
    _:u5 <first_name> "Michel" .
    _:u5 <last_name> "E" .
    
    _:u6 <user> "" .
    _:u6 <first_name> "Michel" .
    _:u6 <last_name> "F" .
  
    _:InGroup2 <group_name> "Some Group 2" .
    _:InGroup2 <member> _:uB .
    _:InGroup2 <member> _:u2B .
    _:InGroup2 <member> _:u3B .
    _:InGroup2 <member> _:u4B .
    _:InGroup2 <member> _:u5B .
    _:InGroup2 <member> _:u6B .
    
    _:uB <user> "" .
    _:uB <first_name> "Lucas" .
    _:uB <last_name> "A" .
    
    _:u2B <user> "" .
    _:u2B <first_name> "Lucas" .
    _:u2B <last_name> "B" .
    
    _:u3B <user> "" .
    _:u3B <first_name> "Lucas" .
    _:u3B <last_name> "C" .
    
    _:u4B <user> "" .
    _:u4B <first_name> "Daniel" .
    _:u4B <last_name> "D" .
    
    _:u5B <user> "" .
    _:u5B <first_name> "Daniel" .
    _:u5B <last_name> "E" .
    
    _:u6B <user> "" .
    _:u6B <first_name> "Daniel" .
    _:u6B <last_name> "F" .


  }
}
1 Like

There was a question from a user where they need to groupby a string predicate in order to find duplicate IDs in the system. The query looked something like this for the following schema:

accountType: string .
idVal: string .
{
 var(func: has(accountType)) @groupby(idVal) {
   cnt as count(uid)
}
 # Find duplicate idVal strings
 data(func: ge(val(cnt),2)) {
   expand(all)
 }
}

But this query returns this error:

Vars can be assigned only when grouped by UID attribute

I suggested that they can do this “count >= 2” filtering from the client-side after receiving the query response.

1 Like

Just updating:

The user “AugustHell” reminded me of this post

See his interactions:

Group by nested fields · Issue #1562 · dgraph-io/dgraph · GitHub ,

1 Like

One example as a reminder:

Taking https://tour.dgraph.io/master/schema/2/ as reference
I’ve edited this sample, adding multiple users with the same name, but keeping the companies unique. So, company Y has two (or more) users with the same name (namesake).

@animesh2049 the @normalize could work with groupby? this would be nice.

Doing a query:

{
  var(func: anyofterms(name, "Alexei Jack Ivy Zoe Jose")) @groupby(works_for) {
   a as  count(uid)
  }

  q(func: uid(a), orderdesc: val(a)) {
    company_name:name
    workers : val(a)
    works : ~works_for @normalize @groupby(name) { 
       count(uid)
    }
  }
}

Desired Result

This structure comes doing this query

"works": [
          {
            "@groupby": [

But the desired result is

{
  "data": {
    "q": [
      {
        "company_name": "CompanyABC",
        "workers": 6,
        "works": [
              {
                "name": "Ivy",
                "count": 2
              },
              {
                "name": "Jack",
                "count": 2
              },
              {
                "name": "Zoe",
                "count": 2
              }
        ]
      },
      {
        "company_name": "The other company",
        "workers": 4,
        "works": [
              {
                "name": "Alexei",
                "count": 2
              },
              {
                "name": "Jose",
                "count": 2
              }
        ]
      }
    ]
  }
}
1 Like

Adding one more use case for this.

This is an unusual use case

Dmai needed to find homonyms and then isolate them(filter them). So there’s no way to do that, find duplicated entities by their value.
One “hacky” way of doing it was this:

{
	"set": [{
			"name": "Film B",
			"actors": [
				{
					"name": "Josh"
				},
				{
					"name": "Josh"
				},
				{
					"name": "Lucas"
				}
			]
		}
	]
}
{
  q2(func: eq(name, "Film B")) {
    actors @groupby(name) {
      count(uid)
    }
  }
}

Result

{
  "data": {
    "q2": [
      {
        "@groupby": [
          {
            "name": "Lucas",
            "count": 1
          },
          {
            "name": "Josh",
            "count": 2
          }
        ]
      }
    ]
  }
}

So this way we can find the duplicate ones and count the duplicated ones. But unfortunately, we can’t get their uid which would be very handy. This improvement aims that.

1 Like

Another Groupby usage.

One thing I find strange about GB’s behavior is that it returns the grouping only that it is not possible to open the values bound to that grouping. Unless of course, they are children of the group or have a reverse edge.

e.g (Dataset in the end):

This query

{
  variableblock(func: type(Person)) @groupby(studied_at) {
    a as count(uid)
  }
  
  data(func: uid(a)){
    uid
    name
    total: val(a)
  }
}

Returns

{
  "data": {
    "variableblock": [
      {
        "@groupby": [
          {
            "studied_at": "0x4e22",
            "count": 2
          },
          {
            "studied_at": "0x4e21",
            "count": 3
          }
        ]
      }
    ],
    "data": [
      {
        "uid": "0x4e21",
        "name": "Main City School",
        "total": 3
      },
      {
        "uid": "0x4e22",
        "name": "Other City School",
        "total": 2
      }
    ]
  }
}

Okay, but I can’t expand the nodes I have in mind that belong to these groups.
The only way to do so is turning studied_at as reverse edge ~studied_at { name } Which is not good, because potentially I would have to put everything as reverse. And that consumes disk space.

So the idea here would be to use expand (val (a)) as the custom block idea intended.

e.g:

{
  var(func: type(Person)) @groupby(studied_at) {
    a as count(uid)
  }
  
  q(func: uid(a)){
    uid
    name
    jobTitle
    total: val(a)
    in_this_group: expand(val(a))
  }
}

And the response would be like

{
	"data": {
		"q": [{
				"uid": "0x4e21",
				"name": "Main City School",
				"total": 3,
				"in_this_group": [{
						"uid": "0x4e23",
						"name": "Jane Doe"
					},
					{
						"uid": "0x4e24",
						"name": "Lucas Doe",
						"jobTitle": "Student"
					},
					{
						"uid": "0x4e25",
						"name": "Olivia Doe",
						"jobTitle": "Student"
					}
				]
			},
			{
				"uid": "0x4e22",
				"name": "Other City School",
				"total": 2,
				"in_this_group": [{
						"uid": "0x4e26",
						"name": "Noah Doe",
						"jobTitle": "Student"
					},
					{
						"uid": "0x4e27",
						"name": "Sophia Doe",
						"jobTitle": "Student"
					}
				]
			}
		]
	}
}

Dataset

{
  "set":[	{
		"uid": "_:School1",
		"dgraph.type": "School",
		"name": "Main City School",
		"telephone": "(425) 123-4567",
		"url": "http://www.maincityschool.com"
	},
 {
		"uid": "_:School2",
		"dgraph.type": "School",
		"name": "Other City School",
		"telephone": "(425) 333-4567",
		"url": "http://www.othercityschool.com"
	},
	{
		"dgraph.type": "Person",
		"name": "Jane Doe",
		"jobTitle": "Student",
		"studied_at": {
			"uid": "_:School1"
		}
	},
	{
		"dgraph.type": "Person",
		"name": "Lucas Doe",
		"jobTitle": "Student",
		"studied_at": {
			"uid": "_:School1"
		}
	},
	{
		"dgraph.type": "Person",
		"name": "Olivia Doe",
		"jobTitle": "Student",
		"studied_at": {
			"uid": "_:School1"
		}
	},
	{
		"dgraph.type": "Person",
		"name": "Noah Doe",
		"jobTitle": "Student",
		"studied_at": {
			"uid": "_:School2"
		}
	},
	{
		"dgraph.type": "Person",
		"name": "Sophia Doe",
		"jobTitle": "Student",
		"studied_at": {
			"uid": "_:School2"
		}
	}
]
}
1 Like

We already support group by scalar feat(DQL): @groupby on scalar fields and count duplicate by minhaj-shakeel · Pull Request #7746 · dgraph-io/dgraph · GitHub

Just need to check if it is the same improvements suggested.