Hi, I am fairly new to Dgraph and wanted to confirm whether I am going about this in the best way.
My current problem is that I am trying to deduplicate seasonal fans across multiple teams, to determine how many fans are unique to each team out of all the fans of a set of teams. So far I have not seen ideal response times, and I am not sure whether that is because of how I am modeling the data or something else.
The first data model I used is below (A and B correspond to different data sources):
name: string @index(term) .
fan_s2020_A: [uid] @reverse .
fan_s2020_B: [uid] @reverse .
fan_s2019_A: [uid] @reverse .
fan_s2019_B: [uid] @reverse .

type Person {
  fan_s2020_A
  fan_s2020_B
  fan_s2019_A
  fan_s2019_B
}

type Team {
  name
}
Example query deduplicating fans across two teams:
{
  # Fans of first team for 2020 season, per data source
  var(func: eq(name, "team-0")) {
    ~fan_s2020_A {
      fan_0_A as uid
    }
    ~fan_s2020_B {
      fan_0_B as uid
    }
  }
  # Merge both data sources for first team
  var(func: uid(fan_0_A, fan_0_B)) {
    fans_0 as uid
  }

  # Fans of second team for 2020 season, per data source
  var(func: eq(name, "team-1")) {
    ~fan_s2020_A {
      fan_1_A as uid
    }
    ~fan_s2020_B {
      fan_1_B as uid
    }
  }
  # Merge both data sources for second team
  var(func: uid(fan_1_A, fan_1_B)) {
    fans_1 as uid
  }

  # Unique fan counts of each team
  unique_0_fan(func: uid(fans_0)) @filter(NOT uid(fans_1)) {
    count(uid)
  }
  unique_1_fan(func: uid(fans_1)) @filter(NOT uid(fans_0)) {
    count(uid)
  }

  # Total fan count
  union(func: uid(fans_1, fans_0)) {
    count(uid)
  }
}
As I compare more and more teams, I would just create more var blocks for the other teams and add those variables to the NOT filter (i.e. NOT uid(fans_0, fans_1, ...)) in the unique count queries.
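For concreteness, here is a sketch of how that pattern might look with a third team under the first data model. The team name "team-2" and the variable names fan_2_A, fan_2_B, and fans_2 are hypothetical, chosen only to continue the naming used above; each unique count excludes the fan sets of every other team.

```
{
  # Fans of each team for the 2020 season, per data source
  var(func: eq(name, "team-0")) {
    ~fan_s2020_A { fan_0_A as uid }
    ~fan_s2020_B { fan_0_B as uid }
  }
  var(func: eq(name, "team-1")) {
    ~fan_s2020_A { fan_1_A as uid }
    ~fan_s2020_B { fan_1_B as uid }
  }
  # "team-2" is a hypothetical third team added for illustration
  var(func: eq(name, "team-2")) {
    ~fan_s2020_A { fan_2_A as uid }
    ~fan_s2020_B { fan_2_B as uid }
  }

  # Merge both data sources for each team
  var(func: uid(fan_0_A, fan_0_B)) { fans_0 as uid }
  var(func: uid(fan_1_A, fan_1_B)) { fans_1 as uid }
  var(func: uid(fan_2_A, fan_2_B)) { fans_2 as uid }

  # Unique fan counts: each filter lists the fan sets of all other teams
  unique_0_fan(func: uid(fans_0)) @filter(NOT uid(fans_1, fans_2)) {
    count(uid)
  }
  unique_1_fan(func: uid(fans_1)) @filter(NOT uid(fans_0, fans_2)) {
    count(uid)
  }
  unique_2_fan(func: uid(fans_2)) @filter(NOT uid(fans_0, fans_1)) {
    count(uid)
  }

  # Total deduplicated fan count across all three teams
  union(func: uid(fans_0, fans_1, fans_2)) {
    count(uid)
  }
}
```

With N teams this needs N lookup blocks, N merge blocks, and N count blocks, and each NOT filter lists N-1 variables, so the query text grows roughly quadratically with the number of teams.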
I have also tried modelling this data similarly to what is described in this comment in another thread, where instead of origin as the facet I had data_provider. The schema and example query for that are below:
name: string @index(term) .
fan_s2020: [uid] @reverse .
fan_s2019: [uid] @reverse .
relates_to: [uid] @reverse .

type Person {
  fan_s2020
  fan_s2019
}

type Queue {
  relates_to
}

type Team {
  name
}
{
  # Fans of first team
  var(func: eq(name, "team-0")) {
    ~relates_to {
      ~fan_s2020 {
        fans_0 as uid
      }
    }
  }

  # Fans of second team
  var(func: eq(name, "team-1")) {
    ~relates_to {
      ~fan_s2020 {
        fans_1 as uid
      }
    }
  }

  # Unique fan counts of each team
  unique_0_fan(func: uid(fans_0)) @filter(NOT uid(fans_1)) {
    count(uid)
  }
  unique_1_fan(func: uid(fans_1)) @filter(NOT uid(fans_0)) {
    count(uid)
  }

  # Total fan count
  union(func: uid(fans_1, fans_0)) {
    count(uid)
  }
}
I found that the first data model seemed to perform slightly better than the second for a small number of teams, but as the team count grew, both seemed to perform about the same.
Am I going about this the right way, or should one of these data models outperform the other when the team count is large? Or are my queries/data models not optimized for this? Any feedback would be greatly appreciated.