SELECT DISTINCT in Dgraph

PS. I’m using v20.03.3

In SQL <SLC DIST> is limited to selecting values from a row that are not the same. In Dgraph, this concept changes a little bit. Since we don’t have columns and rows, not every SQL function makes sense in a GraphDB. In the case of SLC DIST, it is possible, but with some modifications in the dataset.

We can use Dgraph’s GroupBy to achieve a result very similar to the desired one with SLC DIST. But for that, when migrating your dataset from SQL to Dgraph, you need to transform the value you want to select as “DISTINCT” as an entity instead of a value. GroupBy does not group by value for now.

To properly migrate a value to an entity I suggest using Upsert Block. Or if you have a huge dataset you could use OpenRefine software that you can make data mining and fix the data itself.

e.g.:

Let’s consider this dataset:

{
   "set": [
      {
         "userID": "1",
         "name": "Thomas A. Anderson",
         "Country": "Matrix"
      },
      {
         "userID": "2",
         "name": "Agent Smith",
         "Country": "Matrix"
      },
      {
         "userID": "3",
         "name": "Arthur Schopenhauer",
         "Country": "Germany"
      },
      {
         "userID": "4",
         "name": "Till Lindemann",
         "Country": "Germany"
      },
      {
         "userID": "5",
         "name": "Michel",
         "Country": "Brazil"
      },
      {
         "userID": "6",
         "name": "Andreas Rudolf Kisser",
         "Country": "Brazil"
      },
      {
         "userID": "7",
         "name": "Derrick Green",
         "Country": "United States"
      }
   ]
}

Run this Upsert 7 times:

upsert {
  query {
    q(func: has(Country), first:1) @filter(NOT has(migrated)) {
      v as uid
      C as Country
    }
    q2(func: eq(name, val(C))) {
      val(C)
      u2 as uid
    }
    tmp(){
    Ctmp as sum(val(C))
    }
  }

  mutation @if(eq(len(u2), 0) AND eq(len(v), 1)) {
    set {
      _:New <name> val(Ctmp) .
      _:New <dgraph.type> "Country" .
      uid(v) <livesIN> _:New .
      uid(v) <migrated> "" .
    }
  }

  mutation @if(eq(len(u2), 1) AND eq(len(v), 1)) {
    set {
      uid(v) <livesIN> uid(u2) .
      uid(v) <migrated> "" .
    }
  }
}

You can use this query to check the dataset

{
  me(func: has(name)) @filter(NOT type(Country)) {
    uid
    name
    Country
    livesIN {
      name
    }
  }
}

The comparison

Considering the previous dataset, in SQL it would do like:

SELECT DISTINCT Country FROM something;

Since we don’t have Tables, we could consider “FROM” related to Dgraph Type. Anyway, we are not working wih type for now. In Dgraph this SQL would be converted to (Of course, after migrating the value to the entity as mentioned previously):

{
  var(func: has(name)) @groupby(livesIN) {
      a as count(uid)
  }

  SELECT_DISTINCT(func: uid(a), orderdesc: val(a)) {
    name
    total : val(a)
    total_unique : count(uid)
  }
}

The total_unique value would be in SQL

SELECT COUNT(DISTINCT Country) FROM something;

Result

{
  "data": {
    "SELECT_DISTINCT": [
     {
        "total_unique": 4
      },
      {
        "name": "Matrix",
        "total": 2
      },
      {
        "name": "Germany",
        "total": 2
      },
      {
        "name": "Brazil",
        "total": 2
      },
      {
        "name": "United States",
        "total": 1
      }
    ]
  },
}
1 Like

Three ideas we can explore:

  1. Support group by with values
  2. A function called val_distinct(a) or distinct(a) that operates on a value variable
  3. A function called distinct(val(a)) that operators on values.
1 Like

Is there any update on this? I am trying to generate some dynamic option fields based on queries. I have a query that returns 27K nodes, but only has a few distinct values for the predicate I am using. Trying to do all of this in GraphQL which doesn’t seem possible for this one yet. Has this even made it into DQL yet?

Deja vu, I find myself back here again.

@dmai, has any of these ideas been explored?

UPDATE(for myself next time searching for this): too much to remember sometimes…

Here is the syntax query for the above data:

{
  distinctCountries(func: has(Country)) @groupby(Country) {
    # no need to select any fields, Country will be selected in the group by results
    # if you want to count the duplicates, use: count(uid)
  }
}

This syntax would return the following results:

{
  "data": {
    "distinctCountries": [
      {
        "@groupby": [
          {
            "Country": "Matrix"
          },
          {
            "Country": "Germany"
          },
          {
            "Country": "Brazil"
          },
          {
            "Country": "United States"
          }
        ]
      }
    ]
  }
}

And for reference, this can then be included as a custom GraphQL query using the @remoteResponse directive

2 Likes