In my DB I have inserted multiple Offices, which have multiple jobs and multiple employees.
An employee can be assigned multiple jobs.
I am interested in a query that returns the office with the maximum employee head count. Note that, since employees can be assigned multiple jobs, a unique (distinct) count at the office level is required.
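For illustration, a naive traversal (with placeholder predicates office.jobs, job.employees, and name standing in for my actual schema) counts an employee once per job rather than once per office:

```
{
  offices(func: type(Office)) {
    name
    office.jobs {
      count(job.employees)  # an employee holding two jobs in this office is counted twice
    }
  }
}
```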
Unfortunately no, this would give me all employees, and I am primarily interested in the number of employees per office. With the query you provided I lose the information about which employee works in which office. Even if I went from there using reverse edges, I would still have to hop over the job edge, again arriving at a different answer than the one I am looking for.
Getting the cardinality of the @groupby array (Unique count of edges - #3 by the42) is what I am looking for: it represents the number of distinct employees per office.
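Something along these lines (again with my placeholder predicate names): grouping the jobs of one office by their employee edge yields one group per distinct employee, so the cardinality of the returned @groupby array is exactly the distinct head count:

```
{
  office(func: eq(name, "Office A")) {
    name
    office.jobs @groupby(job.employees) {
      count(uid)  # one group per distinct employee; the number of groups is the head count
    }
  }
}
```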
Michel’s query will work well per office, so if you query a single office at root, you can get employees per job.
What you want is a groupby of employees by office. For that to work you’d need an additional Office --> Employee edge. I feel it makes logical sense to have this edge anyway, and it will probably simplify a lot of queries.
So this won’t work as is. And I’m not sure it’s possible to do in a single query; probably not.
Maybe you can keep an officeEmployees predicate, and then use it in queries.
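With such a predicate in place, a sketch of the max-head-count query (officeEmployees as suggested above; name is again a placeholder) could be:

```
{
  var(func: type(Office)) {
    c as count(officeEmployees)  # distinct by construction: an edge is stored at most once
  }

  topOffice(func: uid(c), orderdesc: val(c), first: 1) {
    name
    headCount: val(c)
  }
}
```

Since Dgraph stores an edge between two nodes only once, officeEmployees gives you the distinct employee count for free.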
Coming from an SQL background, graph thinking is still new to me. I am evaluating a graph database as a means to implement some sort of “data lake” for my organisation. It feels natural to use a graph database, as it readily supports interlinking between entities, sparse objects, and schema-free operation.
However, performing operations on data structures does not come naturally with Dgraph. So instead of ELT I would have to do ETL, i.e. transform the data before loading it in order to get answers. In my opinion this violates two “big data” principles:

- Extract - Load - Transform = load the data first (instead of ETL in a BI world), transform while querying.
- The algorithm should come to the data, not the data to the algorithm.

Sure, I can always extract data into a third-party tool and perform further aggregations there. But I think it is better for the algorithm to come to the data than to send gobs of data over the network.
The problem with SQL is that adding a new relation can be problematic. In Dgraph you can add such an edge quickly, at essentially no cost.
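For instance, it is one schema alter plus one triple per office/employee pair (officeEmployees as above; the uids are placeholders):

```
# Alter the schema to add the new predicate:
officeEmployees: [uid] .

# Then link each office to its employees with a set mutation:
{
  set {
    <0xf11> <officeEmployees> <0xf21> .
  }
}
```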
I think your problem could also be solved with a foreach loop, which is only a proposal for now. You could then loop over each office, get the employees for each one, and aggregate/range at the end.
You would basically be looping a query like this, which I think gets what you need for a single office.
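A minimal sketch of such a per-office query, assuming placeholder predicates office.jobs and job.employees and an index on name, could look like this:

```
{
  var(func: eq(name, "Office A")) {
    office.jobs {
      e as job.employees  # uid variables are sets, so duplicates collapse
    }
  }

  uniqueEmployees(func: uid(e)) {
    count(uid)
  }
}
```

Because e is a set of uids, an employee reached through several jobs is collected only once, which gives the distinct count the question asks for.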