Hi,
We have internet network data (packets sent, packets received, round-trip times, packet loss, etc.) that is highly connected with locations (countries, cities, etc.), devices (iOS, desktop), apps (YouTube, Reddit) & of course time! For example, this is the schema:
name: string @index(hash) .
alias: string @index(hash) .
country: string @index(hash) .
state: string @index(hash) .
city: string @index(hash) .
ztm: dateTime @index(hour) .
apm_rtt: int .
packets_sent: int .
packets_received: int .
industry: uid @reverse .
apptag: uid @reverse .
stat.key: uid @reverse .
stat.isp: uid @reverse .
stat.app: uid @reverse .
stat.loc: uid @reverse .
stat.time: uid @reverse .
stat.os: uid @reverse .
At scale, there are millions of "apm_rtt", "packets_sent" and "packets_received" triples connected to far fewer uid predicates ("stat.key", "stat.isp", "stat.app", "stat.time"), because we collect that info every minute. In a nutshell, it is highly connected time-series data.
So the query below took ~6 seconds (because there are too many ~stat.key edges to walk, one for each minute):
{
  var(func: eq(name, "Unclassified")) @cascade {
    ~industry {
      ga_stats as ~stat.key {
        stat.app @filter(eq(name, "Google Analytics")) {}
      }
    }
  }

  var(func: uid(ga_stats)) @groupby(stat.key) {
    ar as avg(apm_rtt)
  }

  answer(func: uid(ar)) {
    alias
    avgRtt: val(ar)
  }
}
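For reference, the query above can be run over Dgraph's HTTP endpoint with something like the snippet below. This is just a sketch: it assumes the query is saved as query.dql and that an Alpha is reachable at a placeholder <alpha-url>:8080; newer Dgraph versions accept Content-Type: application/dql, while older ones use application/graphql+-.

#!/bin/bash
# Run the Dgraph query above over HTTP. query.dql and <alpha-url> are placeholders.
curl -s -X POST -H "Content-Type: application/dql" --data-binary @query.dql "http://<alpha-url>:8080/query"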
I tried AWS Neptune with the exact same data (~42 million RDF triples), and the same query in Neptune took >2 minutes and kept timing out:
#!/bin/bash
sq='PREFIX : <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?key (AVG(?rtt) AS ?avgRtt)
WHERE {
  ?ind :name "Unclassified"^^xsd:string .
  ?ind ^:industry ?key .
  ?key ^:stat.key/:apm_rtt ?rtt .
  ?key ^:stat.key/:stat.app/:name ?app .
  FILTER contains(?app, "Google Analytics")
}
GROUP BY ?key
'
curl -X POST --data-urlencode "query=$sq" http://<cluster-url>.us-east-1.neptune.amazonaws.com:8182/sparql
Now, I've been able to speed things up by pre-processing & compressing the thousands of datapoints collected over an hour into one datapoint. But this is just delaying the problem.
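Roughly, that hourly rollup looks something like the sketch below. It is illustrative only: it assumes a recent Dgraph version whose /mutate endpoint accepts RDF upsert blocks with Content-Type: application/rdf, assumes each per-minute stat node carries its ztm timestamp, and uses a made-up hourly_rtt predicate plus placeholder hour boundaries and <alpha-url>. I'm also not certain every version lets a @groupby value variable be used inside an upsert mutation, so take the exact syntax with a grain of salt.

#!/bin/bash
# Sketch of an hourly rollup via a Dgraph upsert block: average apm_rtt per stat.key
# over one hour and store it on the key node under a (made-up) hourly_rtt predicate.
# The hour boundaries and <alpha-url> are placeholders.
up='upsert {
  query {
    var(func: has(apm_rtt)) @filter(ge(ztm, "2019-11-05T10:00:00Z") AND lt(ztm, "2019-11-05T11:00:00Z")) @groupby(stat.key) {
      hr as avg(apm_rtt)
    }
  }
  mutation {
    set {
      uid(hr) <hourly_rtt> val(hr) .
    }
  }
}'
curl -s -X POST -H "Content-Type: application/rdf" --data-binary "$up" "http://<alpha-url>:8080/mutate?commitNow=true"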
We only need to keep a time window in the graph (between times t1 & t2), so I am considering 2 possibilities:
- keep 1 graph but delete/clean up nodes & edges older than time t1 (roughly sketched after this list)
- create a sub-graph per time window, where each sub-graph corresponds to a particular time t^i (so within a sub-graph there's no concept of time; the only focus is on the relationships between the data)
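For the first option, the clean-up could be an upsert-block delete along these lines. Again just a sketch: it assumes a recent Dgraph version whose /mutate endpoint accepts RDF upsert blocks, and the t1 cutoff and <alpha-url> are placeholders.

#!/bin/bash
# Sketch of option 1: delete every node (and all its outgoing edges) whose ztm is older than t1.
# The cutoff timestamp and <alpha-url> are placeholders.
del='upsert {
  query {
    old as var(func: lt(ztm, "2019-11-05T00:00:00Z"))
  }
  mutation {
    delete {
      uid(old) * * .
    }
  }
}'
curl -s -X POST -H "Content-Type: application/rdf" --data-binary "$del" "http://<alpha-url>:8080/mutate?commitNow=true"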
What are the pros & cons of each approach? And is Dgraph more efficient working with one big graph or with lots of smaller sub-graphs?
Regards