Need some suggestions on correctly modelling data as a graph



Hi,
We have internet network data (packets sent, packets received, round-trip times, packet loss etc.) that is highly connected with locations (countries, cities etc.), devices (iOS, desktop), apps (YouTube, Reddit) & of course time! For example, this is the schema:

	name: string @index(hash) .
	alias: string @index(hash) .
	country: string @index(hash) .
	state: string @index(hash) .
	city: string @index(hash) .
	ztm: dateTime @index(hour) .

	apm_rtt: int .
	packets_sent: int .
	packets_received: int .

	industry: uid @reverse .
	apptag: uid @reverse .
	stat.key: uid @reverse .
	stat.isp: uid @reverse .
	stat.app: uid @reverse .
	stat.loc: uid @reverse .
	stat.time: uid @reverse .
	stat.os: uid @reverse .

At scale, there are millions of “apm_rtt”, “packets_sent” and “packets_received” values connected to far fewer uid predicates (“stat.key”, “stat.isp”, “stat.app”, “stat.time”), because we’re collecting that info every minute. In a nutshell, it is highly connected time-series data.

So the query below took ~6 seconds (because there are so many ~stat.key edges to traverse, one for each minute):

{
  var(func: eq(name, "Unclassified")) @cascade {
    ~industry {
      ga_stats as ~stat.key {
        stat.app @filter(eq(name, "Google Analytics")) {}
      }
    }
  }
  var(func: uid(ga_stats)) @groupby(stat.key) {
    ar as avg(apm_rtt)
  }
  answer(func: uid(ar)) {
    alias
    avgRtt: val(ar)
  }
}

I tried AWS Neptune with the exact same data (~42 million RDF triples) & the equivalent query in Neptune took >2 minutes & kept timing out:

#!/bin/bash

sq='PREFIX : <http://xmlns.com/foaf/0.1/>
SELECT ?key (AVG(?rtt) AS ?avgRtt)
WHERE {
 ?ind :name "Unclassified"^^<xs:string> .
 ?ind ^:industry ?key .
 ?key ^:stat.key/:apm_rtt ?rtt .
 ?key ^:stat.key/:stat.app/:name ?app .
 FILTER contains(?app, "Google Analytics") .
}
GROUP BY ?key
'

curl -X POST --data-urlencode "query=$sq" http://<cluster-url>.us-east-1.neptune.amazonaws.com:8182/sparql

Now, I’ve been able to speed things up by pre-processing & compressing the thousands of datapoints collected over an hour into 1 datapoint. But this is just delaying the problem.
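For reference, the hourly pre-processing mentioned above is roughly the following. This is a minimal Python sketch, not our exact pipeline: the field names come from the schema, but the input record shape, the `rollup_hourly` name, and the choice to average `apm_rtt` while summing the packet counters are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def rollup_hourly(samples):
    """Collapse per-minute samples into one datapoint per (key, hour).

    Each sample is assumed to be a dict with "key", "ztm" (a datetime),
    "apm_rtt", "packets_sent" and "packets_received".
    """
    buckets = defaultdict(lambda: {"rtt_sum": 0, "n": 0, "sent": 0, "received": 0})
    for s in samples:
        # Truncate the timestamp to the start of its hour.
        hour = s["ztm"].replace(minute=0, second=0, microsecond=0)
        b = buckets[(s["key"], hour)]
        b["rtt_sum"] += s["apm_rtt"]
        b["n"] += 1
        b["sent"] += s["packets_sent"]
        b["received"] += s["packets_received"]

    return [
        {
            "key": key,
            "ztm": hour,
            "apm_rtt": b["rtt_sum"] / b["n"],  # mean RTT over the hour
            "samples": b["n"],  # kept so means can be re-weighted later
            "packets_sent": b["sent"],
            "packets_received": b["received"],
        }
        for (key, hour), b in sorted(buckets.items(), key=lambda kv: kv[0])
    ]
```

Keeping the sample count next to the mean means that hourly datapoints can later be merged into coarser ones without losing the correct weighted average.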

We only need to keep a time window in the graph (between times t1 & t2). So I am considering 2 possibilities:

  • have 1 graph but delete/clean up nodes & edges that are older than time t1
  • create sub-graphs, where each sub-graph corresponds to a particular time t_i (so within a sub-graph there’s no concept of time, the only focus is on relationships between data)
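To make the two options concrete, here is a toy in-memory model in Python. It is only a sketch of the bookkeeping involved, not Dgraph API: the class names `SingleGraph` and `BucketedGraphs`, the integer timestamps, and the fixed `bucket_size` are all made up for illustration.

```python
from collections import defaultdict


class SingleGraph:
    """Option 1: one store; old stat nodes are deleted individually."""

    def __init__(self):
        self.stats = []  # list of (timestamp, stat) pairs

    def add(self, ts, stat):
        self.stats.append((ts, stat))

    def prune(self, t1):
        # Visits every stat node to find the ones older than t1.
        before = len(self.stats)
        self.stats = [(ts, s) for ts, s in self.stats if ts >= t1]
        return before - len(self.stats)

    def query(self, t1, t2):
        return [s for ts, s in self.stats if t1 <= ts < t2]


class BucketedGraphs:
    """Option 2: one sub-graph per time bucket; old buckets are dropped whole."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(list)  # bucket index -> list of (ts, stat)

    def add(self, ts, stat):
        self.buckets[ts // self.bucket_size].append((ts, stat))

    def prune(self, t1):
        # Drops only buckets that end at or before t1, without filtering inside them.
        expired = [b for b in self.buckets if (b + 1) * self.bucket_size <= t1]
        return sum(len(self.buckets.pop(b)) for b in expired)

    def query(self, t1, t2):
        # A window that spans several buckets has to be merged across them.
        first, last = t1 // self.bucket_size, (t2 - 1) // self.bucket_size
        return [
            s
            for b in range(first, last + 1)
            for ts, s in self.buckets.get(b, [])
            if t1 <= ts < t2
        ]
```

In this toy model the trade-off is that the single store can prune at any t1 but has to scan for old nodes, while the bucketed version drops a whole bucket at once but can only prune at bucket boundaries and has to fan a query out over every bucket in the window. Whether the same trade-off holds inside Dgraph is exactly what I’m unsure about.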

What are the pros & cons of each approach? And is Dgraph more efficient working with one big graph vs lots of sub-graphs?

Regards