21 million seems small

Just a crazy thought here… please bear with me @core-devs

What if we fingerprint predicate to uint64 as well? There are a lot of super long predicates and a lot of key space seems wasted, unless rocksdb does some compression based on prefixes, which is not unlikely.

Imagine 21m edges stored in the worst possible way:
pred, sub, obj and each is 8 bytes. (Assumption: Values don’t take too much more than 8 bytes…) This takes only
21e6 * 8 * 3 / (1024 * 1024)

This is only about 480M… A lot of them are going to have the same pred, sub, so it is probably going to be even smaller, say 240M.

Say we do convert predicates to fingerprints. Still, this test case seems too small as it can fit in memory in theory. Maybe we can focus on an in-memory solution and reach even crazier speed. I got a feeling that if everything is in memory, the loading is going to take <1 minute.

Add: By the way, I tried increasing the RAM limit to 32G and loader didn’t speed up by much. I got a feeling that our solution probably uses only a bit more than 4G. That said, 4G sounds a lot given that the data is probably representable in 240M…

Update:

I took all your suggestions (@ashwin95r, @mrjn). I concat names and rdf-films (call this combo dataset), then replace predicates with much shorter strings (call this comboless dataset), and re-run assigner and loader. Assigner took the same time on both datasets. For loader, it took almost 8 mins for comboless and 11min+ for combo.

Side observation: I set the mem limit to 4G, but I have observed that mem usage goes up to 11G. Not sure why.

I encourage you all to try out loader and see if you observe any speedup. Here is the script to generate combo and comboless.

gunzip -kf names.gz
gunzip -kf rdf-films.gz
cat names rdf-films > combo
wc -l combo
python process.py combo comboless
wc -l comboless
gzip -kf combo
gzip -kf comboless

Here is the python script process.py: (which shortens predicates and runs pretty fast)

import sys

assert len(sys.argv) == 3
input_file, output_file = sys.argv[1:3]

count = 0
pred = {}

with open(output_file, 'w') as fout:
  with open(input_file) as fin:
    for s in fin:
      count += 1
      if (count % 1000000) == 0:
        print 'Lines processed %d' % count
      s = s.strip()
      tok = s.split('\t')
      tok = [x.strip() for x in tok]
      assert len(tok) == 4
      assert tok[-1] == '.'
      
      if tok[1] in pred:
        tok[1] = pred[tok[1]]
      else:
        t = '<p%x>' % len(pred)  
        pred[tok[1]] = t
        tok[1] = t
        
      fout.write('\t'.join(tok) + '\n')


keys = sorted(pred.keys())
for i, k in enumerate(keys):
  print '%d: %s: %s' % (i, k, pred[k])

If you want to run loader and assigner, I usually do the following:

rm -Rf m p u

FILE=comboless.gz  # Or combo.gz.
RAM=4096

time dgraphassigner -stw_ram_mb $RAM --numInstances 1 --instanceIdx 0 \
--rdfgzips $FILE --uids u

time dgraphloader -stw_ram_mb $RAM --numInstances 1 --instanceIdx 0 \
--rdfgzips $FILE --uids u --postings p
2 Likes

Just sharing the numbers here. I was running the assigner and loader in my lap and these were the numbers:

cores: 2, ram: 6000M

time output
-----------------
assigner
real	6m56.064s
user	13m31.516s
sys	0m26.764s

loader
real	23m26.958s
user	32m23.264s
sys	3m5.488s

So with loader, the RAM usage reaches 7G (when sw_ram_mb is 6000) many times and the swap space is being used a lot as well. Also, the virtual memory usage for the loader reaches 50G which is a lot. Does’t this happen in your PC? @jchiu

And yeah, converting the predicates to uint64 sounds like a nice idea! we could try that approach and see how it works out.

It goes without saying that anything which doesn’t need to touch disk would be way faster than anything which does. There are in fact a bunch of graph solutions all focused on keeping things in memory, like AllegroGraph: http://franz.com/agraph/allegrograph/ Or, what Redis has done, keeping everything in memory. But, all this is cheating. I strongly believe – any database needs to be persistent, and any data loss is horrid. Mongo got a lot of flak because they claimed about crazy load speeds, but would only keep them in memory, and hence end up losing it (at 1.25 in the video):
https://www.youtube.com/watch?v=b2F-DItXtZs

So, if you want to achieve faster speeds, just put RocksDB on tmpfs. 1.5GB for the 21M RDFs is sufficient. And it would be an order of magnitude faster.

The great thing about RocksDB is that it allows one to tweak how much memory and how much SSD should one allocate to itself. So, one could just set up RocksDB to use lot more RAM and it would in-turn make Dgraph fast. The design of Dgraph is that we don’t use much RAM. And that’s great because then it can be used by RocksDB for it’s page cache.

So, now converting predicates to uint64s: Every time we do a fingerprint, we lose the ability to debug things manually. Every time we do indirection, we complicate stuff. If RocksDB does prefix compression, and that’s worth checking, it won’t save us much. But, we will still have to convert from predicate to UID, and from UID back to predicate for every query, which adds network calls – it adds more logic, so we have to be convinced of the benefits.

2 Likes

I agree with how important persistency is. Speed aside, I still find it interesting how we use up >4G of memory. Is it just because of the huge key space? I wonder what’s the ratio between key space and value space…

Regarding compression, my understanding is that there is no compression at memtable level. Link. There is also no or little compression in L0, L1 sstables.

Regarding fp(pred), if the number of predicates is small and storable in memory (which is the case for films dataset), then the indirection is not too bad. You can easily maintain a map from and to fingerprints and there won’t be additional network calls. If there are tons of different predicates, then this seems like a very weird database?

I think we should confirm that. I find it hard to believe that a data store which keeps keys in sorted order, wouldn’t allow a way to do prefix compression. SSTable used to do them, which is what LevelDB is based on, which in turn is what RocksDB is based on.

I think you’re solely thinking about a single instance. Think about how would this work in a cluster? All our XID <-> UID conversions happen on one instance. Given that we shouldn’t be special casing predicates, that’s the instance which is going to handle predicates as well. Unless, each server in the cluster has to hold the entire list of predicate <-> UID map, which then poses challenges about distributed transactions across the entire cluster.

Again, we don’t need to solve the problem for small data sets. We’re solving problems for much bigger data sets. The smaller ones would automatically take care of themselves.

In fact, we do have the entire freebase data set, which is 1.9B edges.

Given a graphQL query, you get a list of all predicates and XIDs that need translation to UIDs. Since you need to translate the XID anyway, this doesn’t add any additional network calls?

RocksDB uses a skiplist for its memtable. I highly doubt they will do compression over these skiplist keys. If it spills over, then it starts using SSTables. At this point, we are going into SSD which is in abundance… Even then, in their tuning guide, it seems that they advise people to have no compression for L0, L1 sstables. Whether this means there is no prefix compression, I am not sure…

On a different note, I read that RocksDB allows you to define a prefix extractor which will allow them to build some bloom filters around the prefix and possibly speed things up.

True. If we start with XIDs, yes. But, that’s optional. You could start directly with UID. Most future data sets which start with Dgraph would probably only have UIDs, not XIDs. But if we do this for predicate then we always have to do this conversion. No way around it.

I’m not saying this conversion is necessarily a bad thing. I just don’t have enough conviction that it would result in any significant savings worthy of the extra runtime logic.

I think they say that 90% of data lies in Lmax. So, that’s the one which should be compressed. I think if you’re confident this would lead to saving data, I’d suggest picking up the existing 21M RDFs, and loading it using uint64 predicates. Then you can compare this against our standard loader. This would give us the size difference – after we ensure we use compression.

Then we can run @ashwin95r’s query load tester and see how it reacts.

Btw, this is worth a read: https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-and-open-sourcing-rocksdb/10151822347683920/

My wife says that all the time – I think too small. And I often get caught up with small issues.

Yet, it is the desire to bring huge speedup / space savings that is driving these “crazy” ideas… In the past, with mapreduces, keys used to contain super long URLs, and often see mapreduces “fly” when we convert to fingerprints. That kind of feeling is very addictive.

I am not sure. Just chalking up a list of things / future projects. This can be a side thing to try.

2 Likes

I think a good way to systematically look for performance improvements would be to do cpu profiling and memory profiling and see what you could improve there. In general, if nothing Dgraph shows up in the top 10, we’re doing a good job! :smiley:

Actually, this is not a crazy idea. It’s a natural extension of the fact that we fingerprint both subject and objects, ergo, has been thought through in the past. Crazy would be something that we haven’t yet thought about, I guess… :stuck_out_tongue:

Talking about things that we don’t know much about or haven’t tried yet, it’s RocksDB Tuning. That’d be a topic worthy of exploration. I’m sure we can squeeze a lot out of this giant, that we currently don’t know much about.

Yes, RocksDB tuning seems to have huge potential… Much to learn and tweak there!

3 Likes

So, seems like data loading is faster, which sounds good, but that’s generally a one time thing. Two concerns that need to be addressed here:

  1. How does the query execution time get affected on a two or more server cluster? Assuming round robin distribution of queries, where half the time the predicate conversion needs to happen on server with index 0, incurring network calls.
  2. Is the final size of the data on disk different?

(1) (a) One thing that I forgot to mention. If we assume no collision, which I would be inclined to, then there is no need to look up anything for the predicates. Given a query, get the predicates, get the fingerprints, and issue RPCs to get the results. The RPC calls will contain only the fingerprints, not the predicates.

There is also the hope that (b) you will need to make an additional network call to convert xid<->uid anyway, and © more stuff can be cached in memory.

(2) The total space on disk is the same as expected. The total is about 760M.

a. I find it dangerous to assume zero collision. We can’t make that call until we know how many predicates we’re dealing with.

b. Having UIDs for entities is a very important optimization. In complex queries, we’ll have a large result set, involving millions of entities, and hence converting them to ints has significant space savings on disk, RAM and network. Additionally, comparison operators on ints are faster than on strings.

Moreover, if someone starts with Dgraph (as opposed to bringing an existing data set like we’re doing now), they won’t have any XIDs. They’d only have UIDs. In that case, there’s no conversion required to start with the query, which saves one network call for every query on such a system.

So, no. We won’t have to make that network call anyway. Not in native Dgraph setups.

c. RocksDB loads up the entire page as it’s on disk, in memory. So, if the disk usage is similar, the memory usage would be the same. It won’t save us anything in memory at RocksDB level. At Go level, the memory usage is already so little, so not sure what benefit would it have.

So, if we’re not saving on disk usage (and hence RocksDB memory usage), but we will incur additional network calls for a native Dgraph setup, what’s the benefit of doing this?

1 Like

This is somewhat counterintuitive to me, but you are right. There is not enough gain here.

As I read more about RocksDB the coming week, maybe I will revisit point ©. For now, the case is closed :slight_smile:

1 Like

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.