Just a crazy thought here… please bear with me @core-devs
What if we fingerprint predicates to uint64 as well? There are a lot of super-long predicates, so a lot of key space seems wasted, unless RocksDB does some prefix-based compression, which is not unlikely.
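To make this concrete, here is a minimal sketch of one way to get a uint64 fingerprint from a predicate string, by truncating a SHA-1 digest. This is just an illustration; fingerprint64 is a hypothetical name, not whatever hash the assigner actually uses:

import hashlib
import struct

def fingerprint64(pred):
    # Truncate a SHA-1 digest to 8 bytes and read it as a little-endian
    # uint64. Any well-distributed 64-bit hash would do; collisions across
    # a few thousand distinct predicates are vanishingly unlikely.
    return struct.unpack('<Q', hashlib.sha1(pred).digest()[:8])[0]

Every predicate key then costs a fixed 8 bytes instead of the full string.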
Imagine 21M edges stored in the worst possible way:
pred, sub, obj
with each field taking 8 bytes. (Assumption: values don't take much more than 8 bytes…) This takes only
21e6 * 8 * 3 / (1024 * 1024) ≈ 480 MB
A lot of edges are going to share the same (pred, sub), so the real footprint is probably even smaller, say 240M.
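As a quick sanity check of that arithmetic:

print '%.0f MB' % (21e6 * 8 * 3 / (1024 * 1024))  # -> 481 MB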
Say we do convert predicates to fingerprints. Even then, this test case seems too small, as it can fit in memory in theory. Maybe we should focus on an in-memory solution and reach even crazier speeds. I have a feeling that if everything is in memory, loading will take under a minute.
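To sketch what I mean (all names here are hypothetical, not actual loader internals): an in-memory posting store could be as simple as a map keyed by (pred fingerprint, subject uid), which also shows where the shared (pred, sub) savings come from:

from collections import defaultdict

# Hypothetical in-memory posting store: (pred_fp, sub_uid) -> [obj_uid, ...].
# Each edge costs three uint64s plus container overhead, and edges sharing
# the same (pred, sub) pair share a single key.
postings = defaultdict(list)

def add_edge(pred_fp, sub_uid, obj_uid):
    postings[(pred_fp, sub_uid)].append(obj_uid)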
Add: By the way, I tried increasing the RAM limit to 32G and the loader didn't speed up by much. I have a feeling our solution uses only a bit more than 4G. That said, 4G sounds like a lot given that the data is probably representable in ~240M…
Update:
I took all your suggestions (@ashwin95r, @mrjn). I concatenated names and rdf-films (call this the combo dataset), then replaced predicates with much shorter strings (call this the comboless dataset), and re-ran the assigner and loader. The assigner took the same time on both datasets. The loader took almost 8 minutes for comboless and over 11 minutes for combo.
Side observation: I set the memory limit to 4G, but observed memory usage go up to 11G. Not sure why.
I encourage you all to try out the loader and see if you observe any speedup. Here is the script to generate combo and comboless:
gunzip -kf names.gz
gunzip -kf rdf-films.gz
cat names rdf-films > combo
wc -l combo
python process.py combo comboless
wc -l comboless
gzip -kf combo
gzip -kf comboless
Here is the Python script process.py (which shortens predicates and runs pretty fast):
import sys

assert len(sys.argv) == 3
input_file, output_file = sys.argv[1:3]

count = 0
pred = {}  # Maps each original predicate to its short replacement.
with open(output_file, 'w') as fout:
    with open(input_file) as fin:
        for s in fin:
            count += 1
            if (count % 1000000) == 0:
                print 'Lines processed %d' % count
            # Each line is an N-Triple: subject, predicate, object, '.'.
            s = s.strip()
            tok = [x.strip() for x in s.split('\t')]
            assert len(tok) == 4
            assert tok[-1] == '.'
            # Swap the predicate for a short name like <p1f>, assigning a
            # new one the first time each predicate is seen.
            if tok[1] in pred:
                tok[1] = pred[tok[1]]
            else:
                t = '<p%x>' % len(pred)
                pred[tok[1]] = t
                tok[1] = t
            fout.write('\t'.join(tok) + '\n')

# Print the mapping from original predicates to their short names.
keys = sorted(pred.keys())
for i, k in enumerate(keys):
    print '%d: %s: %s' % (i, k, pred[k])
If you want to run the assigner and loader, I usually do the following:
rm -Rf m p u
FILE=comboless.gz # Or combo.gz.
RAM=4096
time dgraphassigner -stw_ram_mb $RAM --numInstances 1 --instanceIdx 0 \
  --rdfgzips $FILE --uids u
time dgraphloader -stw_ram_mb $RAM --numInstances 1 --instanceIdx 0 \
  --rdfgzips $FILE --uids u --postings p