I tried running dgraphloader to load names.gz and rdf-films.gz entirely from ramfs. It took 2min 24s, roughly a 2x speedup over running on SSD. The RAM limit is set to 4G, but at times usage goes beyond that, close to 6G.
Then I wrote a very short C++ program that parses the RDFs and loads them into a map / balanced tree. The key is a pair<uint64, uint64> of predicate and source UID. The value is either a string, for attribute values, or a set<uint64> representing a posting list. The program is single-threaded; there is no parallelization. It also writes all the data out in binary format to SSD. However, it assumes the inputs are already unzipped; unzipping takes about 4s on my machine. The program itself took 24s, so let's just say that overall it took under 30s. It used up to 2.8% of my 64G RAM, which is about 1.8G.
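For concreteness, here is a minimal sketch of the in-memory structure described above; this is not the actual program, just an illustration assuming C++17 (for std::variant), and the helper names add_edge and add_attr are mine:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <variant>

// Key: (predicate id, source UID).
using Key = std::pair<uint64_t, uint64_t>;

// Value: either an attribute string, or a posting list of destination UIDs.
using Posting = std::set<uint64_t>;
using Value = std::variant<std::string, Posting>;

// The whole dataset lives in one balanced tree (std::map is a red-black tree).
using Store = std::map<Key, Value>;

// Record an edge (src --pred--> dst) by inserting dst into the posting list.
void add_edge(Store& store, uint64_t pred, uint64_t src, uint64_t dst) {
    Value& v = store[{pred, src}];
    if (!std::holds_alternative<Posting>(v)) v = Posting{};
    std::get<Posting>(v).insert(dst);
}

// Record an attribute value (src --pred--> "literal").
void add_attr(Store& store, uint64_t pred, uint64_t src, std::string val) {
    store[{pred, src}] = std::move(val);
}
```

After the parse loop fills the Store this way, writing it out is a single sequential sweep over the map, which is the "load everything, then write once" shortcut mentioned below.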
This little C++ program is by no means a fair comparison with dgraphloader. It doesn't scale, and it cheats by loading everything into memory before writing it out once, unlike an LSM tree. The main point of the exercise is to gauge the "theoretical best performance" we can aim for. Since the program doesn't do any parallelization, the theoretical best is probably under 20s.