Idea: When using the live loader to import large amounts of data, offer a flag --cooldown [seconds Int].
TL;DR: Honor live loader completion over loading speed.
How this helps me: My environment is running on the recommended 8 GB of RAM, but when I try to load 6 million rows of data with the live loader, Alpha crashes with an OOM at around 60% completion. When Alpha crashes it does restart, but the live loader stops processing with a “transport is closing” message. If there were a way to limit the input rate and pause every X seconds to let Alpha process what has already been submitted, I believe the load would finish with the expected results. I would rather have a run that completes than raw throughput without completion.
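For illustration only, here is a minimal sketch (in Go, the language Dgraph and its loaders are written in) of how a --cooldown pause could sit in a loader's send loop. The batch stub, sendBatch, and the exact semantics (pause for the cooldown duration after every cooldown seconds of loading) are assumptions, not the actual live loader code.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// --cooldown as proposed: an integer number of seconds. One possible
// interpretation: after every `cooldown` seconds of loading, pause for the
// same duration so Alpha can catch up on what has already been submitted.
var cooldown = flag.Int("cooldown", 0, "seconds of loading between pauses; 0 disables")

// sendBatch stands in for the live loader's real mutation call over gRPC.
func sendBatch(batch []string) {
	fmt.Printf("sent %d records\n", len(batch))
}

func main() {
	flag.Parse()
	pause := time.Duration(*cooldown) * time.Second

	// Batches would normally come from the loader's chunker; this is a stub.
	batches := [][]string{
		{`<_:a> <name> "alice" .`},
		{`<_:b> <name> "bob" .`},
	}

	lastPause := time.Now()
	for _, b := range batches {
		sendBatch(b)
		if pause > 0 && time.Since(lastPause) >= pause {
			time.Sleep(pause) // cool down and let Alpha flush
			lastPause = time.Now()
		}
	}
}
```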
Alternatively: a flag --memory-limit [Megabyte Int] could be implemented to watch memory consumption and throttle the input rate to what that memory can sustain. This would give a better user experience: a user could set a memory threshold, and the live loader would use up to that amount and then throttle its input to stay under the threshold, avoiding OOM issues.
How to Throttle: I am not the expert here, but I would think that throttling the chunk size is one attainable solution; adding a timeout between chunks may be another option.
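A rough sketch of memory-based throttling between chunks, combining the two ideas above. The memoryInUse helper, the threshold, and throttledSend are all hypothetical; note also that this reads the loader's own heap, whereas the OOM described above happens in Alpha, so a real implementation would need Alpha to report its memory usage.

```go
package main

import (
	"runtime"
	"time"
)

// memoryLimitBytes is the hypothetical --memory-limit value, already parsed
// into bytes (e.g. from "6G"). Illustrative only.
const memoryLimitBytes = 6 << 30

// memoryInUse is a stand-in for "the figure we throttle on". Here it reads
// this process's own heap via runtime.ReadMemStats; in practice the figure
// that matters is Alpha's memory.
func memoryInUse() uint64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapInuse
}

// throttledSend blocks until memory is back under the limit, backing off
// exponentially, then sends the chunk.
func throttledSend(chunk []string, send func([]string)) {
	backoff := 100 * time.Millisecond
	for memoryInUse() > memoryLimitBytes {
		time.Sleep(backoff)
		if backoff < 5*time.Second {
			backoff *= 2
		}
	}
	send(chunk)
}

func main() {
	chunks := [][]string{
		{`<_:a> <name> "alice" .`},
		{`<_:b> <name> "bob" .`},
	}
	for _, c := range chunks {
		throttledSend(c, func([]string) { /* gRPC mutation would go here */ })
	}
}
```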
USE CASE: Allow larger data sets while testing in free-tier environments. Running Dgraph on a single AWS free-tier host is limited to 1 GB of RAM. If there were a way to do a prolonged live load into such a test environment, it would save costs for testing. In a test environment, speed is not essential, but completion of an import is.
USE CASE: In a production environment it is not always possible to scale RAM up or down on bare-metal machines. A user may find that an 8 GB machine is more than enough to run their Dgraph environment, but when they try to use the live loader, 8 GB is not sufficient. In that case the bare-metal machine would have to be shut down, fitted with more physical hardware if any is available, and started back up. This is not always an option for some servers. These users should have access to a throttled live loader that works on their existing machine, without needing to build a new machine that will be 75% overkill 95% of the time. That is a lot of wasted physical resources, not to mention the cost involved.
If we did the --cooldown version, how would I know what to set it to? I would definitely lean towards the --memory-limit one. Or maybe even optionally allow --memory-limit 90% or something like that. Maybe it could detect the suffix on the value, so it would accept things like 50M, 8G, 93%, etc. That would let you deploy the same container config on multiple machines, for example, and have it adjust automatically to the memory available on whichever machine it is deployed to.
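A rough sketch of how such suffix detection could work; parseMemLimit and its percent-of-total behaviour are assumptions for illustration, not an existing Dgraph flag parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemLimit turns values like "50M", "8G", or "93%" into a byte count.
// Percentages are taken relative to totalBytes (the machine's available RAM,
// however the caller chooses to measure it).
func parseMemLimit(s string, totalBytes uint64) (uint64, error) {
	s = strings.TrimSpace(strings.ToUpper(s))
	if strings.HasSuffix(s, "%") {
		p, err := strconv.ParseFloat(strings.TrimSuffix(s, "%"), 64)
		if err != nil || p <= 0 || p > 100 {
			return 0, fmt.Errorf("invalid percentage %q", s)
		}
		return uint64(float64(totalBytes) * p / 100), nil
	}
	mult := uint64(1)
	switch {
	case strings.HasSuffix(s, "G"):
		mult, s = 1<<30, strings.TrimSuffix(s, "G")
	case strings.HasSuffix(s, "M"):
		mult, s = 1<<20, strings.TrimSuffix(s, "M")
	case strings.HasSuffix(s, "K"):
		mult, s = 1<<10, strings.TrimSuffix(s, "K")
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory limit %q", s)
	}
	return n * mult, nil
}

func main() {
	total := uint64(8) << 30 // pretend the machine has 8 GB
	for _, v := range []string{"50M", "8G", "93%"} {
		b, _ := parseMemLimit(v, total)
		fmt.Printf("%s -> %d bytes\n", v, b)
	}
}
```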
I would go for the memory limit, which is more practical than a trial-and-error cooldown value. To be clear: the memory-limit flag is for the Alpha node, right?
Any Dgraph process (zero, alpha, live, bulk) should honour such a limit and make the most of the memory, but not exceed it. It can fall back to hard disk but should not rely on swap. The live and bulk loaders also like to OOM quickly when given lots of files. Memory requirements should not scale with the dataset; otherwise the size of your dataset is limited by the size of your machine’s RAM, which is not desirable.
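Since all of these processes are Go binaries, one generic building block is the Go runtime's soft memory limit (Go 1.19+), set via runtime/debug.SetMemoryLimit or the GOMEMLIMIT environment variable. It only makes the garbage collector work harder as the heap approaches the limit; it does not spill to disk and does not guarantee OOM is avoided, so it would complement rather than replace input throttling. The 6 GiB value below is illustrative.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// A soft limit the Go runtime tries to respect: GC runs more aggressively
	// as the heap approaches it. It is not a hard cap.
	const limit = 6 << 30 // 6 GiB, illustrative only

	prev := debug.SetMemoryLimit(limit)
	fmt.Printf("soft memory limit set to %d bytes (was %d)\n", limit, prev)

	// Equivalent without a code change: GOMEMLIMIT=6GiB ./dgraph alpha ...
}
```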
While it would be ideal for memory not to scale with the dataset, there has to be some kind of limitation. Making queries and loading the schema requires memory: more data, deeper queries, more filters, and a larger schema all lead to needing more memory.
I don’t think there is any way to completely decouple the size of a dataset from the amount of memory available.
However, I think it still needs optimization so that the memory requirement scales proportionally rather than exponentially. For example, if a dataset of 10 million requires 8 GB (just fabricating numbers here), then a dataset of 20 million should require 16 GB and not 64 GB.
Again, I am not saying the numbers above are accurate; they are just based on what I have seen discussed.