Getting an error while doing live loading

What I want to do

I want to do a live load into a single-node cluster. Although I have done the same thing many times, this time I am getting an error.
Below are the steps I am following.
Live load:

Step 1: Start the Zero node.
  ./dgraph zero --my=localhost:5080

Step 2: Start the Alpha node.
  ./dgraph alpha --my=localhost:7080 --zero=localhost:5080

Step 3: Live load the data into Dgraph.
  ./dgraph live -f rdf-file-path -s schema-file-path --zero=localhost:5080 --alpha=localhost:9080

After starting the live load, it begins processing files, but after 1-2 minutes of processing it gets killed automatically. I don't even get any error on the terminal. Please refer to the attached screenshot.

I tried the same steps on 2-3 different servers but got the same issue.

Can you try with half of the dataset? Have you checked the open-file limit in the OS?

Yes, I have also checked the limit. Earlier I got the error "too many open files", so I increased the limit from 1,024 to 1,000,000 (10 lakh) as well; see the sketch below.
I have already completed a bulk load with the same dataset and it works fine, but I am getting this issue while doing the live load.
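For reference, raising the limit for the shell that starts the loader looks roughly like this (a sketch, assuming Linux; making the limit persistent also needs a "nofile" entry in /etc/security/limits.conf):

  # check the current open-file limit for this shell
  ulimit -n
  # raise it to 1,000,000 for this session, then start the loader from here
  ulimit -n 1000000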

Can anyone please help?

Please share the dataset so I can try to reproduce.

The dataset is 256 GB and it's confidential; I can't share it.
Please give me any suggestions.

What are the stats of your machine? It looks like that's where your problem is.

1024 GB disk and 128 GB RAM.
I have completed a bulk load with the same data on a machine with the same configuration.

No Alpha and Zero logs? Without a way to reproduce and collect data, we can’t find a solution.

Give us some debug logs: https://dgraph.io/docs/howto/retrieving-debug-information/#sidebar - even so, I think we need a way to reproduce it.
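If it helps, Alpha and Zero write their logs to stderr, so one way to capture them is to redirect when starting the nodes (a sketch, not the only way; run each in its own terminal):

  # capture Zero and Alpha logs to files while still printing them
  ./dgraph zero --my=localhost:5080 2>&1 | tee zero.log
  ./dgraph alpha --my=localhost:7080 --zero=localhost:5080 2>&1 | tee alpha.log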

Actually, the live load is consuming 124 GB of memory and then it gets killed. Let me know one thing: if we live load data from a folder of RDF files (i.e. 256 GB) versus from a single rdf.gz file (i.e. 24 GB), will there be any difference in the live load, or do both ways behave the same?

Because if I live load data from the compressed file (i.e. rdf.gz format), it starts loading the data right away, showing "elapsed". Please refer to the screenshot below. It does not consume much memory; only 4-5 GB is used.

If I live load data from a directory of RDF files, it starts processing in a different way: first it shows "processing data". Please refer to the screenshot below.
It takes almost all the memory of my system, and the process is killed after consuming all of it. In my case I have a 124 GiB server, and it took all 124 GiB of memory before the process was killed.

Please let me know the difference between the two approaches: why does it take 124 GiB of memory while loading data from a folder of RDF files, but only 4-5 GiB when reading from an rdf.gz file?

It might be related to the concurrency of loading multiple files at once in the latter case.
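If that is the cause, one crude workaround is to feed the loader one file at a time instead of pointing -f at the whole directory (just a sketch, assuming the files sit directly under a my_data directory):

  # load files sequentially so only one file's buffers are held at a time;
  # re-applying the same schema on each run is harmless
  for f in my_data/*.rdf; do
    ./dgraph live -f "$f" -s schema-file-path --zero=localhost:5080 --alpha=localhost:9080
  done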

Can you give us a memory profile? https://dgraph.io/docs/howto/retrieving-debug-information/#memory-profile
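Something like this should grab a heap profile from a running Alpha (assuming it is listening on the default HTTP port 8080):

  # download a heap profile from the Alpha's pprof endpoint
  curl http://localhost:8080/debug/pprof/heap --output alpha.heap
  # inspect it with Go's pprof tool
  go tool pprof alpha.heap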

Hi,
Yes, this is the concurrency issue.
Can you help me with compressing the data? If I have 256 GB of RDF files in a directory my_data, how can I compress this data into rdf.gz format, e.g. my_data.rdf.gz?
That would solve my problem, because the compressed data can be live-loaded easily.

Just merge all the files into a single RDF file and then gzip it normally.

Can you please suggest how to merge all the files into a single RDF file?

In general we write custom code for this, but you can use tools like grep, sed, or cat * >

That's great! Thank you :slight_smile:
It would be really helpful if you could provide any link for reference, or an example of merging 2-3 RDF files.
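
A minimal version, assuming the files sit directly under my_data and the expanded glob fits within the shell's argument-length limit:

  # concatenate every RDF file and gzip the result in one pass
  cat my_data/*.rdf | gzip > my_data.rdf.gz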

cat * >
This command works for a smaller number of files, but I have 1 million files that need to be merged into a single RDF file.

I need to know one more thing: if I load files directly from a directory, it shows "processing", but when I load from a compressed file it shows "elapsed". What does that mean?
Please refer to the screenshot below.
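
With a million files, the glob expansion in cat * will overflow the kernel's argument-length limit ("argument list too long"), so stream the file names through find and xargs instead. A sketch, assuming the files sit under my_data and the merged output is written outside that directory:

  # find emits each file name NUL-terminated; xargs batches them into as
  # many cat invocations as needed, all appending to one output file
  find my_data -name '*.rdf' -print0 | xargs -0 cat > my_data.rdf
  gzip my_data.rdf   # produces my_data.rdf.gz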

Not sure about your question, but the word "elapsed" means "elapsed time"; it is just a time counter.

I know that we have several pieces of code spread across repositories. This one is an example: benchmarks/convert/main.go at master · dgraph-io/benchmarks · GitHub. But I think this converts from CSV to RDF, or from a custom dataset (from Google) to RDF.

You can write your own code for this case based on the principles in that code.