Problem injecting data from the .NET client using asynchronous methods


(Nguyễn Thanh Phi (Backend MWG)) #1

Hi everyone,
I have an RDBMS with a huge amount of data (about 50 million records), which I am injecting into Dgraph using the .NET client.
With that much data I can’t run the import synchronously, so I run it asynchronously, and data gets lost: I injected 1 million records but only about 300k arrived. This does not happen when I run synchronously.
PS: when I run asynchronously, I always get this error: Assigning IDs is only allowed on leader.
About my server:
Linux Ubuntu 18.04
Dgraph version: 1.0.15
RAM 96GB, HDD 300GB
I have set up a full cluster with 3 zeros and 3 alphas.
Can anyone help me and explain this problem? Thanks very much.


(Michael Compton) #2

Hi,

Which .Net client are you using?


(Nguyễn Thanh Phi (Backend MWG)) #3

Hi. This is the .NET client I used


(Michael Compton) #4

Ah, that’s me. Happy to help you out.

Last release of that was for v1.0.14, so it might be a change in v1.0.15 that means I need to cut a new release.

Can you describe exactly how you are doing the insert, so I can try to reproduce it?


(Nguyễn Thanh Phi (Backend MWG)) #5

Very happy to see you :smiley:
I tried 3 methods to inject data:

  1. Use the normal method:
public static async Task AddJson(string json)
{
    var f = await transaction.Mutate(json);
}

static async Task Main()
{
    // json holds one JSON mutation built from the RDBMS data
    await AddJson(json);
}

=> this works
  2. Use the batching client (IDgraphBatchingClient), like this:

var node = await client.GetOrCreateNode("genre" + split[1]);
if (node.IsSuccess)
{
    var edge = Clients.BuildProperty(node.Value, "name", GraphValue.BuildStringValue(split[0]));
    if (edge.IsSuccess)
    {
        await client.BatchAddProperty(edge.Value);
    }
}

=> I inject the data but get no result. I ran 1 million records and nothing was imported into Dgraph. Can you fix it?
I followed your example here: https://github.com/MichaelJCompton/Dgraph-dotnet/blob/master/source/Dgraph-dotnet.examples/MovieLensBatch/MovieLensBatch.cs
  3. Use the asynchronous method:

public static async Task AddJson(string json)
{
    var f = await transaction.Mutate(json);
}

static void Main()
{
    // called without await
    AddJson(json);
}

=> this is not OK. I inject 1 million records but only 300k arrive.
The difference is: when I call AddJson(); it runs very fast but not all of the data is imported, and when I call await AddJson(); it runs slowly but all of the data is imported.


(Michael Compton) #6

After a quick look at your two versions with transaction.Mutate(json);, the difference is not to do with the library, but rather that in .NET, if you don’t await in main(), then the program will just exit.

So you are starting a task, but the whole program exits before it’s done. That’s why not all of the data is loaded.
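
To make that concrete, here is a minimal sketch of the difference. It does not touch Dgraph at all: FakeMutate is a hypothetical stand-in that just waits 200 ms, and the class and method names are made up for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FireAndForgetDemo
{
    // Hypothetical stand-in for a slow mutation; it does not call Dgraph.
    private static async Task FakeMutate(int[] counter)
    {
        await Task.Delay(200);
        Interlocked.Increment(ref counter[0]);
    }

    // Starts n tasks and reads the counter straight away, without awaiting.
    public static int CountWithoutAwait(int n)
    {
        var counter = new int[1];
        for (var i = 0; i < n; i++)
        {
            _ = FakeMutate(counter); // task started, never awaited
        }
        return Volatile.Read(ref counter[0]);
    }

    // Starts n tasks and waits for all of them before reading the counter.
    public static async Task<int> CountWithAwait(int n)
    {
        var counter = new int[1];
        var tasks = new Task[n];
        for (var i = 0; i < n; i++)
        {
            tasks[i] = FakeMutate(counter);
        }
        await Task.WhenAll(tasks);
        return Volatile.Read(ref counter[0]);
    }
}
```

The first method returns before any of the fake mutations has finished, which is the situation your program is in when Main returns; the second only returns once every task has completed.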

I suspect that the BatchingClient is broken. I’ve never used it much and it isn’t covered by any testing. The rest of the lib passes the test suite for Dgraph v1.0.15, so it should be ok.

It’s always going to take quite some work to inject a huge amount of data. I’d suspect that a good way with the .NET client is to run a number of JSON mutations in parallel. Maybe an even better way is to do a one-off transformation of the data to RDF and then use Dgraph’s bulk loader. That only works if you want a new database, not if you are adding to an existing one.

Does that help? Let me know if you need more help to get through what you are doing.


(Nguyễn Thanh Phi (Backend MWG)) #7

I hope you can fix the BatchingClient soon, because I believe it is the fastest way to inject data (faster than a normal mutate).

Here I use a loop, like this:

public static async Task AddJson(string json)
{
    var f = await transaction.Mutate(json);
}

static void Main()
{
    foreach (var item in listJson)
    {
        // called without await
        AddJson(item);
    }
}

(Michael Compton) #8

I’ll try to have a look at batching over the weekend … but it’s kinda a dead area in the code, so not sure.

I’m not sure it should be any faster - it just runs mutations in parallel. So running your own batching should be about the same. It also only really accepts triples and it looks like you’ve converted your data to json. Is that what you built from the RDBMS data?

That for loop still has the same problem - the program will spin quickly through the for loop starting tasks, but it doesn’t wait for the tasks to finish. So some will have started and maybe not finished.

In C# you’d need to start all those tasks and then wait for them to finish … something like

static async Task Main()
{
    // Start every mutation, then wait for all of them to finish (needs using System.Linq).
    var mutateTasks = listJson.Select(AddJson).ToList();
    await Task.WhenAll(mutateTasks);
}

Better if you collect up the return values and process those.

It might still take a while for the full dataset this way (Dgraph’s bulk loader is really optimised for these tasks). Worth doing some experiments to see which way is going to be quicker overall.
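
If you do experiment with running mutations in parallel, one way to keep it under control is to cap how many are in flight at once and count the ones that fail. This is only a rough sketch under some assumptions: ParallelLoader, LoadAll and the mutate delegate are hypothetical names, not part of the Dgraph-dotnet library. You would wrap your own transaction.Mutate(json) call in the delegate and return true when it succeeds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelLoader
{
    // mutate is a hypothetical delegate: wrap the real client call in it
    // and return true on success, false on failure.
    // Returns the number of items whose mutation reported failure.
    public static async Task<int> LoadAll(
        IEnumerable<string> items,
        Func<string, Task<bool>> mutate,
        int maxInFlight)
    {
        using (var gate = new SemaphoreSlim(maxInFlight))
        {
            var tasks = items.Select(async item =>
            {
                // Wait for a free slot so at most maxInFlight mutations run at once.
                await gate.WaitAsync();
                try
                {
                    return await mutate(item);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            // Do not return until every mutation has completed.
            var results = await Task.WhenAll(tasks);
            return results.Count(ok => !ok);
        }
    }
}
```

Called as something like var failed = await ParallelLoader.LoadAll(listJson, MyMutate, 20); from an async Main, where MyMutate and the limit of 20 are placeholders to tune in your own experiments. For the full 50 million records you would probably want to feed it one chunk at a time rather than build one task per record up front.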