Our mutation speed is a bit slow. When we insert RDF asynchronously to speed things up, the alpha returns the error “Transaction has been aborted. Please retry”. Does Dgraph support asynchronous inserts? In our scenario we can’t use bulk. How can we speed up the insert process?
Asynchronous transactions should be fine; however, if you are doing parallel inserts it is normal/expected for some transactions to abort if they write to the same nodes concurrently. Your async code should thus have some form of retry built into it. Alternatively, if you don’t need strong consistency, you could try adding the @noconflict directive to all of the predicates involved in the insert.
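For what it’s worth, a minimal retry loop in pydgraph might look like the following sketch; the function name, the `max_retries` value, and the error message are placeholders, not anything from your code:

```python
import pydgraph

def mutate_with_retry(client, nquads, max_retries=5):
    """Run one mutation, retrying when Dgraph aborts it due to a conflict."""
    for _ in range(max_retries):
        txn = client.txn()
        try:
            txn.mutate(set_nquads=nquads)
            txn.commit()
            return
        except pydgraph.AbortedError:
            continue  # another transaction touched the same nodes; start over
        finally:
            txn.discard()  # no-op if the commit already succeeded
    raise RuntimeError(f"mutation still aborting after {max_retries} retries")
```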
Also, to confirm: when you say you can’t use bulk, do you mean you can’t use the Dgraph bulk loader, or that you can’t batch multiple inserts into a single transaction? In my opinion batches are the most effective way to get good insert speed; lots of tiny transactions incurs a lot of overhead.
We currently have three alpha nodes, but there will be 10 clients writing data to the same graph simultaneously. Will this cause an error? We will definitely have more clients than alpha nodes.
It is a supported/encouraged pattern, but yes, it will cause these errors, because they are “normal” errors that occur in basically all transactional databases when you update the same records at the same time on different threads. Your code needs to retry any transaction that aborts due to a conflict.
If we use one client per alpha node to improve mutation speed, will there be any conflict in the mutations with multiple alphas and clients? All we do is mutate RDF into the database. For example, can three clients write RDF to the same graph through three alphas, and will this approach improve the mutation speed?
Yes: transaction conflicts caused by updating the same data simultaneously on different clients are unrelated to how many alphas you have. Transactional databases (Dgraph, Postgres, etc.) will see that two edits happened to the same data concurrently and intentionally fail one of them, as the result might not be deterministic. You can disable this per predicate with @noconflict.
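As a concrete (hypothetical) example, assuming a predicate called `name`, the directive goes in the schema, which you can push from pydgraph with an alter operation:

```python
import pydgraph

client_stub = pydgraph.DgraphClientStub("localhost:9080")  # illustrative address
client = pydgraph.DgraphClient(client_stub)

# Hypothetical predicate: @noconflict stops Dgraph from aborting concurrent
# transactions that write to it, at the cost of deterministic conflict handling.
client.alter(pydgraph.Operation(schema='name: string @noconflict .'))
```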
Re: adding more clients, it will improve speed up to the point where Dgraph is already under full load; beyond that it will not help any more. At that point you need to either scale Dgraph, or make the transactions themselves more efficient via batching.
If our data is already processed into RDF and randomly sent to different alphas for storage through a unified interface, will using @noconflict cause data loss? Alternatively, if we don’t use @noconflict, will there be conflicts given that the data is generated by a program and sent to different alphas?
Furthermore, may I ask if our syntax for connecting to multiple alphas is correct? In our sample code we send to three alphas each time. However, in our mutation test the speed remains basically unchanged when using a cluster of three alphas. Should we use more alphas to form a cluster, or use multiple threads to write to different alphas separately?
We would appreciate any advice that helps us improve the mutation speed.
```python
import pydgraph

stub_list = []
for alpha in alphaList:  # each entry is an address string, e.g. "host:9080"
    client_stub = pydgraph.DgraphClientStub(alpha)  # the stub takes a single "host:port" address
    stub_list.append(client_stub)
client = pydgraph.DgraphClient(*stub_list)
```
Different or the same alpha should make no difference. If you use @noconflict and two of your processes send different data at the same time (e.g. uid 0x01 name=“bob” vs. uid 0x01 name=“dave”), then it is not deterministic which name 0x01 will get.
If you have already processed the data into RDF, can you send e.g. 1000 RDFs in a single transaction? That is the most effective way to speed up your inserts.
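As a rough sketch of what that can look like in pydgraph (`client`, `nquad_lines`, the batch size, and the error handling are all placeholders to adapt):

```python
import pydgraph

BATCH_SIZE = 1000  # tune to your data; the point is many RDFs per transaction

# nquad_lines is assumed to be a list of N-Quad strings,
# e.g. '<_:a> <name> "bob" .'
for start in range(0, len(nquad_lines), BATCH_SIZE):
    batch = "\n".join(nquad_lines[start:start + BATCH_SIZE])
    txn = client.txn()
    try:
        txn.mutate(set_nquads=batch)
        txn.commit()
    except pydgraph.AbortedError:
        raise  # or plug in the retry logic discussed earlier in the thread
    finally:
        txn.discard()
```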
Re: Python, I am just a community member and not from Dgraph, so perhaps they can comment, as I use Go, not Python. To me your example looks correct.
We are transmitting multiple RDFs to the alpha list in a single transaction, and our data for mutation has already been processed into the format uid 0x01 name=“bob”, uid 0x01 name=“dave”. However, the current mutation speed only reaches 10,000 RDFs/s, which is too slow for practical use. Would using multiple alphas for mutation improve the speed? Or do you have any suggestions for reaching 50,000 RDFs/s?
Write speed depends on a lot of factors, but yes, multiple alphas should help as long as they are not replicas (e.g. set --replicas 1 in the startup command for Zero). In that case writes are divided between alphas, not copied to all of them.

Other important factors are the CPU/memory/disk speed available to the alphas, and how many different predicates are in your insert. Dgraph parallelises by predicate (i.e. roughly 1 CPU core per predicate), so if you only have, say, 4 predicates, you will be limited to roughly 4 cores. It would be good to know more about your schema, the machine(s) you are running on, and the current CPU usage of Dgraph as you load it. It is definitely capable of more than 50,000 RDFs per second.
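To tie the threads-plus-alphas idea together, here is one possible shape for it in Python; the addresses, worker count, and `chunks` variable are illustrative assumptions, and you would still want the retry handling discussed earlier:

```python
import pydgraph
from concurrent.futures import ThreadPoolExecutor

ALPHAS = ["alpha1:9080", "alpha2:9080", "alpha3:9080"]  # illustrative addresses

def load_chunk(alpha_addr, nquads):
    """One worker: its own stub and client, one batched transaction."""
    stub = pydgraph.DgraphClientStub(alpha_addr)
    client = pydgraph.DgraphClient(stub)
    try:
        txn = client.txn()
        try:
            txn.mutate(set_nquads=nquads)
            txn.commit()  # conflicts still need retry handling, omitted here
        finally:
            txn.discard()
    finally:
        stub.close()

# chunks is assumed to be a list of newline-joined N-Quad batches
with ThreadPoolExecutor(max_workers=len(ALPHAS)) as pool:
    for i, chunk in enumerate(chunks):
        pool.submit(load_chunk, ALPHAS[i % len(ALPHAS)], chunk)
```

In practice you would reuse one stub per worker rather than opening a new connection per chunk; this sketch just shows the round-robin shape of spreading batches across alphas.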