Comparison of Protocol Buffers vs. Flatbuffers

I ran benchmarks that marshal and then unmarshal a list of uint64 uids via Gogo Protocol Buffers and via Flatbuffers. Here are the results.

$ go test --bench . --benchmem
testing: warning: no tests to run
BenchmarkToAndFrom/Proto-10-4         	 3000000	       556 ns/op	     440 B/op	       7 allocs/op
BenchmarkToAndFrom/Flatb-10-4         	 2000000	       717 ns/op	     440 B/op	      12 allocs/op

BenchmarkToAndFrom/Proto-100-4        	  500000	      4255 ns/op	    3960 B/op	      10 allocs/op
BenchmarkToAndFrom/Flatb-100-4        	  500000	      3075 ns/op	    3128 B/op	      18 allocs/op

BenchmarkToAndFrom/Proto-1000-4       	   50000	     35672 ns/op	   35064 B/op	      13 allocs/op
BenchmarkToAndFrom/Flatb-1000-4       	  100000	     29088 ns/op	   26680 B/op	      24 allocs/op

BenchmarkToAndFrom/Proto-10000-4      	    2000	    533612 ns/op	  574200 B/op	      22 allocs/op
BenchmarkToAndFrom/Flatb-10000-4      	   10000	    287535 ns/op	  452920 B/op	      32 allocs/op

BenchmarkToAndFrom/Proto-100000-4     	     300	   4005594 ns/op	 6456064 B/op	      32 allocs/op
BenchmarkToAndFrom/Flatb-100000-4     	    1000	   3359384 ns/op	 3615039 B/op	      38 allocs/op

BenchmarkToAndFrom/Proto-1000000-4    	      50	  35891282 ns/op	63185667 B/op	      42 allocs/op
BenchmarkToAndFrom/Flatb-1000000-4    	     100	  21533788 ns/op	27765056 B/op	      44 allocs/op

BenchmarkToAndFrom/Proto-10000000-4   	       3	 336125866 ns/op	603431680 B/op	      52 allocs/op
BenchmarkToAndFrom/Flatb-10000000-4   	       5	 216030227 ns/op	416909635 B/op	      52 allocs/op
PASS
ok  	github.com/dgraph-io/experiments/flats	31.371s

The naming is as follows:
BenchmarkToAndFrom/<protocol>-<number of uids>-<number of cores>

Flatbuffers is clearly the winner beyond the first result, where we only had 10 uids. I don’t like the ugliness that Flatbuffers brings to our code base, but it clearly has a significant impact on our performance and memory allocations, so we should stick with it.

Benchmarking code is here:


I can try “packed” and “fixed64” some time soon. (uint64 is encoded as a varint, which takes longer to parse; non-packed means we have a tag ID for each element, I believe, which is less efficient when you have a large array.)

Update: I realize that in proto3, packed is turned on by default.

BenchmarkToAndFrom/Flatb-10-8         	 2000000	       636 ns/op	     440 B/op	      12 allocs/op
BenchmarkToAndFrom/Proto-10-8         	 3000000	       510 ns/op	     440 B/op	       7 allocs/op
BenchmarkToAndFrom/ProtoFixed-10-8    	 5000000	       311 ns/op	     424 B/op	       7 allocs/op
BenchmarkToAndFrom/Flatb-100-8        	  500000	      2772 ns/op	    3128 B/op	      18 allocs/op
BenchmarkToAndFrom/Proto-100-8        	  500000	      3745 ns/op	    3960 B/op	      10 allocs/op
BenchmarkToAndFrom/ProtoFixed-100-8   	 1000000	      1357 ns/op	    3832 B/op	      10 allocs/op
BenchmarkToAndFrom/Flatb-1000-8       	  100000	     22654 ns/op	   26680 B/op	      24 allocs/op
BenchmarkToAndFrom/Proto-1000-8       	   50000	     33528 ns/op	   35064 B/op	      13 allocs/op
BenchmarkToAndFrom/ProtoFixed-1000-8  	  200000	     10510 ns/op	   32760 B/op	      13 allocs/op
BenchmarkToAndFrom/Flatb-10000-8      	   10000	    217762 ns/op	  452920 B/op	      32 allocs/op
BenchmarkToAndFrom/Proto-10000-8      	    5000	    410033 ns/op	  574200 B/op	      22 allocs/op
BenchmarkToAndFrom/ProtoFixed-10000-8 	   10000	    130797 ns/op	  549624 B/op	      22 allocs/op
BenchmarkToAndFrom/Flatb-100000-8     	    1000	   2445013 ns/op	 3615040 B/op	      38 allocs/op
BenchmarkToAndFrom/Proto-100000-8     	     500	   3760438 ns/op	 6456064 B/op	      32 allocs/op
BenchmarkToAndFrom/ProtoFixed-100000-8         	    1000	   1679391 ns/op	 6259456 B/op	      32 allocs/op
BenchmarkToAndFrom/Flatb-1000000-8             	     100	  18922450 ns/op	27765056 B/op	      44 allocs/op
BenchmarkToAndFrom/Proto-1000000-8             	      50	  31862561 ns/op	63185665 B/op	      42 allocs/op
BenchmarkToAndFrom/ProtoFixed-1000000-8        	     100	  12257054 ns/op	61195008 B/op	      42 allocs/op
BenchmarkToAndFrom/Flatb-10000000-8            	      10	 194547323 ns/op	416909633 B/op	      52 allocs/op
BenchmarkToAndFrom/Proto-10000000-8            	       5	 301868641 ns/op	603431680 B/op	      52 allocs/op
BenchmarkToAndFrom/ProtoFixed-10000000-8       	      10	 109734757 ns/op	583508736 B/op	      52 allocs/op

Modified the code to add a GC cycle, but it didn’t have much impact on the CPU.
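One way to add a GC cycle without letting it pollute the measurement is to collect before the timer starts. This is only a sketch of the idea (benchmarkWithGC is a hypothetical helper, not the actual modification to the benchmark code):

```go
package main

import (
	"fmt"
	"runtime"
	"testing"
)

// benchmarkWithGC forces a collection before timing starts, so each
// benchmark measures steady-state cost rather than leftover garbage
// from setup or from earlier benchmarks.
func benchmarkWithGC(b *testing.B, work func()) {
	runtime.GC()   // clean the heap first
	b.ResetTimer() // exclude setup and the GC above from the timing
	for i := 0; i < b.N; i++ {
		work()
	}
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		benchmarkWithGC(b, func() {
			_ = make([]byte, 1024) // stand-in workload that allocates
		})
	})
	fmt.Println(res.N > 0)
}
```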

$ go test . -v
=== RUN   TestToAndFrom

Flatb k:10 sz:104
Fixed k:10 sz:82
Proto k:10 sz:100

Flatb k:100 sz:824
Fixed k:100 sz:803
Proto k:100 sz:999

Flatb k:1000 sz:8024
Fixed k:1000 sz:8003
Proto k:1000 sz:9987

Flatb k:10000 sz:80024
Fixed k:10000 sz:80004
Proto k:10000 sz:99936

Flatb k:100000 sz:800024
Fixed k:100000 sz:800004
Proto k:100000 sz:999216

Flatb k:1000000 sz:8000024
Fixed k:1000000 sz:8000005
Proto k:1000000 sz:9992219

Flatb k:10000000 sz:80000024
Fixed k:10000000 sz:80000005
Proto k:10000000 sz:99920961
--- PASS: TestToAndFrom (1.09s)
PASS
ok  	github.com/dgraph-io/experiments/flats	1.097s

Okay, so I did a couple more things. First, I tried running the GC inside the loop, but that just made all of them worse while leaving the comparison unchanged… so ignore that for now.

The second thing I did was to also iterate over the resulting structure (nl in the code) and make that iteration part of the timed loop. This was interesting, because it made FB much worse.
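The penalty makes sense: a decoded protobuf hands back a plain []uint64 that iterates at slice speed, whereas flatbuffer element accessors re-decode from the byte buffer on every call. A minimal sketch of a fixed64-style round trip with the iteration included in the measured work (encodeFixed/decodeFixed are hypothetical stand-ins for the real benchmark helpers):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeFixed mimics a fixed64-style wire format: 8 bytes per uid, no tags.
func encodeFixed(uids []uint64) []byte {
	buf := make([]byte, 8*len(uids))
	for i, u := range uids {
		binary.LittleEndian.PutUint64(buf[8*i:], u)
	}
	return buf
}

// decodeFixed reverses encodeFixed into a plain slice, so the subsequent
// iteration is a cheap slice walk rather than per-element decoding.
func decodeFixed(data []byte) []uint64 {
	nl := make([]uint64, len(data)/8)
	for i := range nl {
		nl[i] = binary.LittleEndian.Uint64(data[8*i:])
	}
	return nl
}

func main() {
	uids := []uint64{3, 1, 4, 1, 5}
	// The timed loop now covers marshal, unmarshal, and a full read of nl.
	nl := decodeFixed(encodeFixed(uids))
	var sum uint64
	for _, uid := range nl {
		sum += uid
	}
	fmt.Println(sum) // 14
}
```

At 8 bytes per uid this also matches the Fixed sizes in the test output above (sz ≈ 8·k plus a small header).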

$ go test --bench . --benchmem

Flatb k:10 sz:104
Fixed k:10 sz:82
Proto k:10 sz:100

Flatb k:100 sz:824
Fixed k:100 sz:803
Proto k:100 sz:999

Flatb k:1000 sz:8024
Fixed k:1000 sz:8003
Proto k:1000 sz:9987

Flatb k:10000 sz:80024
Fixed k:10000 sz:80004
Proto k:10000 sz:99936

Flatb k:100000 sz:800024
Fixed k:100000 sz:800004
Proto k:100000 sz:999216

Flatb k:1000000 sz:8000024
Fixed k:1000000 sz:8000005
Proto k:1000000 sz:9992219

Flatb k:10000000 sz:80000024
Fixed k:10000000 sz:80000005
Proto k:10000000 sz:99920961

BenchmarkToAndFrom/Flatb-10-4         	 2000000	       850 ns/op	     440 B/op	      12 allocs/op
BenchmarkToAndFrom/Fixed-10-4         	 3000000	       354 ns/op	     424 B/op	       7 allocs/op
BenchmarkToAndFrom/Proto-10-4         	 2000000	       607 ns/op	     440 B/op	       7 allocs/op

BenchmarkToAndFrom/Flatb-100-4        	  300000	      4289 ns/op	    3128 B/op	      18 allocs/op
BenchmarkToAndFrom/Fixed-100-4        	 1000000	      1671 ns/op	    3832 B/op	      10 allocs/op
BenchmarkToAndFrom/Proto-100-4        	  300000	      4304 ns/op	    3960 B/op	      10 allocs/op

BenchmarkToAndFrom/Flatb-1000-4       	   50000	     36327 ns/op	   26680 B/op	      24 allocs/op
BenchmarkToAndFrom/Fixed-1000-4       	  100000	     12521 ns/op	   32760 B/op	      13 allocs/op
BenchmarkToAndFrom/Proto-1000-4       	   50000	     38696 ns/op	   35064 B/op	      13 allocs/op

BenchmarkToAndFrom/Flatb-10000-4      	    5000	    357612 ns/op	  452920 B/op	      32 allocs/op
BenchmarkToAndFrom/Fixed-10000-4      	   10000	    140472 ns/op	  549624 B/op	      22 allocs/op
BenchmarkToAndFrom/Proto-10000-4      	    5000	    424528 ns/op	  574200 B/op	      22 allocs/op

BenchmarkToAndFrom/Flatb-100000-4     	     300	   4071541 ns/op	 3615040 B/op	      38 allocs/op
BenchmarkToAndFrom/Fixed-100000-4     	    1000	   1851157 ns/op	 6259455 B/op	      32 allocs/op
BenchmarkToAndFrom/Proto-100000-4     	     300	   4243123 ns/op	 6456063 B/op	      32 allocs/op

BenchmarkToAndFrom/Flatb-1000000-4    	      50	  31685823 ns/op	27765057 B/op	      44 allocs/op
BenchmarkToAndFrom/Fixed-1000000-4    	     100	  14294377 ns/op	61195010 B/op	      42 allocs/op
BenchmarkToAndFrom/Proto-1000000-4    	      50	  36688009 ns/op	63185665 B/op	      42 allocs/op

BenchmarkToAndFrom/Flatb-10000000-4   	       5	 324814643 ns/op	416909635 B/op	      52 allocs/op
BenchmarkToAndFrom/Fixed-10000000-4   	      10	 130693596 ns/op	583508736 B/op	      52 allocs/op
BenchmarkToAndFrom/Proto-10000000-4   	       3	 351247642 ns/op	603431680 B/op	      52 allocs/op
PASS
ok  	github.com/dgraph-io/experiments/flats	41.644s

Given these results, it’s clear that PBs are better than FB. This result makes me very happy. We have been putting up with the complicated API of Flatbuffers for a while, in the hope that it makes things faster for us, but given this benchmark, it’s clear that that’s not the case. The only place FB helps is in memory allocated per operation; in everything else, PB-Fixed is a clear winner.

This is super exciting, because it would make our code base a lot simpler and allow mutability of the structures, which @jchiu has been looking for.


Strongly DISLIKE that I can only put one LIKE here.

Is the decision final? There’s much to change if we choose to get rid of flatbuffers.


I think we can get rid of FB, yes! You have been complaining about it for a while, and while I’ve been defending FB, I wasn’t happy with what its API does to our code quality. FB is just cumbersome… in fact, I recently switched one new endpoint, Backup, to PB, and it’s just so much simpler.

So yeah, unless one of us can find a flaw in the tests, we’re good to go ahead with the switch from FB to PB!

I complain, but I didn’t do the benchmarking myself. Apologies for that.

I will be happy to make this migration to protos. It might be quite a heavy change.

Benchmarking is a great way to prove a point – numbers win arguments easily.

Yeah, you can do it. In fact, I think we might not even need algo.UIDList anymore, and we can use the memory management techniques to do filtering in place, etc. Let’s do it in a series of PRs, each no more than 300 lines of changes or so… even if they break the build, that’s alright. This is a good chance to clean up our code base and refactor as well, so I want to review every PR.

Good luck! And happy to discuss anything related to this.

