PostTraverse and PreTraverse

Here are some benchmarks.

clear && go test -bench ToJSON$  2>/dev/null | grep -E  'allocs'

  100000	     16909 ns/op	    9514 B/op	     164 allocs/op
   10000	    163722 ns/op	  104197 B/op	    1436 allocs/op
    2000	   1093900 ns/op	  706185 B/op	    9758 allocs/op
  100000	     16467 ns/op	    9466 B/op	     164 allocs/op
   10000	    112430 ns/op	   64920 B/op	    1083 allocs/op
    2000	    791547 ns/op	  442335 B/op	    8241 allocs/op

clear && go test -bench ToJSONWithPre$  2>/dev/null | grep -E  'allocs'

  200000	      9126 ns/op	    5696 B/op	     103 allocs/op
   10000	    111573 ns/op	   73663 B/op	    1201 allocs/op
    2000	    806249 ns/op	  530448 B/op	    8430 allocs/op
  200000	      9341 ns/op	    5601 B/op	     103 allocs/op
   20000	     89101 ns/op	   57184 B/op	     961 allocs/op
    2000	    750159 ns/op	  483459 B/op	    8044 allocs/op

clear && go test -bench ToJSONWithPost$  2>/dev/null | grep -E  'allocs'

  100000	     15524 ns/op	    9377 B/op	     161 allocs/op
   10000	    155709 ns/op	  103303 B/op	    1398 allocs/op
    2000	   1060164 ns/op	  695905 B/op	    9445 allocs/op
  100000	     15717 ns/op	    9330 B/op	     161 allocs/op
   10000	    107320 ns/op	   62986 B/op	    1033 allocs/op
    2000	    750482 ns/op	  424873 B/op	    7698 allocs/op

clear && go test -bench ToPB$  2>/dev/null | grep -E  'allocs'

  500000	      3795 ns/op	    1121 B/op	      29 allocs/op
   30000	     47851 ns/op	   11207 B/op	     273 allocs/op
    5000	    364500 ns/op	   81201 B/op	    1889 allocs/op
  500000	      3869 ns/op	    1086 B/op	      29 allocs/op
   30000	     42084 ns/op	   10960 B/op	     302 allocs/op
    5000	    384766 ns/op	  106124 B/op	    2766 allocs/op

clear && go test -bench ToPBWithPre$  2>/dev/null | grep -E  'allocs'

  300000	      4123 ns/op	     983 B/op	      35 allocs/op
   30000	     50798 ns/op	   11778 B/op	     362 allocs/op
    5000	    376102 ns/op	   86275 B/op	    2515 allocs/op
  300000	      4171 ns/op	     954 B/op	      35 allocs/op
   30000	     44413 ns/op	   11379 B/op	     370 allocs/op
    5000	    399554 ns/op	  110832 B/op	    3345 allocs/op

clear && go test -bench ToPBWithPost$  2>/dev/null | grep -E  'allocs'

  200000	      6901 ns/op	    3815 B/op	      78 allocs/op
   20000	     65827 ns/op	   41504 B/op	     505 allocs/op
    3000	    455502 ns/op	  305112 B/op	    3182 allocs/op
  200000	      7144 ns/op	    3771 B/op	      78 allocs/op
   50000	     38545 ns/op	   22374 B/op	     253 allocs/op
   10000	    177752 ns/op	  108960 B/op	     747 allocs/op

For protos, we do marshal the output. The WithPre suffix means we use a preTraverse routine that is generalized to handle any kind of output; the WithPost suffix means we use a similarly generalized postTraverse routine.

Each benchmark produces six lines: the first three are for actors with 10, 100, and 1000 entities, and the last three are for directors with 10, 100, and 1000 entities.
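
For reference, the benchmarks have roughly this shape (a sketch for illustration only; loadActorGraph and toJSON are assumed names, and the real code is in query/benchmark_test.go):

    import "testing"

    // Hypothetical shape only; the helper names are assumed, not the repo's.
    func benchActorsToJSON(b *testing.B, entities int) {
        sg := loadActorGraph(entities) // assumed helper: builds the SubGraph
        b.ReportAllocs()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            if _, err := toJSON(sg); err != nil { // assumed routine under test
                b.Fatal(err)
            }
        }
    }

    // Matched by `go test -bench ToJSON$`; each prints one line of the output above.
    func Benchmark10ActorsToJSON(b *testing.B)   { benchActorsToJSON(b, 10) }
    func Benchmark100ActorsToJSON(b *testing.B)  { benchActorsToJSON(b, 100) }
    func Benchmark1000ActorsToJSON(b *testing.B) { benchActorsToJSON(b, 1000) }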

In theory, postTraverse is a lot faster than preTraverse when the posting lists in a UID matrix have a lot of overlap with one another. I personally prefer postTraverse and would recommend switching to using only the generalized postTraverse. The biggest gain seems to be ToPBWithPost: for one large test case, it is about twice as fast as ToPB or ToPBWithPre, and it also uses less memory. However, for another large test case, it seems to be worse.
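
To illustrate the intuition, here is a toy sketch (not our actual SubGraph code; the children map and string rendering stand in for the UID matrix and the real serialization). preTraverse renders a shared node once per path that reaches it, while a memoized postTraverse renders it exactly once:

    package main

    import "fmt"

    // Toy graph: uid -> child uids. Nodes 2 and 3 overlap: both point at 4.
    var children = map[uint64][]uint64{
        1: {2, 3},
        2: {4},
        3: {4},
    }

    // pre renders top-down: the shared node 4 is rendered once per path.
    func pre(uid uint64) string {
        out := fmt.Sprint(uid)
        for _, c := range children[uid] {
            out += "(" + pre(c) + ")"
        }
        return out
    }

    // post renders bottom-up with memoization: node 4 is rendered exactly
    // once, no matter how many parents point at it.
    func post(uid uint64, memo map[uint64]string) string {
        if s, ok := memo[uid]; ok {
            return s
        }
        out := fmt.Sprint(uid)
        for _, c := range children[uid] {
            out += "(" + post(c, memo) + ")"
        }
        memo[uid] = out
        return out
    }

    func main() {
        fmt.Println(pre(1))                       // 1(2(4))(3(4)): 4 rendered twice
        fmt.Println(post(1, map[uint64]string{})) // same string, 4 rendered once
    }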

The tests are passing, so I think these generalized routines are probably correct.

Based on the benchmarks, ToJSONWithPre and ToPBWithPre seem to be the clear winners in almost all cases except maybe one. So, why is your recommendation different?

In theory, PostTraverse could do a lot less work than PreTraverse when there is a lot of overlap between posting lists. It seems to do worse mainly for the smaller test cases. I also think there is still much room for improvement in PostTraverse. For example, we can probably do without those two hash maps from UID to results.
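
One direction I have in mind (purely a sketch, not the current code): since UID lists are kept sorted, the per-UID results could live in a slice parallel to the sorted UID list and be found by binary search instead of hashing:

    package main

    import (
        "fmt"
        "sort"
    )

    type results struct {
        uids []uint64 // sorted ascending, like a posting list
        vals [][]byte // vals[i] holds the result for uids[i]
    }

    // get locates a uid by binary search, avoiding a map[uint64][]byte.
    func (r *results) get(uid uint64) ([]byte, bool) {
        i := sort.Search(len(r.uids), func(i int) bool { return r.uids[i] >= uid })
        if i < len(r.uids) && r.uids[i] == uid {
            return r.vals[i], true
        }
        return nil, false
    }

    func main() {
        r := &results{
            uids: []uint64{3, 7, 42},
            vals: [][]byte{[]byte("a"), []byte("b"), []byte("c")},
        }
        if v, ok := r.get(7); ok {
            fmt.Println(string(v)) // b
        }
    }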

In short, my “gut feeling” is that PostTraverse is the better solution in the long term, but it would need more work. Perhaps until then, we want to keep both around, or switch to using PreTraverse only.

If the main argument is that PostTraverse wins on large data sets, when there’s a lot of overlap, I’d suggest going about it systematically. Divide up the overlap from 0% to 100% in increments of 10% or so, pick a large data set size, draw a graph, and then see which approach wins.

You can similarly do this for data sets of increasing size. 10, 100, 1000 is just a start. Try with 10K, 100K, 1M, 10M and see what works there.

Based on the existing data, Pre works better. But if you have a gut feeling that Post would win in the above scenarios, then it makes sense to generate new data, plot it, and prove it.
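
Such a sweep could be driven with Go 1.7 sub-benchmarks; buildGraph and postTraverseGeneric below are placeholder names, not anything from the repo:

    import (
        "fmt"
        "testing"
    )

    // Experiment matrix: overlap 0%..100% in 10% steps, at several sizes, so
    // the results can be plotted as overlap vs ns/op, one curve per size.
    func BenchmarkTraverseOverlap(b *testing.B) {
        for _, size := range []int{10000, 100000, 1000000} {
            for step := 0; step <= 10; step++ {
                overlap := float64(step) / 10
                name := fmt.Sprintf("size=%d/overlap=%d%%", size, step*10)
                b.Run(name, func(b *testing.B) {
                    sg := buildGraph(size, overlap) // assumed generator
                    b.ResetTimer()
                    for i := 0; i < b.N; i++ {
                        postTraverseGeneric(sg) // assumed routine under test
                    }
                })
            }
        }
    }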

Consider ToPB. The first half below is for “Pre” and the second half is for “Post”. The setup is as follows. One node has 5000 neighbors, and each of these neighbors has exactly one neighbor. However, these 5000 neighbors of neighbors can have a lot of overlap. We impose the condition that the number of distinct neighbors of neighbors is N, and take N = 1, 1000, 2000, 3000, 4000, 5000. For N = 5000, there is no overlap at all; for N = 1, there is maximum overlap. Each of these neighbors of neighbors has about 50 attribute values. Here are the results:

      20	  95105101 ns/op	39867804 B/op	 1060127 allocs/op
      10	 125182514 ns/op	39873173 B/op	 1060197 allocs/op
      10	 125382636 ns/op	39872832 B/op	 1060199 allocs/op
      10	 124874339 ns/op	39873363 B/op	 1060196 allocs/op
      10	 123708389 ns/op	39873888 B/op	 1060204 allocs/op
      10	 119633949 ns/op	39871388 B/op	 1060201 allocs/op

      50	  26269342 ns/op	 7304595 B/op	   16400 allocs/op
      20	  76039460 ns/op	26236110 B/op	  375427 allocs/op
      10	 116598704 ns/op	44899539 B/op	  734520 allocs/op
      10	 144591727 ns/op	58164858 B/op	 1093705 allocs/op
      10	 185467374 ns/op	82228436 B/op	 1452670 allocs/op
       5	 211235918 ns/op	95277497 B/op	 1810790 allocs/op

We see that when there is maximum overlap, Post is about four times faster. When there is no overlap, Post is about twice as slow.
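
To make the shape concrete, here is a sketch of a generator along these lines (an illustration, not the actual benchmark code; the ~50 attribute values per node are omitted):

    // One root with fanout neighbors; each neighbor points at one of n
    // distinct grandchildren, so n directly controls the overlap
    // (n=1 is maximum overlap, n=fanout is none).
    func buildFanoutGraph(fanout, n int) map[uint64][]uint64 {
        g := make(map[uint64][]uint64, fanout+1)
        root := uint64(0)
        for i := 0; i < fanout; i++ {
            child := uint64(1 + i)
            g[root] = append(g[root], child)
            grandchild := uint64(1 + fanout + i%n) // only n distinct targets
            g[child] = []uint64{grandchild}
        }
        return g
    }

With buildFanoutGraph(5000, 1), all 5000 edges converge on a single grandchild; with buildFanoutGraph(5000, 5000), every edge gets its own.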

It seems strange that all the Pre runs have about the same number of allocations per op and bytes per op. In fact, there isn’t much variation in the time taken either, for 5 out of the 6 cases. I think there is some mistake here.

I took a look at the code and don’t see anything obviously wrong. Both benchmarks used the same function to generate the test cases. And when I think about it, it is not so surprising that the memory usage is all similar: PreTraverse does not take any overlap information into account, so it does the same amount of work as we vary the number of unique child SubGraphs while keeping the total number of nodes/SubGraphs constant. I would expect the running time to be similar as well.
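
A crude cost model (my own back-of-envelope, nothing from the code) makes the flat numbers plausible:

    package main

    import "fmt"

    func main() {
        // Pre walks every edge of the expanded tree, so its work is
        // independent of n; a memoized Post builds each of the n distinct
        // grandchildren once.
        fanout, values := 5000, 50
        preVisits := fanout * (1 + values) // 255000, the same for every n
        for _, n := range []int{1, 1000, 5000} {
            postVisits := fanout + n*(1+values)
            fmt.Printf("n=%d pre=%d post=%d\n", n, preVisits, postVisits)
        }
    }

By this model, Post approaches Pre’s amount of work as N grows, and the constant overhead of the two hash maps would explain why it ends up slower at N = 5000.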

That said, when I run the tests, I see warnings like:

Group: 0. List: [{NodeId:1 Addr:localhost:12345 Leader:true RaftIdx:36}]
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 lists.go:169: Dirty map size: 0
2016-11-08 18:01:08.301162536 +0800 MYT: Syncing memberships
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}
2016/11/08 18:01:08 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:12345: getsockopt: connection refused"; Reconnecting to {"localhost:12345" <nil>}

I am not sure if this makes the benchmarks inaccurate. I might have to disable Raft to make it work. Sometimes I wonder if a master-slave setup with some replication for masters (like ZooKeeper) could make our life easier while not giving up all that much in practice. Currently, I set my test runs to use only one or two CPUs, as using more sometimes causes the tests to fail…
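
(For the record, restricting the CPU count needs no code changes; the stock -cpu flag of go test does it, e.g.:

go test -bench ToPB$ -cpu 1,2 2>/dev/null | grep -E 'allocs'

which runs each benchmark with GOMAXPROCS set to 1 and then 2.)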

Sounds like you are initializing Raft stuff. Can I have a look at your code?

Here is the code:

https://github.com/dgraph-io/dgraph/tree/improve/prepost

The benchmarks are run in the following file:

https://github.com/dgraph-io/dgraph/blob/improve/prepost/query/benchmark_test.go

Thanks!
