Supernode Problem

How does Dgraph deal with the supernode problem?
For example:
  • On Twitter, the most followed person has ~112 million followers
  • On Instagram, Instagram's own account has ~331 million followers

That means a single posting list for such a vertex:follower edge is:
331 million * 64 bits (UID size) ~= 21,184,000,000 bits ~= 2.7 GB!

How does Dgraph deal with:

  1. Queries/upserts that try to get all followers? (Even with filters, does Dgraph load all followers and then apply the filter, maybe leading to OOM? See the query sketched after this list.)
  2. Will writes (adding/removing a follower) be impacted, since recreating the posting list will be expensive?
  3. What is the impact on the cache? (For example, changes to a posting list are cached in memory and flushed eventually.) Could this lead to OOM as well?
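
For illustration, the kind of "get all followers with a filter" query I have in mind is roughly the sketch below (the UID 0xf1 and the verified/name predicates are made up for this example):

{
  # 0xf1 is a placeholder UID for the most-followed account.
  # "verified" and "name" are made-up predicates, just for illustration.
  celebrity(func: uid(0xf1)) {
    follower @filter(eq(verified, true)) {
      name
    }
  }
}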

I understand that dealing with such supernodes is challenging, so the intention behind this question is to:

  • Understand the limitations of, and support in, Dgraph for the supernode problem
  • Make sure I and the community are aware of these problems and can design our queries/schema/sharding accordingly.

I'll leave the other questions to whoever is familiar with the deep technical details, but this statement caught my attention.

PS. I reserve my right to be wrong :stuck_out_tongue:

Are you sure about this statement? From my perspective, it follows that 0x1 doesn't take 64 bits on disk, and I don't think that 0x1 has the same size as 0x5AF3107A4000 on disk, even if it is recorded as characters (which I guess it isn't).

For sure it will take some space on disk, like any other identifier system, but it wouldn't be a deterministic 64 bits x 331 million; the allocation would grow with the value. I think 64 bits is the potential combination space for the size (which is pretty big).

Also, I think a UID is different from things like a hash, Base64, or a UUID (128-bit), and other types that have fixed fields following a pattern. For example, SHA-512 always has 128 hex characters, which is bigger than 512 bits on disk (technically it is approximately 128 bytes). So a UID isn't an encryption system.

Another example: let's say you have UIDs ranging from 0x1 up to 0x5AF3107A4000 (3 bytes up to 14 bytes). It is worth mentioning that UIDs are hexadecimal, combining numbers and letters, so you can have a lot of nodes with a small number of characters.

If you have 310 million of them, you would have at most 542.5 megabytes (310 million * 14 bits). And if you pass a billion nodes, well, that's where you'll have bigger values on disk.

So, if I'm right, the OOM won't happen in the circumstances you mentioned. But of course, giant databases are quite a challenge.

Cheers.

@MichelDiz
So a UID isn't fixed at 64 bits; it is up to 14 bytes (112 bits).

I got confused by the Dgraph docs below:
"Dgraph creates a unique 64 bit identifier" (https://docs.dgraph.io/mutations/#triples)

Assuming a UID takes a max of 14 bytes, 331 million UIDs would take 4.6 GB:
331,000,000 * 14 = 4,634,000,000 bytes ~= 4.6 GB, higher than what my original post said

3 bytes per UID would mean:
331,000,000 * 3 = 993,000,000 bytes ~= 0.99 GB

Taking the average (not scientific, but for the sake of discussion) would mean
~2.8 GB, close to what my original post said :slight_smile:

So for a single update that inserts/deletes 1 follower (which I would imagine is a very frequent operation for popular accounts; see the mutation sketched after this list):

  1. The posting list needs to be recreated, since posting lists are sorted?
  2. Does this mean the complete vertex:follower posting list needs to be in memory?
  3. Does this not cause any OOM?
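
To be concrete, by a single update I mean a one-triple mutation like the sketch below (0xf1 and 0xabc are placeholder UIDs; removing a follower would be the same triple under a delete block):

{
  set {
    # Add one follower edge to the popular account (placeholder UIDs).
    <0xf1> <follower> <0xabc> .
  }
}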

UIDs are 64 bits (8 bytes), but compressed using integer compression techniques. I have seen around 10x compression, resulting in approximately 270 MB of posting list for a list of 331 million followers. We split the posting list into multiple key-values in Badger, each one no larger than 1/2 MB. Of course, querying such a list would take more time than usual, given that we would have to read and process all this data. Querying parts of the list should be relatively efficient. Let us know if you have more questions, or if any of the questions are still unanswered.


What condition triggers the splitting of a posting list? Is it based on size, i.e. when a given posting list gets bigger than 1/2 MB?

When a posting list gets smaller, are multiple posting lists merged back together?

A supernode would cause queries to slow down, but wouldn't cause memory issues?
For example, to intersect the followers of two supernodes (see the sketch below)?
Would all the data be loaded into main memory? i.e. ~270 * 2 ~= 540 MB for a single query!
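
For instance, an intersection roughly like this sketch (0xf1 and 0xf2 are placeholder UIDs for the two accounts):

{
  var(func: uid(0xf1)) {
    fa as follower
  }
  var(func: uid(0xf2)) {
    fb as follower
  }

  # Followers common to both accounts.
  common(func: uid(fa)) @filter(uid(fb)) {
    count(uid)
  }
}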

Mostly, yes. This doesn't happen all the time, only when we merge all the deltas (during the process called Rollup).

We do that too, but only during Rollups.

Probably more!

So there is no check in place to detect it? Maybe not yet, since there's no query planner yet. (It's part of the 2020 roadmap.)

Is there a plan to gracefully deny the request with, say, a "not enough memory" error?
Right now it looks like this can cause an OOM, depending on what other queries happen to be running in parallel.

For now, that should be the case. It would be interesting to try it out and see how Dgraph behaves. We haven’t done anything specific to handle this case as of now.

I have one more question related to this.
Let's say someone has a million tweets, with, say, a schema like the one below:

type Person {
    tweet
}
type Tweet {
    message
    dateandtime
}

tweet: [uid] .
message: string .
dateandtime: dateTime @index(hour) .

If I wish to get the latest 10 tweets, or get tweets from a specific time range, how will Dgraph perform? Maybe well, since posting lists are already sorted?
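
Concretely, I'm thinking of queries roughly like this sketch (the UID 0x1, the timestamps, and the block names are placeholders, and I haven't benchmarked any of it):

{
  # Latest 10 tweets of one person.
  latest(func: uid(0x1)) {
    tweet (orderdesc: dateandtime, first: 10) {
      message
      dateandtime
    }
  }

  # Tweets in a specific time range (across all users here), using the hour index at the root.
  window(func: ge(dateandtime, "2020-03-01T10:00:00")) @filter(lt(dateandtime, "2020-03-01T12:00:00")) {
    message
    dateandtime
  }
}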

If your query ends up using the datetime index, it would search within a particular hour and should be efficient. For the 10 latest tweets, I don't think we have an optimization in place yet. In any case, I would suggest you benchmark your queries.