Introducing Time

Time. When working with a transactional system, understanding time – and, in particular, getting time “right” – is a critical requirement. This becomes even more critical when working with multiple servers: we need to know the ordering of events to ensure data consistency.

To date, transactions in a Dgraph database have been managed by a single machine, and therefore by a single source of time. Yet as the need for larger and larger databases grows, having a single source of time – while conceptually simple – is insufficient for our needs. We need a distributed notion of time and events, and we need a mechanism that is provably consistent.

In a short while, Dgraph Labs will be introducing a new top-level repository: github.com/dgraph-io/time. To start, we will focus on a library to support Hybrid Logical Clocks. While this library does not in and of itself “solve” the issue of distributed transactions and temporal consistency, it is a critical component.

What Is a Hybrid Logical Clock?

To understand what a Hybrid Logical Clock is, it is first helpful to understand what a “Lamport Clock” is. After understanding how to use a Lamport Clock to order events, we can then discuss Hybrid Logical Clocks.

Lamport Clocks

A “Lamport Clock” defines a mechanism for using a counter to define the “happens-before” operator between two events in a distributed system. When talking about events, Lamport describes them as occurring on “processes”, where each process is effectively single-threaded. The processes may share the same machine, or they may be split across different machines. In any case, these processes are expected to be able to send and receive messages to each other, as well as perhaps have events that occur only on the given process. As each event occurs, we will increment an integer counter.

Let there be three “flavors” of events: the first “flavor” is an event that does not require communicating with any other process, the second is where one process sends a message to another process, and the third is where a process receives a message from another. We will call these “self” events, “send” events, and “receive” events, respectively. For a given process, we require that these events never occur concurrently.

Now, each process will behave according to the following pseudo-code:

var lamportTime int64 = 0

// processSelfEvent advances the counter for an event local to this process.
func processSelfEvent() {
    lamportTime += 1
}

// sendEvent stamps an outgoing message with the current Lamport time.
func sendEvent(msg MessageType) {
    processSelfEvent()
    send(msg, lamportTime)
}

// receiveEvent advances the local counter past the sender's timestamp
// before processing the message.
func receiveEvent(msg MessageType, remoteTime int64) {
    lamportTime = max(remoteTime, lamportTime)
    processSelfEvent()
    handleMessage(msg)
}

Let us define a “happens-before” operator -->, and let e1 --> e2 be read as “e1 happens before e2”.

On a given process, we might have a sequence of events, e1, e2, … ek. Since we asserted that these events are never concurrent, if event e1 is adjacent and “right before” event e2 in the sequence, then we can denote this as: e1 --> e2. In fact, we can assert: e1 --> e2, e2 --> e3, … Moreover, we can define the --> operator so that it has the transitive property: if e1 --> e2 and e2 --> e3 then e1 --> e3. If we do this, it follows that e1 --> ek.

If event e1 is a send event, and event e2 is a receive event on a different process, then we also denote this as e1 --> e2.

Finally, we complete the definition of --> so that this operator exhibits the transitive property even in the face of cross-process events.

Based on this, we should be able to see that our “happens before” operator is directly captured by the pseudo-code above. For an event e, we denote Clock(e) as the Lamport clock time for the event. If events e1 and e2 occur on the same process, then e1 --> e2 implies Clock(e1) < Clock(e2). Next, if e1 is a send event on some process P1, and e2 is the corresponding receive event on process P2, then we have already established that e1 --> e2. Moreover, this implies that Clock(e1) < Clock(e2), since the receive event takes the timestamp from event e1 and makes sure that its own event time is greater than that value. Finally, for any chain of events e0, e1, … ek such that e0 --> e1, e1 --> e2, and so on up to ek, we will find the corresponding results:

Clock(e0) < Clock(e1) < ... < Clock(ek)

Since the < operator is transitive over integers, the transitive-closure result e0 --> ek likewise implies Clock(e0) < Clock(ek).

However, although “e1 happens before e2” implies that the timestamps follow the correct ordering, we do not get a “total ordering” of events: we cannot arbitrarily compare the Lamport times on different processes and discern which event happened first.

We can have the sequence:

e1, e2, … ek

on one machine, and a related sequence:

E1, E2, … Em

Let us assume that the second sequence occurs on another machine, that ek is a send event, and that Em is the corresponding receive event. Even given this, we still cannot determine whether e1 occurred before or after E1.

We can actually “force” a total ordering of events by arbitrarily deciding that lower values of Clock(e) imply “happens earlier”, and then, if Clock(a) == Clock(b), breaking ties with a machine ID. In this way, we can use this total ordering of events to create safe distributed transactions. However, this allows for some odd anomalies and is not desirable: for one thing, the event ordering might have very little to do with the actual time.
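
As a minimal sketch of this forced total ordering (the stampedEvent type and happensEarlier function are illustrative names, not part of any library):

type stampedEvent struct {
    clock     int64 // Lamport clock value recorded at the event
    machineID int64 // arbitrary but unique identifier per process
}

// happensEarlier reports whether event a is ordered before event b in the
// forced total order: lower Lamport time first, with the machine ID as the
// tie-breaker.
func happensEarlier(a, b stampedEvent) bool {
    if a.clock != b.clock {
        return a.clock < b.clock
    }
    return a.machineID < b.machineID
}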

In Lamport’s paper, he described a mechanism that can remove some of these anomalies, but it relied on having physical clocks with “reasonable” precision and agreement (the precise notion of “reasonable” is given in his paper). However, all this does is suggest that the logical timestamp alone is not really sufficient. We need a real clock. We need to be able to look at a timestamp and answer the question: “did this occur at 3am or at 4am?”

This should give us some motivation for the Hybrid Logical Clock.

Hybrid Logical Clocks From Lamport Clocks

In Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases, we are introduced to the Hybrid Logical Clock. I will refer to this paper below as the “HLC paper”.

As discussed before, one of the shortcomings of Lamport clocks is that they are unrelated to the actual time of an event, so trying to pin down an exact time value for operations (or sometimes even an approximate time) is problematic.

However, we also mentioned that for a Lamport clock, “if we just had a sufficiently accurate clock,” we could establish a total ordering of events without anomalies. This is the starting point for a Hybrid Logical Clock: we define a clock to have both a “wall time” and a “logical time” (i.e., a “Lamport time”).

Consider the following pseudo-code (deployed separately for each independent process):

var lamportTime int64 = 0
var physicalTime int64 = time.Now().UnixNano()

// processSelfEvent advances the wall-clock component if real time has moved
// forward; otherwise it increments the logical component.
func processSelfEvent() {
    oldPhysicalTime := physicalTime
    physicalTime = max(oldPhysicalTime, time.Now().UnixNano())
    if physicalTime == oldPhysicalTime {
        lamportTime += 1
    } else {
        lamportTime = 0
    }
}

// sendEvent stamps an outgoing message with both clock components.
func sendEvent(msg MessageType) {
    processSelfEvent()
    send(msg, physicalTime, lamportTime)
}

// receiveEvent merges the remote timestamp into the local clock before
// processing the message.
func receiveEvent(msg MessageType, remotePhysical, remoteLamport int64) {
    oldPhysicalTime := physicalTime
    physicalTime = max(oldPhysicalTime, time.Now().UnixNano(), remotePhysical)
    if physicalTime == oldPhysicalTime {
        if physicalTime == remotePhysical {
            lamportTime = max(lamportTime, remoteLamport) + 1
        } else {
            lamportTime += 1
        }
    } else if physicalTime == remotePhysical {
        lamportTime = remoteLamport + 1
    } else {
        lamportTime = 0
    }
    handleMessage(msg)
}

Consider also the comparison of two hybrid timestamps:

// compareHybridTs compares the first hybrid timestamp (physical1, lamport1)
// with the second. It returns -1 if the first timestamp "happens before" the
// second, 0 if the timestamps are equal (the events may be concurrent), and
// 1 if the second timestamp "happens before" the first.
func compareHybridTs(physical1, lamport1, physical2, lamport2 int64) int {
    if physical1 != physical2 {
        return sign(physical1 - physical2)
    }
    return sign(lamport1 - lamport2)
}
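
For completeness, here is a minimal sign helper as assumed by compareHybridTs above (the paper leaves it implicit):

// sign returns -1, 0, or 1 according to whether x is negative, zero,
// or positive.
func sign(x int64) int {
    switch {
    case x < 0:
        return -1
    case x > 0:
        return 1
    default:
        return 0
    }
}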

Based on these definitions and the results we found for the Lamport clock, we can obtain a partial ordering of events. Moreover, if the clocks on the different machines are sufficiently close, we can claim that this is a total ordering, or at the very least “close to” a total ordering.

There are some details we are skipping here: the precise definition of “close to”, and how precisely to use this to establish safe distributed transactions (in particular, how to deal with potential anomalies). We will save this discussion for when we incorporate the HybridClock into the main Dgraph product.

In terms of implementation, the paper suggests using a standard 64-bit signed integer to encode both the physical time (wall time) and the logical time (Lamport time). The most significant 48 bits are reserved for the physical time and the lower 16 bits are reserved for the logical time. The paper’s experimental results show that this is “good enough” with respect to being better than NTP synchronization. A bit of math will tell you that this allows for a physical time precision of roughly 65 microseconds (2^16 nanoseconds = 65,536 nanoseconds ≈ 65.5 microseconds). Given that optimal practical NTP accuracy is roughly on the order of a millisecond, 65 microseconds can be considered “good enough”.
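
As a rough sketch of this encoding (the pack48 and unpack48 helpers are illustrative, not the library’s actual API):

const paperLogicalBits = 16

// pack48 keeps the most significant 48 bits of the wall-clock nanoseconds
// and stores the logical counter in the low 16 bits.
func pack48(wallNanos, logical int64) int64 {
    mask := int64(1)<<paperLogicalBits - 1
    return (wallNanos &^ mask) | (logical & mask)
}

// unpack48 recovers the (truncated) wall time and the logical counter.
func unpack48(ts int64) (wallNanos, logical int64) {
    mask := int64(1)<<paperLogicalBits - 1
    return ts &^ mask, ts & mask
}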

However, with the advent of clock synchronization mechanisms like Google’s TrueTime, or with Huygens/Huygens-R synchronization, precision on the order of tens of microseconds seems feasible, so it would not be surprising for implementations to deviate from this suggestion.

Choices, Details, History

Clearly, we are not the first to embark on this noble road. Two notable production systems, CockroachDb and YugabyteDb, already use Hybrid Logical Clocks. We happily learn from our predecessors here and acknowledge their contributions.

CockroachDb hlc library

The CockroachDb hlc library can be found in the CockroachDb repository. We initially developed our hlc library based on this code-base, though we have since diverged from it significantly. A few things are worth noting:

First, the CockroachDb Timestamp is encoded using Google’s protobuf libraries, with the physical time as a 64-bit int and the logical time as a 32-bit int. As such, it diverges from the encoding presented by the paper, allowing much more fine-grained access to time as well as a very large space to absorb potential overflow.
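
Conceptually, their Timestamp looks something like the following (the field names here are illustrative; the actual type is a generated protobuf message):

type Timestamp96 struct {
    WallTime int64 // physical time, in nanoseconds since the Unix epoch
    Logical  int32 // logical (Lamport) component
}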

It is a bit unclear (to this author) whether the decision to diverge from the paper’s encoding – a 64-bit int with 48 physical-time bits and 16 logical bits – was based on experimental evidence, or whether there was another motivation behind it (such as predicting that over time, a 64-bit int would simply not suffice). We don’t seriously second-guess this decision on their part, as they undoubtedly had good reason. Nevertheless, it is a bit unfortunate that they lose the ability to encode the Timestamp using a 64-bit int – and hence need to move away from existing APIs.

At the same time, their decision to encode it using the protobuf libraries is interesting: it immediately implies a serialization mechanism.

At Dgraph Labs, we have not yet decided on how we wish to serialize Timestamps, and will only make that decision when integrating into our Dgraph main product.

Another interesting note is that while their “Update” function appears intended to mimic the “receive” event described in the paper, they diverge here: their “Update” function really just sets the current Timestamp held by their Clock to the later of the two values (either the local Timestamp value or the remote Timestamp value). Again, it is unclear (to this author) what motivated this particular divergence from the original paper. (By the time you read this post, it would not be surprising if this has been amended by the CockroachDb authors.)
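
To make the distinction concrete, a sketch of that “Update” behavior might look like the following (illustrative only, not CockroachDb’s actual code), reusing compareHybridTs from above:

// update keeps the later of the local and remote hybrid timestamps, rather
// than running the paper's full receive algorithm.
func update(localPhysical, localLamport, remotePhysical, remoteLamport int64) (int64, int64) {
    if compareHybridTs(remotePhysical, remoteLamport, localPhysical, localLamport) > 0 {
        return remotePhysical, remoteLamport
    }
    return localPhysical, localLamport
}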

YugabyteDb

YugabyteDb suggests using a 64-bit int for the Timestamp, but instead of having 48 physical-time bits and 16 logical bits, they use 52 and 12 bits, respectively. This provides roughly 4-microsecond precision for their Timestamp (2^12 nanoseconds = 4,096 nanoseconds). However, fewer logical bits are available. Will that make a difference?

Consider a situation where we have two machines M1 and M2, where M1’s clock is “faster than” M2’s by a full minute. Whenever M1 sends a message to M2, it will push M2’s hybrid time ahead relative to M2’s native wall clock. So, if M2 generates “enough” events before its own wall clock catches up, M2 will keep incrementing the logical clock portion rather than updating its physical time. If it generates 4,096 messages during that minute (on average 68+ messages per second for a full minute), then overflow is possible.

Generating 68 or 69 messages per second seems quite feasible when serving a large number of micro-transactions. So, inherently, YugabyteDb needs to keep clocks in tighter synchronization than the algorithm presented in the paper requires. However, in a “healthy” datacenter, this should not be a problem.

It is conceivable, however, that this possible overflow issue is what prompted the CockroachDb developers to use a 96-bit Timestamp in the first place.

The Timestamp interface and the implementations

At Dgraph Labs, we have not forced a decision on which format to use for the Timestamp. In fact, we simply model a Hybrid Clock Timestamp with the interface:

type Timestamp interface {
	WallTime() int64
	LogicalTime() int32
	// ... plus other functions ...
}

This will live in the package github.com/dgraph-io/time/hlc.

In terms of implementation, we have defined a 64-bit Timestamp – which follows the 52/12 breakdown of physical vs. logical time – and a 96-bit Timestamp – which follows the structure used by CockroachDb. In addition, we have defined a configurable 64-bit Timestamp, where the number of logical bits is defined by a small configuration object.

When we deploy to the main Dgraph product, we will most likely use the 64-bit implementation, since it is the simplest to integrate. However, by exposing everything through the Timestamp interface, rather than via a specific implementation, we retain the flexibility to switch our choice of implementation.

In addition, we recognize that one size may not fit all: others who simply want to use our Hybrid Logical Clock solution might have different needs and preferences with respect to the implementation of the Timestamp.

Note that a disadvantage of using the Timestamp interface is that while you can technically (from a code standpoint) mix and match the implementation styles, this is not a recommended practice, because rounding between Timestamps of different precisions can cause odd behaviors. For example, for a 64-bit Timestamp ts64 and a 96-bit Timestamp ts96, ts64.Equal(ts96) does not imply ts96.Equal(ts64), since the 64-bit timestamp compares rounded representations. It is best to pick an implementation style and stick to it on a given deployment.
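
As a hypothetical illustration (the constructors below are stand-ins, not the library’s actual API):

wall := time.Now().UnixNano()

// ts96 keeps the full 64-bit wall time; ts64 keeps only the top 52 bits.
ts96 := hlc.NewTimestamp96(wall, 0) // hypothetical constructor
ts64 := hlc.NewTimestamp64(wall, 0) // hypothetical constructor

// ts64.Equal rounds both sides down to 52 bits of wall time, so it can
// report the two timestamps as equal...
ts64.Equal(ts96) // may be true
// ...while ts96.Equal compares at full precision, sees the wall-time bits
// that ts64 discarded, and reports the two as different.
ts96.Equal(ts64) // may be false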

The HybridClock interface

In our time repository, we will have the package github.com/dgraph-io/time/hlc, which will define:

type HybridClock interface {
	// NowAsTimestamp provides an hlc.Timestamp instance that should match
	// the current time. In terms of the Hybrid Logical Clock algorithms
	// described by the paper, this follows the notion of a "self" event.
	NowAsTimestamp() Timestamp

	// Update updates the clock's internal Timestamp instance provided
	// that remoteTime is later than or equal to the currently held
	// Timestamp. This mimics the approach taken by CockroachDb.
	Update(remoteTime Timestamp)

	// Receive updates the current clock time according to the
	// "receiveEvent" algorithm described in the HLC paper.
	Receive(remoteTime Timestamp)
}

There are a few other elements in the interface, but they are not sufficiently interesting to discuss here. The main mechanism is to create a new HybridClock instance using one of three functions:

// NewClock96 creates a HybridClock instance that internally uses a 64-bit
// physical clock and a 32-bit logical clock, similar to CockroachDb.
func NewClock96() HybridClock { ... }

// NewClock64 creates a HybridClock instance that internally uses 64-bit
// timestamps with 52 bits for physical time and 12 bits for logical time.
// This follows the approach taken by YugabyteDb.
func NewClock64() HybridClock { ... }

// NewClock64WithConfig creates a HybridClock instance where each Timestamp
// instance uses the provided number of logical bits, but otherwise uses
// 64 bits for encoding.
func NewClock64WithConfig(logicalBits uint16) HybridClock { ... }
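
As a minimal usage sketch (remoteTs stands in for a timestamp that arrived with a message from another node; send and msg are as in the earlier pseudo-code):

clock := hlc.NewClock64()

// Stamp a local ("self") event and attach it to an outgoing message.
ts := clock.NowAsTimestamp()
send(msg, ts)

// On receiving a message carrying remoteTs, advance the clock using the
// HLC paper's receive algorithm...
clock.Receive(remoteTs)

// ...or, CockroachDb-style, simply adopt the later of the two timestamps.
clock.Update(remoteTs)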

Going Forward

So far, we have defined the hlc library, and while this is a reasonable start, we do expect to see some changes in the future. In particular, as we start to integrate this into our main product, we are likely to add more. When we do this, we will of course be cognizant of not breaking backwards-compatibility unless we have a very strong reason to do so.

We expect that the most likely source of change will be defining serialization mechanisms for the Timestamp object. Currently, you can always get the underlying bit representations and serialize those, but it might be better to have a more “definite” way of performing this serialization.

In addition, the time repository contains the timesource package. We have not yet fully developed this, but the idea of creating a kind of stable clock (nicknamed “Metronome”) is in the works. A “Metronome” is expected to smooth out sudden changes introduced by clock synchronization mechanisms driven by external sources, like NTP.

Finally – and this is by no means certain – we are at least toying with the idea of implementing the Huygens algorithm – or a derivative of it – for clock synchronization. If implemented, this is likely to sit on top of the “Metronome” idea.

