Adding value type to posting list

The NQuads format allows us to specify the type of a literal using ^^ at the end of the literal. For example, integers and doubles can be specified as follows:

<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/atomicNumber> "2"^^<http://www.w3.org/2001/XMLSchema#integer> . 
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/specificGravity> "1.663E-4"^^<http://www.w3.org/2001/XMLSchema#double> .   

Right now, if a type is specified, we append @@ and then the type to the end of the value. So we store the integer above as 2@@http://www.w3.org/2001/XMLSchema#integer in the posting list.
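To make that concrete, here is a minimal sketch of the current tagging scheme; the helper names are hypothetical, not the actual Dgraph functions:

package main

import (
    "fmt"
    "strings"
)

// tagValue mimics the current scheme: the literal and its type URL are
// joined with "@@" before being written to the posting list.
func tagValue(value, typeURL string) string {
    if typeURL == "" {
        return value
    }
    return value + "@@" + typeURL
}

// untagValue splits the stored string back into the literal and its type URL.
func untagValue(stored string) (value, typeURL string) {
    parts := strings.SplitN(stored, "@@", 2)
    if len(parts) == 2 {
        return parts[0], parts[1]
    }
    return parts[0], ""
}

func main() {
    s := tagValue("2", "http://www.w3.org/2001/XMLSchema#integer")
    fmt.Println(s)             // 2@@http://www.w3.org/2001/XMLSchema#integer
    fmt.Println(untagValue(s)) // 2 http://www.w3.org/2001/XMLSchema#integer
}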

For geospatial data, I plan to store the data in a binary format rather than the string representation of the JSON/WKT. In that case, having an @@ suffix that specifies the type is awkward and cumbersome, since it has to be parsed at query time just to figure out how to interpret the byte stream.

Instead, I propose we add a ValueType string field to the DirectedEdge structure that gets persisted to the posting list. At query time, we can look at the type of the data and parse the value accordingly (i.e. as a string, integer, geospatial data, etc.). Thoughts?
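As a rough sketch of the shape of this change (the fields around ValueType are illustrative placeholders, not the actual DirectedEdge definition):

package posting // illustrative package name

// DirectedEdge sketch: only ValueType is the proposed addition; the other
// fields are placeholders for whatever the real struct contains.
type DirectedEdge struct {
    Entity    uint64 // source node UID (placeholder)
    Attribute string // predicate name (placeholder)
    Value     []byte // raw value bytes persisted in the posting list
    ValueType string // proposed: how to parse Value, e.g. "int", "geo", "string"
}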

You’re right that we don’t need to store the entire RDF URL. But it could be useful for us when we do this dynamic typing at query or mutation validation time.

The idea is to keep our schema separate from our storage. The value type is defined in the schema, and the storage might be anything. You might start off with no schema at all, and we just store whatever you give us. Then you add a schema which directs us to parse things as integers. The values we can successfully parse constitute valid results; the ones we can’t are ignored. In other words, if we can parse the data as the value type defined in the schema at query time, then it’s valid data; otherwise, it’s not.

There’s a lot of advantage in this approach. For example, you could change the data type on the fly: instead of parsing the stored bytes as integers, you could just change the schema and start parsing them as floats, or as bools. No changes to the underlying data store are required.
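As a sketch of what that schema-directed parsing could look like at query time (the type names and helper here are assumptions, not actual Dgraph code):

package main

import (
    "fmt"
    "strconv"
)

// parseBySchema interprets the raw stored value according to whatever type
// the schema currently declares. Changing the schema changes the
// interpretation; the stored bytes never change.
func parseBySchema(raw string, schemaType string) (interface{}, bool) {
    switch schemaType {
    case "int":
        v, err := strconv.ParseInt(raw, 10, 64)
        return v, err == nil
    case "float":
        v, err := strconv.ParseFloat(raw, 64)
        return v, err == nil
    case "bool":
        v, err := strconv.ParseBool(raw)
        return v, err == nil
    default: // no schema: treat as a plain string
        return raw, true
    }
}

func main() {
    // The same stored value under three schemas; values that fail to parse
    // would simply be dropped from the results.
    for _, t := range []string{"int", "float", "bool"} {
        v, ok := parseBySchema("1969", t)
        fmt.Println(t, v, ok)
    }
}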

The same goes for more complex types like actors, directors, or teachers. Whether an entity is of a certain type depends on the fields defined for it, and on whether we can successfully parse their values into the defined scalars. This allows a single entity to belong to multiple types, similar to how Go implements interfaces.
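For readers unfamiliar with the Go comparison: a Go type satisfies an interface just by having the right methods, with no explicit declaration, much like an entity here belongs to a type just by having fields that parse to the declared scalars. A minimal illustration:

package main

import "fmt"

type Actor interface{ ActedIn() []string }
type Director interface{ Directed() []string }

// Person never declares that it implements Actor or Director; it belongs to
// both types simply because it has both methods.
type Person struct{ acted, directed []string }

func (p Person) ActedIn() []string  { return p.acted }
func (p Person) Directed() []string { return p.directed }

func main() {
    p := Person{acted: []string{"Film A"}, directed: []string{"Film B"}}
    var a Actor = p
    var d Director = p
    fmt.Println(a.ActedIn(), d.Directed())
}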

If you’re storing data in JSON format, then ideally it should be a string, and you can parse it into the right format at query time (or at indexing time). That’d follow the same design as how we treat every other value.

Unless you have specific reasons why your value data can’t be interpreted at query time. If so, we can discuss how to change the design for that purpose.

I think all of this is great, but the storage format still has a type. The schema information can be used to figure out how to interpret the stored data at query/mutation time, but we can achieve much more compact storage if we store the data natively.

For example, say we want to store the integer 1969. Currently we store it as the string "1969@@<long url here>". We could instead store it as the two bytes 0x07 0xB1 with the type uint16.

This still allows us to do dynamic typing: after we read the data from RocksDB and decode the binary representation, we can still dynamically cast it to whatever type we want. For example, if the schema asks for a string, we return "1969" as a string rather than an int.
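A minimal sketch of that round trip using the standard library (big-endian to match the 0x07 0xB1 example; the conversion is illustrative, not Dgraph's actual code):

package main

import (
    "encoding/binary"
    "fmt"
    "strconv"
)

func main() {
    // Store 1969 natively as a big-endian uint16: 2 bytes instead of the
    // "1969@@<long url here>" string.
    buf := make([]byte, 2)
    binary.BigEndian.PutUint16(buf, 1969)
    fmt.Printf("% x\n", buf) // 07 b1

    // At query time, decode the bytes and then cast to whatever type the
    // schema asks for, e.g. a string.
    n := binary.BigEndian.Uint16(buf)
    fmt.Println(strconv.FormatUint(uint64(n), 10)) // "1969"
}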

Here is how this applies to geodata. For example, take the following JSON representation of a polygon geometry:

{
  "type": "Polygon",
  "coordinates": [[
    [2.8124, 51.18802228498421],
    [2.8124, 53.4376],
    [4.21885, 53.4376],
    [4.21885, 52.07174517270202]
  ]]
}

If we store this data in a binary format (WKB), it is basically uninterpretable without knowing that it is stored as WKB. The flip side is storing the data as JSON itself (or in the text format, WKT), which is fairly verbose and would be inefficient both for storage and for network transfer.

My proposal is that instead of always storing the data as strings (or JSON representations), we store it in a binary format. We can still apply dynamic typing to queries and mutations, since that type information is applied after reading the data from storage. But we still need a way to interpret the sequence of bytes held in storage as a particular type. We can do that by associating an optional byte with the data that indicates how to parse the byte stream (as WKB, uint16, double, etc.). The default format could be a string (in UTF-8).
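A sketch of what that leading type byte could look like; the tag values and helper names are assumptions for illustration (length checks omitted for brevity):

package main

import (
    "encoding/binary"
    "errors"
    "fmt"
    "math"
)

// Hypothetical one-byte tags describing how to interpret the payload.
const (
    tagString byte = 0x00 // default: utf-8 string
    tagUint16 byte = 0x01
    tagDouble byte = 0x02
    tagWKB    byte = 0x03
)

// encode prefixes the payload with its type tag before storage.
func encode(tag byte, payload []byte) []byte {
    return append([]byte{tag}, payload...)
}

// decode reads the tag back and interprets the remaining bytes accordingly.
func decode(stored []byte) (interface{}, error) {
    if len(stored) == 0 {
        return nil, errors.New("empty value")
    }
    tag, payload := stored[0], stored[1:]
    switch tag {
    case tagString:
        return string(payload), nil
    case tagUint16:
        return binary.BigEndian.Uint16(payload), nil
    case tagDouble:
        return math.Float64frombits(binary.BigEndian.Uint64(payload)), nil
    case tagWKB:
        return payload, nil // hand the raw WKB bytes to a geometry parser
    default:
        return nil, fmt.Errorf("unknown type tag %#x", tag)
    }
}

func main() {
    v := make([]byte, 2)
    binary.BigEndian.PutUint16(v, 1969)
    out, _ := decode(encode(tagUint16, v))
    fmt.Println(out) // 1969
}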


Hmm… it seems like a fair proposal. My concern is about edge cases where, if we muck with the data, we won’t be able to parse it back.

For example, say the value for film.year is “1969@@”. But not all the data looks like that; some of it has just “1969”. In the first case we convert it to an int representation, but the second one gets stored as a UTF-8 string (well, all Go strings are UTF-8). I think your point would be that this is alright because we can still parse them back as directed by the user. Is there any case here where we would be unable to parse the data back, on a schema change or an input change?

Think through this, @kostub and @minions. I’d be curious if we can find such edge cases. If we can’t, then the proposal sounds fair and we could implement it. The main principle of storage and schema separation should NOT be violated, though.

Some data to consider:

I stored the boundaries of every single zip code in the US in dgraph. There are 33K zip codes, and each node had 2 attributes: the value of the zip code and the boundary.

The original size of the data is a 307MB gz file where the boundaries are in WKT format. The total storage for each format was:

WKT       1.0 GB
WKB       0.8 GB
GeoJson   1.1 GB

So the storage impact of using the text format is pretty small. The binary format (WKB) only saves us 20% of space.

However, the cost of parsing these formats is quite different. The following are the numbers for parsing each of the formats in memory. The test uses 50 sample polygons from the above file. Note: many of the polygons are pretty large, which is why the times are so high.

BenchmarkParseWKTGeos-4               10         106975905 ns/op
BenchmarkParseWKBGeos-4               50          24209906 ns/op
BenchmarkParseWKBPayne-4             100          11519626 ns/op
BenchmarkParseGeojsonPayne-4           3         367024136 ns/op
BenchmarkParseGeojsonMach-4            5         268319752 ns/op

Here Geos, Payne and Mach are the different libraries I used to parse the data. Payne and Mach are pure Go implementations, while Geos is a wrapper around libgeos, a library written in C++.

Parsing WKB is an order of magnitude faster than WKT, which in turn is 2-3 times as fast as GeoJson.
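For reference, numbers like these come from Go's standard benchmark harness; a sketch of its shape, with the parse call standing in for whichever library (Geos, Payne, Mach) is being measured:

package geo_test

import "testing"

// samplePolygons would hold the 50 sample polygons read from the zip-code
// boundary file before the benchmark runs.
var samplePolygons [][]byte

// parseWKB is a placeholder for the decoder under test, not a real import.
func parseWKB(data []byte) error { return nil }

func BenchmarkParseWKB(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for _, p := range samplePolygons {
            if err := parseWKB(p); err != nil {
                b.Fatal(err)
            }
        }
    }
}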

Caveats:

  • These benchmarks are just for parsing data already in memory and do not include fetching the data from RocksDB. Reading from RocksDB might be much more expensive than parsing, in which case these comparisons are not that useful.
  • The comparisons were done on fairly large polygons. The performance characteristics may not be the same when reading simpler shapes or single points.

I’ll run benchmarks which include fetching the data from rocksDB and see how that makes a difference.

Note: There is no pure Go library for parsing WKT. Payne plans to add it, but it is not yet fully implemented.


This looks really promising. Here’s what I’d suggest: can you take a small detour and implement your data parsing and conversion ideas for the existing data set? As we discussed, I couldn’t think of any concrete edge cases where what you proposed would cause data loss. And the parsing optimization seems very interesting, so it’s worth a try.

Generate similar benchmarks for other data, like dates, ints, floats, etc. Or not, because it’s pretty obvious that those would parse faster from a binary representation than from strings. So I’d say let’s just do it. If it helps in benchmarks, it’s going to help overall.

We do that once, and then we keep the data in memory at least until we exceed the memory threshold. So, access is pretty fast after the first load.

Don’t worry about it. As I said, we keep the PL in memory.

We can just ask our users to give us data in WKT or GeoJSON format?

So the only question is that when we translate the data to its canonical value for that type, the user loses the original string. IMO that is not such a big deal: if they truly wanted to keep the original value (for whatever reason), they can always create a separate predicate with a string value for it.

We’ll use GeoJson only for now. The library that parses WKT brings in an additional dependency on a C++ library. twpayne is working on adding WKT support to his library, so whenever it is ready, we can support parsing WKT data directly. For now, we can just ask our users to convert their data to GeoJson before uploading.

