ANTLR for parsing languages and specs

mrjn · November 22, 2016, 3:12am

Hey @minions,

The code in our parsers have sort of gone nuts – We have in total 2447 lines of hand written code to parse GQL, and 1000 lines of code to parse RDF. And we have plans to support more languages, like Cypher.

I’ve been thinking about switching over to a parser generator – someone suggested ANTLR4. Doing some search, I realized Cypher already gives out a grammar for ANTLR4; so our job with parsing Cypher would be much easier.

To keep things consistent, and to decrease the amount of hand written code in GQL and RDF packages – I think it makes sense to switch over to ANTLR for those as well.

We’re already swamped with work, so this would be a good project to undertake for a new hire.

robbert229 · November 26, 2016, 7:57am

For completely unrelated reasons to dgraph I am currently in the process of using antlr to generate a parser for cypher and graphql. Since dgraph is opensource if you would like the help, I am always up to contributing.

mrjn · November 26, 2016, 10:38am

Hey @robbert229,

Open Cypher already contains an ANTLR grammar.
http://www.opencypher.org/

With regards to GraphQL, we have made several modifications to GraphQL. GraphQL in its current shape doesn’t work for Graph Databases; so we’ve added graph features in it (like filters, geo, etc.) and removed some of the unnecessary features. It would be great to have an ANTLR grammar generated for this modification, that we could hand out to other graph stores like Cayley, who also wish to support GraphQL.

gpsamson · November 30, 2016, 9:19pm

Hey guys,

I’ve done some research and there’s actually a GraphQL grammar in ANTLR’s repo.

It seems to be a year old, and I’m sure the GraphQL spec has changed a lot and DgraphQL’s (that name fits well) is different as well. I’m up for building a grammar for DgraphQL and updating the parsers. This is something that would take me a month or two as I’m fairly new.

As you mentioned @mrjn cypher already has a grammar package so switching ANTLR would make the implementation of cypher much easier. Furthermore, ANTLR will let users define their own grammar if they prefer, or even modify DgraphQL to fit their needs.

Going to be reading the contribution deadlines and be playing around with Dgraph’s codebase for the week or two to fully understand the moving parts.

mrjn · November 30, 2016, 11:00pm

2 months sounds long. We might move fairly quickly on this. Can’t call it DgraphQL :-). Because, we’re proposing our version of GraphQL as a language of choice for graph databases. Cayley folks are also interested in contributing.

The first thing to do would be to pick up the rdf module and see the sort of performance penalty we have to make to switch that to ANTLR.

gpsamson · November 30, 2016, 11:20pm

I agree, 2 months is long - much of that time is a buffer to understand Dgraph well. I can definitely cut the time down with some guidance from the Dgraph team.

I’ll just take it step by step. I’ll be reading up on the contribution guidelines and getting my machine up and running for development this week. Also, going to be doing a little exploration of the codebase on my own. I’ll reply when I start working towards ANTLR generation and performance comparisons. I’m going to message on Slack if I have specific questions about the codebase if you guys don’t mind.

mrjn · November 30, 2016, 11:41pm

Feel free to always reach out to us about questions. Slack is okay for quicker one-minute questions, discuss is better for long form questions.

gpsamson · November 30, 2016, 11:50pm

I’ll use Slack for questions about the codebase. And use discuss for the actual ANTLR development.

gpsamson · December 2, 2016, 7:22am

@mrjn

What packages would you recommend I put a little more focus on to start ANTLR development. You mentioned picking up the rdf module in an earlier post. I see that the rdf module is heavily used in the mutation handlers. How about the query handlers? Would you suggest picking up the gql module as well?

mrjn · December 2, 2016, 12:59pm

For now, just focus on the rdf module. I think it’s a simpler starting point.

gpsamson · December 6, 2016, 10:34am

I was able to spend some time tonight with ANTLR and threw together a simple RDF grammar. Here’s an example of the parse tree for the simple mutation: <alice> <name> "Alice" .

Taking a look back and forth between the docs (https://wiki.dgraph.io/Queries_and_Mutations) and the RDF module (https://github.com/dgraph-io/dgraph/tree/master/rdf) I’ve notice that there is no validation for the IRI values. For example, I can pass in a subject like $#$% and it will still parse successfully. Will values like this be any trouble for the mutation handler? Also, are there any other validation points I need to know about besides the ones in the docs (https://wiki.dgraph.io/Queries_and_Mutations)?

Throughout the week I’ll be putting together a more complete grammar and comparing the performance of the current RDF module and the ANTLR generated modules.

Just wanted to edit this post with the spec I’ll be following to build the ANTLR grammar: RDF 1.1 N-Quads

mrjn · December 8, 2016, 10:11am

This looks great! The spec you’re following is right. We should follow that as well. Once the grammar is complete, we can just add a couple of more unicode runes which we need to make things work; if necessary. Otherwise, we’ll ensure that the rest of our code abides by RDF rules.

gpsamson · December 8, 2016, 6:20pm

Also, Dgraph’s blank nodes will vary from W3’s spec. Dgraph’s blank node looks like _new_:<identifier> and the W3 spec looks like _:<identifier>. Other than that the grammar should abide by W3’s spec.

mrjn · December 9, 2016, 9:44pm

Filed:

https://github.com/dgraph-io/dgraph/issues/374

Also, we’ll probably use <0xid> to represent UIDs, instead of the _uid_.

gpsamson · December 9, 2016, 9:54pm

Cool, I’ll keep an eye on that issue. For now I’ll build upon the _new_ and _uid_ syntax until they are merged and fixed. I’ve constructed the RDF grammar and I’m building out the listeners in Go. I will push to a forked repo this weekend.

gpsamson · December 18, 2016, 10:27am

The ANTLR4 implementation is now working successfully. Checkout https://github.com/gpsamson/dgraph/blob/improvement/antlr4-rdf/rdf. There are still some things to do such as object types, language tags, validation and formatting, and benchmark tests. As well as proper error handling. Other than that most mutations from https://wiki.dgraph.io/Query_Language_Spec were successful.

Todo:

Object Types
Language Tag
Validation and Formatting (strip brackets, valid characters)
Edit parse_test + Run benchmarks
Proper Error Handling
Vendor in ANTLR4
Format, Code Review, PR and Chill

If I could improve my code in any please let me know. I’m thinking of moving the listener into its own file. Other than that, I’ve been learning a lot about Go and Dgraph.

Motakjuq · January 30, 2017, 7:22am

Sorry to revive this thread, but I didn’t see any update about this, for this reason I’ve been doing some test about antlr4 with GraphQL and it’s incredible slow if you compare with the current parser. I used the same grammar linked previously with some changes, I didn’t spend so much time on this because I wanted to see the performance that antlr4 has.

Some result of my tests:

Antlr4
I just ran one test with a very simple query because I needed more time to adapt the grammar from GraphQL to GraphQL+- to perform different test cases, but after I saw the bad result I didn’t spend more time.

BenchmarkQueryParser/test-4                 2000            986301 ns/op          205508 B/op       5806 allocs/op

Current Parser
I took queries from test files and also I took the queries defined on https://wiki.dgraph.io/Query_Language and ran some tests with the following results:

--- FAIL: BenchmarkCurrentQuery/directors
--- FAIL: BenchmarkCurrentQuery/movies
BenchmarkCurrentQuery/filters-4                   100000             11594 ns/op            6600 B/op         73 allocs/op
BenchmarkCurrentQuery/geq1-4                      200000             12165 ns/op            6192 B/op         63 allocs/op
--- FAIL: BenchmarkCurrentQuery/date
BenchmarkCurrentQuery/alias-4                     300000              5028 ns/op            4688 B/op         41 allocs/op
BenchmarkCurrentQuery/first_after-4               100000             10555 ns/op            6672 B/op         64 allocs/op
BenchmarkCurrentQuery/offset-4                    100000             10365 ns/op            6640 B/op         64 allocs/op
BenchmarkCurrentQuery/generic-4                   200000              9136 ns/op            6224 B/op         61 allocs/op
BenchmarkCurrentQuery/count-4                     200000              5789 ns/op            4976 B/op         45 allocs/op
BenchmarkCurrentQuery/allof-4                     200000              8286 ns/op            5776 B/op         59 allocs/op
--- FAIL: BenchmarkCurrentQuery/root
BenchmarkCurrentQuery/anyof-4                     200000              8972 ns/op            6032 B/op         63 allocs/op
--- FAIL: BenchmarkCurrentQuery/anyof_at_root
BenchmarkCurrentQuery/leq-4                       200000             11685 ns/op            6096 B/op         63 allocs/op
BenchmarkCurrentQuery/geq-4                       100000             10667 ns/op            6128 B/op         63 allocs/op
--- FAIL: BenchmarkCurrentQuery/geo-near
--- FAIL: BenchmarkCurrentQuery/geo-within
--- FAIL: BenchmarkCurrentQuery/geo-contains
--- FAIL: BenchmarkCurrentQuery/geo-intersects
BenchmarkCurrentQuery/filters-or-4                100000             11191 ns/op            6448 B/op         75 allocs/op
BenchmarkCurrentQuery/filters-and-4               100000             11278 ns/op            6448 B/op         75 allocs/op
BenchmarkCurrentQuery/order-4                     200000             10751 ns/op            5952 B/op         55 allocs/op
BenchmarkCurrentQuery/orderdesc-4                 200000              9308 ns/op            6048 B/op         56 allocs/op

Try to not consider the FAIL cases, I had to spend more time configuring the tests with the indexes to get some result on thoses cases, but I didn’t because the results are pretty clear. As you see, the performance between antlr4 and the Current Parser has a difference of around 90x, maybe I could have done some improvement in the grammar used with antlr4 but I doubt to get something like the current parser.

Other tests
I also tried another code generator who parse PEG grammar, so I adapted part of the antlr4 grammar and I added some new non-terminals and terminals to be able to parse the syntax of GraphQL+-, I don’t believe that it’s completed, but it can parse the examples of the Wiki and from some defined tests on git repo (gql directory). The results are obtained using a smaller AST as default.

BenchmarkQuery/directors-4                100000             13685 ns/op            9013 B/op         26 allocs/op
BenchmarkQuery/movies-4                   100000             10945 ns/op            4917 B/op         25 allocs/op
BenchmarkQuery/filters-4                  200000              7176 ns/op            4021 B/op         25 allocs/op
BenchmarkQuery/geq1-4                     200000              6781 ns/op            3893 B/op         25 allocs/op
BenchmarkQuery/date-4                     200000              6910 ns/op            3893 B/op         25 allocs/op
BenchmarkQuery/alias-4                    300000              3775 ns/op            3541 B/op         26 allocs/op
BenchmarkQuery/first_after-4              200000              6822 ns/op            3765 B/op         25 allocs/op
BenchmarkQuery/offset-4                   200000              6706 ns/op            3765 B/op         25 allocs/op
BenchmarkQuery/generic-4                  200000              6219 ns/op            3637 B/op         25 allocs/op
BenchmarkQuery/count-4                    300000              3750 ns/op            3189 B/op         25 allocs/op
BenchmarkQuery/allof-4                    200000              5184 ns/op            3509 B/op         25 allocs/op
BenchmarkQuery/root-4                     300000              4398 ns/op            3349 B/op         25 allocs/op
BenchmarkQuery/anyof-4                    200000              5366 ns/op            3509 B/op         25 allocs/op
BenchmarkQuery/anyof_at_root-4            300000              4583 ns/op            3349 B/op         25 allocs/op
BenchmarkQuery/leq-4                      200000              5947 ns/op            3637 B/op         25 allocs/op
BenchmarkQuery/geq-4                      200000              5963 ns/op            3765 B/op         25 allocs/op
BenchmarkQuery/geo-near-4                 300000              4339 ns/op            3445 B/op         25 allocs/op
BenchmarkQuery/geo-within-4               200000              6859 ns/op            4277 B/op         25 allocs/op
BenchmarkQuery/geo-contains-4             300000              4399 ns/op            3509 B/op         25 allocs/op
BenchmarkQuery/geo-intersects-4           200000              6827 ns/op            4277 B/op         25 allocs/op
BenchmarkQuery/filters-or-4               200000              6226 ns/op            3765 B/op         25 allocs/op
BenchmarkQuery/filters-and-4              200000              6221 ns/op            3765 B/op         25 allocs/op
BenchmarkQuery/order-4                    200000              5839 ns/op            3509 B/op         25 allocs/op
BenchmarkQuery/orderdesc-4                200000              5852 ns/op            3573 B/op         25 allocs/op

These results are only of the lexer and parser, some additional time, used memory and allocations should be required to create the gql.Result who is obtained from the current parser.

Summary

About the antlr code generator

If the project is looking performance, I don’t believe that using the antlr4 generator is a good approach (at least for the go code, I don’t know the performance for other languages), you can use other generator like yacc (included in go tool), the peg generator or another kind of generator. I didn’t test yacc, but if it’s needed a complete comparison, some additional time is required to test it.

About the antlr grammar

Probably, some additional effort will be required to convert antlr4 (from graphql) or ebnf grammar (both used as spec on open cypher) to another more efficient to use on Dgraph.

matthiasg · January 30, 2017, 7:29am

Have you looked at using an Earley parser like GitHub - vnmakarov/yaep: Yet Another Earley Parser ? They are usually quite fast and with low overhead.

mrjn · January 30, 2017, 7:31am

Hi @Motakjuq,

We gave up on ANTLR, it was too slow. Instead, we refactored our code, and it’s working for us.

Motakjuq · January 30, 2017, 9:17am

Hi @matthiasg, I looked that project but its target is C code. I’ve gotten good results with peg, but I do not think I’m going to continue with the grammar because Manish said they’re going to continue as they’ve been working it.

Topic		Replies	Views
Building a Native GraphQL Database: Challenges, Learnings and Future - Dgraph Blog Blog	15	1717	April 26, 2020
Making Dgraph a truly GraphQL-native database - Dgraph Blog Blog	1	678	September 9, 2021
Who’s new in June 2020 Users	34	3765	June 27, 2020
A History Lesson and Challenge for Dgraph Users, Developers, and Management team Users	8	1454	December 19, 2023
[Doc] JSON DQL mutation for GraphQL Schema requires clarification Dgraph Cloud json , area:documentation , dql	21	1845	March 30, 2023

ANTLR for parsing languages and specs

Related topics