Database Design for GRU's Admin Panel

Hello everyone,

I have created a rough ER for the database: ER-Diagram Link (Open in draw.io for better view)

Need to clarify these tables:

tags: This table is used to store the new tags.
question_tag : Table to associate question with tags.
quiz_info: Admin will first add the quiz info (name, duration), then the available questions can be added into the “quiz” table. “quiz_info” will store the uniquely created quiz, while “quiz” table will store questions associated with particular quiz.

if we won’t add tags to question then the tag related tables can be removed.

So, can’t tags be stored in the same table as questions?
These are a lot of tables. I feel this set up is actually ideally suited for Dgraph :blush:. We should consider using it instead.

You want to store the tags in a separate table if you want to query questions by tags. If they are merely informational/for statistical purposes then they can go in the same table as questions.

This is actually a very small number of tables when compared to a normal DB schema :wink:

I’d recommend getting rid of the tags table and only keep the question_tags table. It will make the queries simpler at a very slight risk of data corruption.

The more I look at this diagram, the more I feel Dgraph is the right DB here. Just the number of relationships involved here – that’s the whole point of Dgraph is to facilitate such relationships – and avoid the explosion of foreign ids. Plus, this would make an excellent use case for us to get real world experience with Dgraph.

Let me draw up a diagram to see how this might look in Dgraph.


Update: This is how think a graph schema might look like.

# Some basic question properties first.
<question-id> <markdown> "content of question in markdown format" .
<question-id> <score> "positive and negative score" .  # Can be put in one field, or split into two.

# Question tag relationship
<question-id> <question.tag> <tag-id> .  # Tag Forward
<tag-id> <tag.question> <question-id> .  # Tag Reverse

# Question answer relationship
<question-id> <question.option> <answer-id> .
<question-id> <question.correct> <answer-id> . # This would be repeated, i.e. the correct answer-id would have two edges inwards.

<answer-id> <option.question> <question-id> .  # Option Reverse

# Answer properties.
<answer-id> <markdown> "content of answer in markdown format" .

# Candidate properties.
<candidate-id> <name> "Candidate name" .
<candidate-id> <email> "candidate email" .

# Candidate relationships.
<candidate-id> <candidate.question> <question-id> .  # Forward
<question-id> <question.candidate> <candidate-id> .  # Reverse. Useful to figure out how many candidates attempted this question.
<candidate-id> <candidate.answer> <answer-id> .
<answer-id> <answer.candidate> <candidate-id> .

# 5 second pings.
<candidate-id> <ping> "some string" .  # This would be stored along with timestamp. So, we can retrieve all the values for the ping sorted by ts. This feature needs to be implemented.

# TODO: Tokens.

Using this schema, these are the kinds of questions we can ask:

# Get questions corresponding to a tag.
query {
	tag(_xid_: tag1) {
		tag.question
	}
}

# This query will give you everything you need to render a candidate report (except ping logs).
query {
	candidate(_xid_: candidateid) {
		candidate.answer     # This would return a list of all answers responded by candidate.
		candidate.question { # Returns list of all questions asked to candidate.
			markdown           # The question content.
			score              # Question score.
			question.tag       # All the tags corresponding to the question.
			question.option {  # The answer options corresponding to the question and it's markdown.
				markdown
			}
			question.correct   # The right answer for the question. This can be matched against candidate.answer list.
			ping               # All ping logs sorted by timestamp.
		}
	}
}

# How many candidates have been asked this question and how many have responded to different options.
query {
	question(_xid_: questionid) {
		question.candidate {
			_count_            # Number of candidates who were asked this question.
		}
		question.option {    # Answer options for the question.
			answer.candidate {
				_count_          # Number of candidates who answerd this option.
			}
		}
	}
}
1 Like

@kostub - If we only keep the question_tags table, then performance of getting information of tags might affect. “question_tags” table will grow much more compare to “tags” table, so finding all unique tags from “question_tags” table will take more time than “tags” table. Like if there are 1000+ question and only 10 unique tags then querying “tags” table with 10 entry will be faster than “question_tags” table with 1000+ entries.

If tags are few and won’t grow much or it is just for meta information, then we can remove these tables. I created tags table for keeping in mind that admin can add new tags from UI. Then when creating new question, admin can select tags from all available tags.

Just like Stackoverflow.com, when user creates a question and types for tags, UI displays the predefined available tags.

As for using Dgraph, need time to explore it.

@Rahul-Sagore It’s a trade-off. Without knowing what sort of queries you anticipate and the frequency at which they occur it is difficult to recommend a design. Here are my assumptions (which may be wrong):

  1. The queries for adding questions will be significantly outnumbered by the queries for reading questions.
  2. The number of unique tags is small (< 1000)
  3. You rarely need to query all unique tags and this can be efficiently cached on the client side.
  4. You don’t need a canonical source of all tags for transactional purposes.

I’d also recommend getting rid of the options table. You never need to query that table independently. Instead store the options as an array (or json if the database doesn’t support arrays) in the questions table. You never need to query options directly for any purpose and it just adds an unnecessary join.

Of course none of this matters if using dgraph. However note that using NoSQL databases adds plenty of complications which may not be worth it for small systems.

1 Like

@kostub I am planning to use go-sql-driver/mysql and I am just normalising the database tables because putting everything in one table is not scalable for the future work.

I created different table for “options” because when I read GRU’s wiki page, I saw question has options and each option has unique id. I did not want to add 8 fields of options (option_id, option_text) in question table.

I liked your suggestion of storing options as json/array, that way there will be only one field in question table, that is good solution.

Hello Everyone,

The above database design is made for the relational database and according to normalization we have to removed the duplicate entry when it occurs in the particular table and split it into different tables.

A “question” can have one or more tags and a particular tag can belong to multiple question, so if we add the tag in the question table it will increase the duplicate entries in the database. It is not a good idea to save the duplicate data into the same table because if you want to fetch the unique tags at any point for any feature, it will be a complex task (lots of computation will happen).

As per the discussion, option table can be merged into question table, while quiz_info table can be merged into quiz table, as we don’t need their data independently. But tags are very crucial in test-prep system, in future if we wish to implement feature related to tags then querying question table to find the tags would be expensive.

We can still put tags in question table, then we can use caching system (memcache/redis), but if somehow in-memory data gets lost we need to get tags data again from the question table, which will be not efficient if there are large number of question rows, compared to master tag table (If have).

I have updated the design of ER Diagram. Please share your thoughts as database design is the crucial part for scalable system to avoid potential problem later on.

That’s exactly the kind of thinking that Dgraph is supposed to fix. That it’s easy to run a Graph (/NoSQL) database, and it’s easy to talk to it. If we ourselves don’t test that Dgraph is easy to run, we can’t really expect our users to trust us. Also, we should eat our own dog food!

I think the main advantage of MySQL would have been if we could just use a couple of tables, and everything was very flat. But, that’s not the case. My hunch is that any use case which requires more than 3-4 tables is a great case for Dgraph. Also, we’re able to do pretty complicated queries pretty easily. Each of the queries I mentioned above would require a lot of queries to MySQL, which we’re able to do in a single query to Dgraph.

@Rahul-Sagore, this might feel new to you. But, you have the best support team available at your hands to guide you through implementation in Dgraph. Nobody else is better qualified to tell you about Dgraph than the @core-devs here :smiley:.

3 Likes

It would serve as a great use case and help us figure out potential shortcomings in the present implementation of Dgraph, hence I think we should try to use it. Even though it might be easier to use the stable SQL databases, we should give Dgraph a try here.

2 Likes

Fair enough. We can use Dgraph also, to test its potential. I’ll explore Dgraph on my own, will discuss if anything comes up.

Another thing: Do we have designs for the UI? Or should I take care of it since we are working on admin panel. If we have designs then it will help a lot.

1 Like

Hi, I have checked Dgraph, read almost everything - graph databases, graphQL, dgraph wiki. Then set up the dgraph and tried example from “Query and Mutation”. It’s really great, like writing query for getting deisred data is simple and intuitive.

Few doubts/questions (Im really new to this):

  • As far as I can understand, set query method in mutation, creates new entity/subject or update the existing one. Right?

  • @mrjn - You have created graph schema for question in above thread, and you have written something like -

    <question-id>

So here id will be dynamic and unique, like <question-1>, <question-2>, <question-3> … and so on. Right?

  • If that is the case, then while running set method from Go code, I have to dynamically generate and add the id like :
    // Running a mutation.
    id = generateQuestionID()
    q := "mutation { set { <question-"+ id +"> <markdown> "+ content +" ."}}"
  • I have tried dgraphgoclient, so the same setup I have to do for Gru server, like adding dependency for grpc, net/content/, dgraphgoclient/graph through govendor and then importing them into my code.

@pawan - The FreeBase Data link on Benchmark Readme is dead. Google discontinued the free base API from 31st Aug 2016. There is this developers.google.com/freebase/ page which has data but that is 30GB in size, so is that the data that you used to query for film/actors. I need to run the film queries, let me know about the data source.

@mrjn - wiki.dgraph.io/Clients implementation for Go language has this code, and I guess the parameter is wrong, ip should be string. I was getting error.

    conn, err := grpc.Dial(127.0.0.1:8081, grpc.WithInsecure())

So far we have been using external ids to represent entities, or having the user specify the UIDs themselves:

We’ll have to modify our code to allow entities to have UIDs automatically. Feel free to file a bug and assign it to me. I’ll get that resolved.

Yeah, the id is unique. It could be filled up manually, like <question-about-stars>, or can be automatically generated. The above bug should allow the latter.

If we generate the id, you’d be providing the entity in terms of _uid_, like the github link I mention above. In fact, have a look at the examples in that test file – it would show what’s valid RDF and what’s not.

@pawan is the author of the Go client. I think the client needs major work to help formulate queries. This would be a good opportunity for us to tidy up the client and make it easy to use, so you don’t have to generate string RDFs (which then have to be parsed by the server, and hence is wasted effort on both sides). We can instead use proto versions of RDFs, which would be easy to generate and parse.

Again, @pawan is the right person to talk about the client. @pawan – think of ways by which we can make the client easier to use.

@Rahul-Sagore Thanks for your feedback about the Go client. I think the client will evolve a lot with your usage of it. I have modified the error(ip should be string) you mentioned in the Wiki. In the future, feel free to make modifications to the Wiki.

I think I will spend a couple of days to make it easier for the Client to generate queries and mutations.

2 Likes

@pawan Thanks, I didn’t know that I can edit the wiki.

@mrjn - I have created an issue, dgraph-io/dgraph/issues/193, unable to find option to assign the task to you, probably permission problem.

1 Like

Hi, this is the modified schema given by @mrjn for dgraph.

``

# Question properties
<question-1> <markdown> "content of question in markdown format" .
<question-1> <positive-score> 5 .
<question-1> <negative-score> 2.5 .
<question-1> <has-multple-answers> boolean .
 
# Tag Properties
<tag-id> <name> "String"

# Question <> Tag Relationship
<question-1> <question.tag> <tag-id> .
<tag-id> <tag.question> <question-1> .

# Option properties
<option-id> <markdown> "content of answer in markdown format"  .

# Question <> Option Relationship
<question-1> <question.option> <option-id> .
<question-1> <question.correct> <option-id> .
<option-id> <option.question> <question-1> .

# Quiz Properties
<quiz-id> <name> "name" .
<quiz-id> <duration> "Time String" .

# Quiz <> Question Relationship
<quiz-id> <quiz.question> <question-1> .
<question-1> <question.quiz> <quiz-id> .

# Candidate Properties
<candidate-1> <name> "Candidate name" .
<candidate-1> <email> "candidate email" .
<candidate-1> <password> "candidate password".
<candidate-1> <authentication_token> "candidate authentication token".

<candidate-1> <ping> "some string" .

#Need to re-check below Schema

#Candidate <> Question Relationship
<candidate-1> <candidate.question> <question-1> .
<question-1> <question.candidate> <candidate-id> .

#Candidate <> Quiz Relationship
<candidate-1> <candidate.quiz> <quiz-id>
<quiz-id> <quiz.candidate> <candidate-1>

#Candidate <> Option Relationship
<candidate-1> <candidate.answer> <option-id> .
<option-id> <option.candidate> <candidate-id> .

``

Hey @Rahul-Sagore, I think a good way to look at this would be whether you can perform all the queries that you would need for the Admin panel/ Report Generation/ Candidate and expand/change the schema according to that.

2 Likes

You probably don’t need that. You can get that via:
{ q(_uid_: question-1) { question.option { _count_ } } }

You probably don’t need the edge in this direction.

Candidate password isn’t required. The auth token is the password. You do need an expiry time for the auth token.

Also, regarding candidate options, you should probably just call them candidate.answer.

Okay, cool. Thanks for the feedback. Will change it.

Btw, @akhiltak, Gru would make a great use case for your GraphQL type PR.