RFC: Block users for M minutes after N failed logins

Summary

Block a user for M minutes if there are N successive failed login attempts. I propose M = 15 and N = 3.

Motivation & Benefits

This is useful from a security point of view: it makes Dgraph resilient against brute-force password attacks. This feature has been requested by multiple clients, including Phillips.

Design Proposal

I have 2 ideas for this task.

Idea 1 :

  • We can store two more predicates in the user type: <dgraph.failed_login_counter> int . and <dgraph.failed_login_timestamp> datetime . On every login request, we check whether <dgraph.failed_login_counter> is less than N (= 3). If it is, the user is not blocked and we go ahead with authentication. Otherwise, we check whether M (= 15) minutes have passed since <dgraph.failed_login_timestamp>. If they have, we set <dgraph.failed_login_counter> to 0 and proceed with authentication. Otherwise, the authentication request is rejected.
  • Every time a login succeeds, we set the counter to 0. On an unsuccessful login attempt, we increment the counter and check whether it has reached N (= 3). If it has, we store the timestamp of the login request.
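To make the flow above concrete, here is a minimal Go sketch of the check-then-update logic for a single user. It is only an illustration under assumptions: the names loginState, allowAttempt and recordResult are hypothetical, the struct stands in for the two proposed predicates rather than reading them from Dgraph, and the block is assumed to start once the counter reaches N, in line with the Summary.

```go
package main

import (
	"fmt"
	"time"
)

// Values proposed in the RFC: N = 3 attempts, M = 15 minutes.
const (
	maxFailedAttempts = 3
	blockDuration     = 15 * time.Minute
)

// loginState is a hypothetical in-process stand-in for the two predicates.
type loginState struct {
	FailedCounter   int       // dgraph.failed_login_counter
	FailedTimestamp time.Time // dgraph.failed_login_timestamp
}

// allowAttempt reports whether authentication may proceed at time now.
// It resets the counter once the block window has elapsed.
func (s *loginState) allowAttempt(now time.Time) bool {
	if s.FailedCounter < maxFailedAttempts {
		return true
	}
	if now.Sub(s.FailedTimestamp) >= blockDuration {
		s.FailedCounter = 0
		return true
	}
	return false
}

// recordResult updates the state after the password check has run.
// Assumption: the timestamp is stored once the counter reaches N.
func (s *loginState) recordResult(success bool, now time.Time) {
	if success {
		s.FailedCounter = 0
		return
	}
	s.FailedCounter++
	if s.FailedCounter >= maxFailedAttempts {
		s.FailedTimestamp = now
	}
}

func main() {
	var s loginState
	now := time.Date(2021, 1, 1, 12, 0, 0, 0, time.UTC)
	for i := 0; i < maxFailedAttempts; i++ {
		if s.allowAttempt(now) {
			s.recordResult(false, now)
		}
	}
	fmt.Println(s.allowAttempt(now.Add(5 * time.Minute)))  // false: still blocked
	fmt.Println(s.allowAttempt(now.Add(15 * time.Minute))) // true: window elapsed
}
```

In a real implementation the read and the update would presumably have to happen in one transaction so that concurrent login attempts cannot skip increments.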

Idea 2 :

  • We store the counter and timestamp in memory and follow the same logic as above.
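A rough sketch of what the in-memory variant could look like, assuming a mutex-protected map keyed by user ID inside each Alpha process. The memTracker type and its methods are hypothetical names for illustration and not existing Dgraph code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Values proposed in the RFC: N = 3 attempts, M = 15 minutes.
const (
	maxFailedAttempts = 3
	blockDuration     = 15 * time.Minute
)

// entry holds the counter and timestamp for one user.
type entry struct {
	failed    int
	blockedAt time.Time
}

// memTracker is a hypothetical per-process tracker keyed by user ID.
type memTracker struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newMemTracker() *memTracker {
	return &memTracker{entries: make(map[string]*entry)}
}

// Allowed reports whether userID may attempt a login at time now.
func (t *memTracker) Allowed(userID string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.entries[userID]
	if !ok || e.failed < maxFailedAttempts {
		return true
	}
	if now.Sub(e.blockedAt) >= blockDuration {
		delete(t.entries, userID) // block window elapsed, forget the user
		return true
	}
	return false
}

// Record stores the outcome of a login attempt.
func (t *memTracker) Record(userID string, success bool, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if success {
		delete(t.entries, userID)
		return
	}
	e, ok := t.entries[userID]
	if !ok {
		e = &entry{}
		t.entries[userID] = e
	}
	e.failed++
	if e.failed >= maxFailedAttempts {
		e.blockedAt = now
	}
}

func main() {
	tr := newMemTracker()
	now := time.Now()
	for i := 0; i < maxFailedAttempts; i++ {
		tr.Record("alice", false, now)
	}
	fmt.Println(tr.Allowed("alice", now)) // false
	fmt.Println(tr.Allowed("bob", now))   // true
}
```

Note that this state lives only in the process that handled the request, so it is lost on restart and is not shared between Alphas.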

Comparison:

  • Idea 1 can potentially slow down Dgraph, as we will have to run a query on every login attempt, whereas this is not the case with Idea 2.
  • Idea 2 has a loophole: each Alpha keeps its own counter value. If the cluster is large, an attacker can keep routing requests to different Alphas to bypass the blocking mechanism.
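To put a number on the Idea 2 loophole, the following small Go simulation counts how many password guesses get evaluated when an attacker rotates requests across Alphas that each keep an independent counter. The function name attemptsBeforeLockout and the cluster size of 6 are made up for illustration.

```go
package main

import "fmt"

// attemptsBeforeLockout simulates an attacker rotating failed logins across
// Alphas that each keep an independent in-memory counter with limit n.
// It returns how many guesses are evaluated before every Alpha blocks the user.
func attemptsBeforeLockout(alphas, n int) int {
	counters := make([]int, alphas)
	attempts := 0
	for {
		progressed := false
		for i := range counters {
			if counters[i] < n {
				counters[i]++ // this Alpha still accepts an attempt
				attempts++
				progressed = true
			}
		}
		if !progressed {
			return attempts
		}
	}
}

func main() {
	fmt.Println(attemptsBeforeLockout(1, 3)) // 3
	fmt.Println(attemptsBeforeLockout(6, 3)) // 18
}
```

So with per-Alpha counters the effective limit per block window grows to roughly N times the number of Alphas the attacker can reach.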

I feel Idea 1 is the idiomatic way to do this task.

Tests

We need to write tests for unsuccessful login attempts.

Compatibility

Both ideas will be compatible with existing systems.

cc: @Paras @pawan @mrjn

How do you identify a non-logged-in user? I assume you would be using the IP address of the user? You would need to store that too.

Also, this feature should be optional: when it is disabled, we don’t store or verify the values of these predicates. Only if it is enabled should we track login requests and block them after a certain time.

If a login request comes without a RefreshToken, we would assume it is from a non-logged-in user.

I agree that Idea 1 is the better approach, since the user is blocked cluster-wide. I wouldn’t worry about performance, as login is an infrequent event, so we should be good.

So if I understand this correctly, we would implement something like a fail2ban-type service inside Dgraph for Zero and Alpha. Is this something that other traditional or distributed databases normally implement? Or even web servers? Or is this something that the operator implements and puts in front of the web service or database service?

Normally, if Dgraph were put on the edge (something I would not normally recommend) instead of in a private backend subnet, you would expose it through a load balancer endpoint and park a WAF in front of it to get similar functionality.

The Alpha1 leader can track the <user hash> -> <login attempts with timestamps> mapping, and all the other Alphas can subscribe to it, so they get updates of the login attempts. That way, they stay pretty up-to-date.

The big question is: Should Alpha1 leader do it, or should Zero leaders do this? My pref would be to do this on the Zero leader. Alphas already have open connections to Zero leader, and we can open a new endpoint in Zero to update it with login attempts and get latest updates back (don’t make it part of membership state).

At the same time, we should also do some research and figure out how other DBs approach this problem.