Enabling Encryption on an Unencrypted Alpha

The current implementation of encryption in Badger doesn’t allow enabling encryption on an unencrypted Alpha. This topic proposes ways of enabling encryption on an unencrypted Alpha.

Background

The encryption in Dgraph is supported via Badger. Badger stores information about encryption in the key registry file and the manifest file. When creating a key registry file, we store a header in the file which denotes whether this Badger directory is encrypted or unencrypted.

This header is used to verify the encryption key when the DB is reopened. When an unencrypted directory is opened with an encryption key, this check fails and we return an error.
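To make the idea concrete, here is a simplified model of that header check. This is a sketch of the concept, not Badger’s actual key registry code; the sanity text, function names, and the AES-CTR choice are assumptions made for the demo.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// sanityText models the known plaintext stored in the key registry
// header so the encryption key can be validated on open.
var sanityText = []byte("Hello Badger")

// writeHeader models creating the registry header. With a key, the
// sanity text is stored encrypted; without one, it is stored as-is.
func writeHeader(key, iv []byte) []byte {
	if key == nil {
		return append([]byte{}, sanityText...)
	}
	block, _ := aes.NewCipher(key)
	out := make([]byte, len(sanityText))
	cipher.NewCTR(block, iv).XORKeyStream(out, sanityText)
	return out
}

// checkKey models the verification done when the DB is reopened:
// decrypt the header with the supplied key and compare.
func checkKey(header, key, iv []byte) error {
	got := append([]byte{}, header...)
	if key != nil {
		block, _ := aes.NewCipher(key)
		cipher.NewCTR(block, iv).XORKeyStream(got, header)
	}
	if !bytes.Equal(got, sanityText) {
		return fmt.Errorf("encryption key mismatch")
	}
	return nil
}

func main() {
	iv := make([]byte, 16)
	key := []byte("16-byte-demo-key")

	// Opening an unencrypted registry with a key fails the check.
	plainHeader := writeHeader(nil, iv)
	fmt.Println(checkKey(plainHeader, key, iv)) // encryption key mismatch

	// Opening an encrypted registry with the right key succeeds.
	encHeader := writeHeader(key, iv)
	fmt.Println(checkKey(encHeader, key, iv)) // <nil>
}
```

The mismatch case is exactly the error described above: an unencrypted directory opened with an encryption key fails the sanity comparison.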

Problem with Enabling Encryption on Existing Data

Badger writes are append-only. We never modify a file once it is written to disk. If we were to support enabling encryption on an unencrypted data directory, only the new data would be encrypted and the existing data would remain stored in plain text. This is a serious problem: when someone enables encryption, they wouldn’t want half the data to be encrypted and half unencrypted. I propose we do not allow users to enable encryption on existing data. The old data might get garbage collected/compacted and rewritten in encrypted format, but there is a possibility that this might not happen for a long time, or ever.

How Does Someone Enable Encryption with Existing Data?

I propose we allow enabling encryption on an Alpha with existing data in two ways:

  1. Backup and Restore: They take a backup of the unencrypted data in Dgraph and restore the backup with encryption enabled. This is similar to how ArangoDB allows enabling encryption: https://www.arangodb.com/docs/stable/security-encryption.html#limitations . Currently, restore supports only an encrypted backup to an encrypted p dir; it doesn’t support an unencrypted backup to an encrypted p dir, but this can be added via a flag.
  2. Export and Bulk/Live loader: They can also enable encryption by exporting and re-importing the data. The data can be exported from an already running Alpha and then imported by either the bulk or live loader. The new Alpha has to be started with the --encryption_key xxx flag. The bulk command currently doesn’t support loading an unencrypted export into an encrypted p dir (@Paras can correct me if I’m wrong), but this can also be added with the help of command-line flags.
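To illustrate the first path, here is a toy model of “plaintext backup, encrypted restore”. This is not the Dgraph/Badger implementation; the store types, function names, and AES-GCM usage are invented for illustration. The point is that the backup stream itself is plain key/value pairs, and the restoring store encrypts everything as it writes, so all data ends up encrypted at rest.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

type kv struct{ Key, Value []byte }

// backup models exporting an unencrypted store as a plaintext stream.
func backup(store map[string][]byte) []kv {
	var stream []kv
	for k, v := range store {
		stream = append(stream, kv{[]byte(k), append([]byte{}, v...)})
	}
	return stream
}

// encryptedStore models a store restored with encryption enabled:
// it holds only ciphertexts on "disk".
type encryptedStore struct {
	gcm  cipher.AEAD
	data map[string][]byte
}

func newEncryptedStore(key []byte) (*encryptedStore, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return &encryptedStore{gcm: gcm, data: map[string][]byte{}}, nil
}

// restore ingests the plaintext backup stream, encrypting every value.
func (s *encryptedStore) restore(stream []kv) {
	nonce := make([]byte, s.gcm.NonceSize()) // fixed nonce: demo only
	for _, e := range stream {
		s.data[string(e.Key)] = s.gcm.Seal(nil, nonce, e.Value, nil)
	}
}

func (s *encryptedStore) get(key string) ([]byte, error) {
	nonce := make([]byte, s.gcm.NonceSize())
	return s.gcm.Open(nil, nonce, s.data[key], nil)
}

func main() {
	plain := map[string][]byte{"name": []byte("badger")}
	enc, _ := newEncryptedStore([]byte("16-byte-demo-key"))
	enc.restore(backup(plain))
	v, _ := enc.get("name")
	fmt.Printf("%s\n", v) // badger
}
```

In real Badger terms, this corresponds roughly to backing up an unencrypted DB and loading the stream into a DB opened with an encryption key, which is why only a flag needs to be added rather than new machinery.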

I had a discussion with @Paras about this and we think allowing support for encryption via backup/restore and bulk/live import is the best way.

@mrjn @dmai @santo what do you think?

Agreed with the analysis.

Doesn’t this contradict what’s in the “Encryption at Rest in Dgraph and Badger” blog post? I had filed issue #5336 quoting this:

If you have an existing Badger datastore that is not encrypted, enabling encryption on it will not immediately encrypt your existing data all at once. Instead, only new files are encrypted, which will happen as new data is added. As older data gets compacted and newer files generated, those would also get encrypted over time. Badger can run in this hybrid mode easily, as each SSTable and value log file stores the information about the data key used for encryption.

I’m OK with only allowing encryption for new Dgraph Alphas, but we should be clear that what’s written on the blog doesn’t work and update the post/docs accordingly.

Does this affect Badger too? Can Badger start encrypting an existing unencrypted db?

Yes, it contradicts. I don’t know why we wrote that in the blog post. We never supported encryption on an unencrypted data directory. Is it possible to modify the blog post?

No, badger doesn’t support encrypting unencrypted data. It has been this way since we added the feature in Badger. We never supported encrypting unencrypted data.

This is news to me! My understanding was that you turn on encryption on a plain-text Badger, and over time it would encrypt the data. How come a bunch of us had that understanding and wrote about it in the blog post, when you, @ibrahim, think otherwise? What went wrong here?

I think, and correct me if I am wrong @ibrahim , that was a use-case but never got implemented in Badger when encryption was introduced.

How difficult would it be to get things to work as per the blog post?

I surmise it is not difficult and eventually everything will be encrypted. But as @ibrahim mentioned, quoting below, doing this would give a false sense of security to users. When they turn on encryption, they’d expect everything to be encrypted right away.

Why? This is how FileVault (Apple’s encryption) works. New files are encrypted, then old files are slowly encrypted in the background. https://support.apple.com/en-in/HT204837

I also thought that it was possible, but I found out a few days ago that it is not. I confirmed this with @balaji (who worked on encryption in Badger and on the blog post), and it turns out that we don’t support encrypting an unencrypted Badger. We don’t have any tests for it; all of us assumed it works. I had even reviewed the encryption blog post but didn’t realize the issue with it. Maybe I should’ve manually tested the feature before accepting that it works.

Yes, the original idea was to support it and we thought the current code does that. I found this issue when I was working on an encryption-related ticket.

Not difficult. The key registry file has to be recreated with the new header. I just created a new branch with the fix: https://github.com/dgraph-io/badger/tree/ibrahim/encryption-keyregistry . I can clean up the code tomorrow and send a PR.

Do we have an estimate on how long it typically takes based on the amount of data (and cpu and ram)? How would this impact a current instance in terms of performance, mutation and query requests?

FileVault begins encrypting in the background, but it starts immediately and finishes within a specified period of time.

This is not what Badger will do. It will encrypt data only when it is GC’ed/compacted, and some data may never get encrypted at all. Quoting Ibrahim:

We could do a db.Flatten to do the compaction (and hence encryption) of all the tables at the higher levels, and then run some compaction in the background for the lowest level.

The value log would need to be rewritten – but that’s doable too. We have the mechanisms to do all these, just have to plug them to the right switches.

It would be convenient for our users who have a lot of data. But you could also argue that switching to encryption is a one-time decision, and having to do a backup and restore isn’t too much to ask if they do decide to switch to encryption.

So, we should weigh the code complexity of forcing a rewrite of every table and value log in Badger, over the benefit of turning on encryption over an existing plain-text database.

@Paras yes, I understand this won’t be exactly how file vault works. All I meant is the following:

  • there is precedent for turning on encryption and having it slowly work in the background. This would of course require new code; I know this is not how it works today.
  • if you could provide some sort of API that gives a rough estimate of how much encryption has finished, that would be good for users (I don’t think anyone needs great accuracy).
  • from my understanding of Badger, files are immutable (or append-only) and Badger is a single process. So I’m guessing that files which have to be encrypted aren’t being updated, so rewriting them should be easy (and you can use OS-level APIs to write to a temporary file and then swap it out).
  • further, since encryption itself is side-effect free, if a file that is in the process of being encrypted is about to be compacted or something like that, you can just throw away the encryption work and start over.

I’m aware I’m probably oversimplifying the architecture, most notably the part which records which files are encrypted and which are not. But I still feel it is possible.

@vvbalaji we cannot enable encryption on a running Alpha. The Alpha has to be restarted with the encryption key. The changes in https://github.com/dgraph-io/badger/tree/ibrahim/encryption-keyregistry will read the key registry file and recreate it. The CPU/memory usage won’t be affected; this is similar to deleting a file and recreating it. Also, until Badger starts, Dgraph cannot serve queries/mutations. The overhead of this change is negligible.

I am not in favour of rewriting the value log just to enable encryption. When a value log (.vlog) file is rewritten, we insert more keys into the system. If the existing database had X keys, after the value log rewrite we will have 2X keys in the system: each key foo will have a corresponding !badger!move!foo key (that’s how Badger vlog GC currently works). The move keys are unnecessary overhead for Badger; the system will have more compactions and lookups will be slower. Also, the move keys don’t get removed from the system easily (this is a known issue).
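A toy model of that overhead, purely illustrative and not Badger’s implementation: every key whose value moves during a vlog rewrite gains a companion move key, so a store with X keys ends up tracking 2X keys.

```go
package main

import "fmt"

// moveKeysAfterRewrite models a full vlog rewrite: each original key
// gets a companion !badger!move!<key> entry tracking the new value
// location, doubling the number of keys the LSM tree must hold.
func moveKeysAfterRewrite(keys []string) []string {
	out := append([]string{}, keys...)
	for _, k := range keys {
		out = append(out, "!badger!move!"+k)
	}
	return out
}

func main() {
	keys := []string{"foo", "bar", "baz"}
	after := moveKeysAfterRewrite(keys)
	fmt.Println(len(keys), len(after)) // 3 6
}
```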

You could try to avoid !badger!move! keys by reading and reinserting the same data into Badger, but then this would be similar to what backup/restore does.

In short, enabling encryption via a value log rewrite will be expensive in terms of CPU, might insert more data into the database, and will be slow.

Backup/restore is much faster compared to a table/vlog rewrite, and we don’t have to write any new code (except adding a new flag). The existing Badger backup/restore already does this.

@ibrahim Thanks for your comments.

I am assuming backup/restore would take the DB offline for mutations until the operation is completed. From your description, a value log rewrite appears to have a similar effect on mutations (maybe the impact could be limited to the predicates being rewritten at a given time). If that is the case, then backup/restore would be the simpler user experience for adding encryption.

For Dgraph, the backup can be done while serving queries/mutations (a backup is triggered via a curl request to an Alpha). Backups are done on a snapshot: Dgraph will be able to accept new mutations, but they won’t be added to the backup. So ideally, the backup should be taken when there are no mutations running, so that all the data is included in the backup (@dmai is this the correct understanding of how backups work?)

Backups, like exports, are taken with a particular snapshot of the data based on a timestamp. The user doesn’t need to disable queries/mutations to do a backup.

All the committed data up to that timestamp would be included in the backup. There’s no real need to pause mutations.

I’d be fine with this. Let’s just document it well then.