Objective
The main objective of this RFC is to introduce Data Masking as a native feature in Dgraph and integrate it into the Access Control List (ACL). Data Masking is a data protection technique that aims to prevent unauthorized access to sensitive information by masking or obscuring it. This feature is important in regions with strict data protection laws, such as Brazil and some countries in Europe, to prevent database administrators or malicious developers from leaking sensitive data. By implementing Data Masking in Dgraph, we can provide our users with an additional layer of security for their data.
Summary
Data masking is a technique used to protect sensitive data from unauthorized access or exposure. It involves replacing or obscuring sensitive data with fictitious (but realistic) data that has the same format and characteristics but does not reveal the original values. The result is known as “masked data”. Masking can be used to prevent data breaches or to allow sensitive data to be handled safely during development or testing.
One of the main reasons for using data masking is to comply with data privacy laws and regulations, such as the General Data Protection Regulation (GDPR) in Europe and the Lei Geral de Proteção de Dados (LGPD) in Brazil. These laws require companies to implement measures to protect the personal data of their customers and employees from unauthorized access or exposure.
Data masking can be applied to various types of data, such as personal identification numbers, financial information, health records, and more. It can be used in a variety of scenarios, including testing and development, data migration, and data analytics.
To illustrate the use of data masking, consider a company that is storing customer data in a database. This data includes sensitive information such as names, addresses, and credit card numbers. If the company wants to share this data with a third-party vendor for testing or development purposes, it can use data masking to obscure the sensitive data and protect it from being exposed.
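The customer-data scenario above can be sketched in Go as follows. This is a minimal illustration, not a proposed Dgraph API: the `Customer` type and the `maskKeepLast`, `maskWords`, and `MaskCustomer` helpers are hypothetical names introduced here. The sketch uses simple character masking that preserves the layout of each value, which is a simplification of replacing values with realistic fictitious data.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Customer is a hypothetical record shape used only for this illustration.
type Customer struct {
	Name       string
	Address    string
	CardNumber string
}

// maskKeepLast replaces every letter or digit with '*' except the last
// `keep` of them. Separators such as spaces and dashes are left in place,
// so the masked value keeps the layout of the original.
func maskKeepLast(s string, keep int) string {
	runes := []rune(s)
	total := 0
	for _, r := range runes {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			total++
		}
	}
	seen := 0
	for i, r := range runes {
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			continue
		}
		seen++
		if seen <= total-keep {
			runes[i] = '*'
		}
	}
	return string(runes)
}

// maskWords keeps the first character of each word and masks the rest.
func maskWords(s string) string {
	words := strings.Fields(s)
	for i, w := range words {
		r := []rune(w)
		words[i] = string(r[0]) + strings.Repeat("*", len(r)-1)
	}
	return strings.Join(words, " ")
}

// MaskCustomer returns a masked copy; the input record is not modified.
func MaskCustomer(c Customer) Customer {
	return Customer{
		Name:       maskWords(c.Name),
		Address:    maskKeepLast(c.Address, 0),
		CardNumber: maskKeepLast(c.CardNumber, 4),
	}
}

func main() {
	c := Customer{
		Name:       "Ada Lovelace",
		Address:    "12 Example St",
		CardNumber: "4111-1111-1111-1111",
	}
	fmt.Printf("%+v\n", MaskCustomer(c))
}
```

Keeping the last four digits of the card number is a common convention that lets support staff confirm a card with a customer without seeing the full number. Whether that is acceptable depends on the applicable regulation.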
In summary, data masking is a valuable tool for protecting sensitive data and complying with data privacy laws. It helps companies to prevent unauthorized access or exposure of sensitive data, while still being able to use it for testing, development, and other purposes.
In practice
There are various methods for implementing data masking, including:
- Randomization: This involves replacing sensitive data with randomly generated data that preserves the characteristics of the original data, such as data type and length.
- Pseudonymization: This involves replacing sensitive data with a pseudonym (a fake name or identifier) that is not directly linked to the original data, but can still be used to identify an individual if necessary. The pseudonym should generally stay the same for a given individual, although it may be rotated over time.
- Substitution: This involves replacing sensitive data with fake data that follows the same format and structure as the original data. For example, replacing a person’s name with a randomly generated name.
- Data obfuscation: This involves altering sensitive data in a way that makes it difficult to understand or interpret, while still maintaining some level of accuracy or relevance.
- Data perturbation: This involves slightly altering sensitive data in a controlled way (for example, by adding a small amount of noise to numeric values), so that it is still useful for testing or analysis purposes but does not reveal the exact original values.
- Data suppression: This involves completely removing sensitive data from a dataset, either by deleting it or replacing it with a placeholder value.
- Data subsetting: This involves keeping only a subset of the data and removing the sensitive elements.
- Data shuffling: This involves randomly rearranging or reordering values, so that they can no longer be associated with the records they originally belonged to.
- Data blurring: This involves reducing the precision of sensitive data, for example by rounding values or blurring images, so that the exact original cannot be read.
- Data redaction: This involves removing or blacking out sensitive data from a document or record.
- Data generalization: This involves replacing specific data with more general categories or ranges to preserve the overall statistical pattern without revealing sensitive details.
- Tokenization: This involves replacing sensitive data with a randomly generated token that represents the original data. The original data is then stored separately, and the token is used to look it up when an authorized party needs it.
- End-to-end encryption: This involves using a mathematical algorithm to transform sensitive data into a secure, encoded form that can only be decrypted with the proper key.
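Three of the methods listed above (pseudonymization, generalization, and tokenization) can be sketched in Go as follows. This is an illustration under stated assumptions, not a definitive implementation: the names `Pseudonymize`, `GeneralizeAge`, and `TokenVault` are hypothetical, the keyed HMAC is just one possible way to derive a stable pseudonym, and the in-memory map stands in for what would need to be a separately secured store.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Pseudonymize derives a pseudonym from value using HMAC-SHA256.
// The same key and value always give the same pseudonym, and rotating
// the key changes every pseudonym. (Hypothetical helper.)
func Pseudonymize(key []byte, value string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(value))
	// Keep the first 16 hex characters to get a short identifier.
	return "p_" + hex.EncodeToString(mac.Sum(nil))[:16]
}

// GeneralizeAge replaces an exact, non-negative age with a ten-year bracket.
func GeneralizeAge(age int) string {
	low := (age / 10) * 10
	return fmt.Sprintf("%d-%d", low, low+9)
}

// TokenVault maps random tokens back to original values. An in-memory map
// is an assumption made for brevity; it is not safe for concurrent use.
type TokenVault struct {
	byToken map[string]string
}

// NewTokenVault returns an empty vault.
func NewTokenVault() *TokenVault {
	return &TokenVault{byToken: make(map[string]string)}
}

// Tokenize stores value and returns a random token that stands in for it.
func (v *TokenVault) Tokenize(value string) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	token := "tok_" + hex.EncodeToString(buf)
	v.byToken[token] = value
	return token, nil
}

// Detokenize returns the original value for token, if the vault knows it.
func (v *TokenVault) Detokenize(token string) (string, bool) {
	value, ok := v.byToken[token]
	return value, ok
}

func main() {
	key := []byte("demo-key-not-for-production")
	fmt.Println(Pseudonymize(key, "alice@example.com"))
	fmt.Println(GeneralizeAge(37))

	vault := NewTokenVault()
	token, err := vault.Tokenize("4111111111111111")
	if err != nil {
		panic(err)
	}
	original, _ := vault.Detokenize(token)
	fmt.Println(token, original)
}
```

The pseudonym is deterministic, so joins and counts over masked data still work, while the token is random and carries no information about the value it replaces.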
Data masking is often used in industries that handle sensitive information, such as healthcare, finance, and government. In these industries, data masking can help to comply with regulations such as the General Data Protection Regulation (GDPR) in Europe and the Health Insurance Portability and Accountability Act (HIPAA) in the United States.
In summary, data masking is a valuable tool for protecting sensitive data from unauthorized access. It can be used in a variety of industries to ensure compliance with regulations and to prevent data breaches.
Implementation of Data Masking in Dgraph
One possibility for implementing native Data Masking in Dgraph would be to create a proxy Dgraph node that communicates with the cluster and functions as middleware, similar to Learner nodes. This proxy node would not take part in Raft consensus, but would have access to the data and the ability to route transactions. This would allow Data Masking to be integrated into the Dgraph Access Control List (ACL) while keeping the masking logic clearly separated from the rest of the cluster.
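To make the proxy idea more concrete, the masking step such a node might apply to a JSON query response can be sketched as follows. Everything here is an assumption for illustration: Dgraph has no such rule type today, and `Policy`, `MaskFunc`, `redact`, `keepLast4`, and `MaskResponse` are hypothetical names. In a real design the policy would presumably be selected from the ACL groups of the user issuing the query.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// MaskFunc transforms one string value into its masked form.
type MaskFunc func(string) string

// Policy maps a predicate name to the mask applied to its values.
// Hypothetical: this is not an existing Dgraph ACL rule type.
type Policy map[string]MaskFunc

// redact hides the value completely.
func redact(string) string { return "[REDACTED]" }

// keepLast4 masks all but the last four characters of the value.
func keepLast4(s string) string {
	r := []rune(s)
	if len(r) <= 4 {
		return strings.Repeat("*", len(r))
	}
	return strings.Repeat("*", len(r)-4) + string(r[len(r)-4:])
}

// applyPolicy walks a decoded JSON value and masks every string value
// whose key matches a predicate named in the policy.
func applyPolicy(node interface{}, p Policy) interface{} {
	switch v := node.(type) {
	case map[string]interface{}:
		for key, child := range v {
			if s, ok := child.(string); ok {
				if mask, found := p[key]; found {
					v[key] = mask(s)
				}
				continue
			}
			v[key] = applyPolicy(child, p)
		}
		return v
	case []interface{}:
		for i, child := range v {
			v[i] = applyPolicy(child, p)
		}
		return v
	default:
		return node
	}
}

// MaskResponse decodes a JSON response, applies the policy, and
// re-encodes the result.
func MaskResponse(raw []byte, p Policy) ([]byte, error) {
	var doc interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	return json.Marshal(applyPolicy(doc, p))
}

func main() {
	raw := []byte(`{"q":[{"name":"Alice","card":"4111111111111111"}]}`)
	policy := Policy{"name": redact, "card": keepLast4}
	masked, err := MaskResponse(raw, policy)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(masked))
}
```

Masking the response in the proxy keeps the stored data untouched, which fits the separation described above. This sketch only covers string values; non-string values and filters over masked predicates would need their own handling.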