Video
Automating Masking for Databricks with Perforce Delphix [Demo]
Data Management, Security & Compliance
Managing sensitive data in Databricks test and development environments poses major compliance risks, particularly with regulations like GDPR and HIPAA. This video demo shows how the Perforce Delphix platform automates data discovery and irreversible masking directly inside Databricks. See how you can efficiently identify PII, apply robust masking algorithms, and deliver secure, production-like data for analytics and AI — without exposing sensitive information.
Watch the Demo to Learn How To:
- Configure the Delphix API to run masking jobs within a Databricks notebook.
- Execute a sensitive data discovery process to automatically identify PII across tables and schemas.
- Review discovery results and assign specific masking algorithms for different data types.
- Run a masking workflow to create safe, synthetic data and write it to a new catalog.
- Validate the masking process by comparing original and masked data sets.
- Ensure data remains compliant with GDPR, HIPAA, and other regulatory requirements.
See Databricks Masking in Action
See how Delphix solutions make Databricks data masking easy.
Contact Us for Databricks Masking
Full Transcript
Hello, everyone. This is Jatinder Luthra. Today, I'm back with another solution: masking Databricks data using Perforce Delphix.
If you have already watched my previous videos, you might know about the Delphix Compliance Service. In short, the Delphix Compliance Service is masking delivered as an API, which allows you to discover and mask sensitive data without managing any compliance infrastructure.
The question is, where is masked data used? You can mask the data in Databricks for several different use cases. First, it is used for non-production environments, such as your dev, test, and QA environments in Databricks, to avoid exposing sensitive data. You can use it in your data science and ML projects to train your models without violating privacy or compliance rules. You can also safely share the masked data with your partners or across the team.
Masked data enables insights without compromising PII or other sensitive fields for your AI, analytics, and BI processes. It is also helpful for your compliance and audits, as it meets GDPR, HIPAA, and other regulatory requirements.
Now, let's move to the demo. This is a Jupyter Notebook in Databricks. We're going to read data from one catalog, mask it, and write it to another catalog. In this case, the data we are reading from is the DCS Azure source catalog, and we are going to write to the analytics DCS Azure masked catalog. So let's look at those two catalogs. This is our source catalog. This is our masked catalog. We are focused on the prod app schema under the source, and the destination is the analytics app schema under the masked catalog.
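As a rough illustration (not the actual Delphix notebook code), reading from the source catalog and targeting the masked catalog in a Databricks notebook might look like this. The catalog, schema, and table names are assumptions based on what is shown in the demo:

```python
# Minimal sketch, assuming Unity Catalog three-level names that approximate the demo.
# `spark` is the session Databricks provides in every notebook.
source_table = "dcs_azure_source.prod_app.patient_details"                  # assumed source
masked_table = "analytics_dcs_azure_masked.analytics_app.patient_details"   # assumed destination

df = spark.read.table(source_table)  # production data to be discovered and masked
df.printSchema()                     # inspect the columns before running discovery
```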
Let's start with the discovery process before we do any masking. When we do the discovery, we hold the inventory in another catalog, which we use as a metadata store. It is empty right now.
Let's run a few steps before we run the discovery. We are doing some configurations, as you see. We import what we need, configure the DCS client, configure the Databricks manager, and then initialize the Databricks client to read the local data.
We do some centralized configuration management by passing all the different API secret keys that we need. Those secret keys are stored in Databricks secrets and include your tenant IDs and your DCS API service keys. All of that information is in Databricks secrets. Some helper functions help us along the way. We also do health check and metadata viewer configurations by setting up those functions here.
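For reference, pulling those values out of Databricks secrets typically looks like the sketch below. The scope and key names are placeholders, not the ones used in the demo:

```python
# Retrieve the DCS credentials from a Databricks secret scope.
# Scope and key names are placeholders; Databricks redacts these values in notebook output.
tenant_id   = dbutils.secrets.get(scope="dcs-demo", key="tenant-id")
dcs_api_key = dbutils.secrets.get(scope="dcs-demo", key="dcs-api-key")
```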
We want to run the process in parallel, so we set up parallel processing and run the discovery. When we kick off the discovery, it will populate our inventory, and we will see what has been identified. We first have to change our workflow type to discovery at the top, as you can see here, and we are pointing our discovery to scan the app schema. We are not providing a table list here, so it will scan all the tables under the prod app schema.
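As a hypothetical sketch of that configuration (the real notebook's objects and function names may differ), the discovery run is driven by a workflow switch and a schema scope, with no table list so every table is scanned:

```python
# Hypothetical configuration shape; names are illustrative, not the notebook's actual API.
discovery_config = {
    "workflow_type": "discovery",           # discovery first, masking later
    "source_catalog": "dcs_azure_source",   # assumed catalog name from the demo
    "source_schema": "prod_app",            # scope of the scan
    "tables": None,                         # no table list => scan every table in the schema
}
# run_workflow(discovery_config)  # hypothetical entry point that populates the inventory
```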
So now our discovery is running. We will wait for the discovery to complete, and then we will visit our inventory store to see what has been identified. Our discovery process has completed.
So let's see. Our results are populated in the inventory, and we see all the results. We will focus on specific columns. The Delphix Compliance Service discovery has identified all of these columns as sensitive, and for each one it recommended a masking algorithm along with a confidence score. We will review the results and make some changes. Making changes means setting an algorithm in the assigned algorithm column.
Think of the assigned algorithm column as an instruction for the masking step. The masking step will read the assigned algorithm column. Whatever algorithm is assigned, it will use those algorithms to mask the data in that specific column.
Then we make changes. For example, if insurance provider is identified as a full name but is not actually sensitive, I can remove the algorithm. Some algorithms we apply as-is. For some columns, like a personal identity number, we do not want to nullify the value; instead, I assign a different algorithm to transform the data. With those changes made after reviewing the discovery results, we are ready to run the masking.
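For illustration, edits like these could be applied directly to the inventory Delta table with SQL. The inventory table name, column names, and algorithm name below are all assumptions, not the demo's actual values:

```python
# Hedged sketch: adjust the assigned algorithms in the (assumed) inventory table.
inventory_table = "dcs_inventory.discovery.results"  # hypothetical metadata-store table

# Insurance provider was flagged as a full name but is not sensitive: clear the algorithm.
spark.sql(f"""
    UPDATE {inventory_table}
    SET assigned_algorithm = NULL
    WHERE column_name = 'insurance_provider'
""")

# Personal identity number: swap in an algorithm that transforms the value instead of
# nullifying it ('AlphaNumericReplacement' is a placeholder algorithm name).
spark.sql(f"""
    UPDATE {inventory_table}
    SET assigned_algorithm = 'AlphaNumericReplacement'
    WHERE column_name = 'personal_identity_number'
""")
```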
Before we run the masking, let's look at the data present in all the different tables. As we see in the original patient details, we have 10,000 rows. The other tables have 10,000 rows each, while the tables in our masked schema are still empty.
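A quick row-count check like the one just described could be written as follows; the table names are assumptions:

```python
# Sanity check before masking: count rows in the (assumed) source tables.
for table in ["patient_details", "person", "vehicle_data"]:
    count = spark.table(f"dcs_azure_source.prod_app.{table}").count()
    print(f"{table}: {count} rows in the source schema")
```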
So let's go ahead and run the masking workflow, and we will revisit and see the new records in the masking schema.
Let's change this workflow to masking, and we will rerun it. We are doing mask and delivery: we are reading from one source, masking it, and writing to another. We could also change it to in-place if we wanted to do in-place masking. Then we supply the information about where to write: the masked catalog, which is our destination, and the destination schema, which is the analytics app schema.
Our write mode is overwrite, which means if there is any data, it will be cleaned up before new data is written. In our case, it is already empty. Let's wait for masking to complete, and then we will look into the data.
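For context, overwrite semantics in PySpark work as shown below: any existing rows in the destination table are replaced by the new write. The DataFrame here is a stand-in; in the real workflow, the masking step produces the DataFrame that gets written, and the table names are assumed:

```python
# Illustration of write mode "overwrite" only. In the real workflow this DataFrame would
# be the masked output, not a direct copy of the source; table names here are assumed.
demo_df = spark.table("dcs_azure_source.prod_app.patient_details")
demo_df.write.mode("overwrite").saveAsTable(
    "analytics_dcs_azure_masked.analytics_app.patient_details"
)
```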
All right. Our masking process has completed.
Let's count the number of records across the tables, and we see the data is populated: 10,000 records are present in our destination tables. Let's do some comparisons. For the person table, we will pick some row IDs and compare the original and masked data. For this specific row ID, we see the first name has been changed, along with the address, city, and email address. Now, let's look at the patient details.
Here we see our first names, last names, and even the full names. All of those values have been replaced with masked, synthetic values. This also includes CPD data. Basically, any column that had an algorithm assigned in our inventory has been changed and masked.
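A side-by-side check like the ones above might be written as follows; the table names, join key, and column names are assumptions:

```python
from pyspark.sql.functions import col

# Hedged sketch: join the original and masked person tables on an assumed row id
# and compare a few of the masked columns side by side.
original = spark.table("dcs_azure_source.prod_app.person")
masked   = spark.table("analytics_dcs_azure_masked.analytics_app.person")

comparison = (
    original.alias("o")
    .join(masked.alias("m"), on="row_id")
    .select(
        col("row_id"),
        col("o.first_name").alias("original_first_name"),
        col("m.first_name").alias("masked_first_name"),
        col("o.email").alias("original_email"),
        col("m.email").alias("masked_email"),
    )
)
comparison.show(5, truncate=False)
```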
For some tables, like vehicle data, that do not contain any sensitive data, we move the data as-is, so the original and masked data remain the same because nothing is sensitive there. This concludes my masking demo for Databricks using Databricks notebooks. If you are interested in the Jupyter Notebook used in this demo, you can get it from my GitHub repo: mask-databricks-notebook. You will find the notebook there along with instructions on how to use it.
Thank you very much. I will see you in the next demo.