Amazon Redshift - Dynamic Data Masking

Huzefa Khan
5 min readMar 14, 2023

--

Many organizations are now utilizing their data warehouse to store sensitive information such as Personally Identifiable Information (PII) and Payment Card Industry (PCI) data. However, the task of securely storing this data while making it easily accessible and optimized for performance and cost can be challenging. One approach to address this challenge is to use dynamic data masking, which allows customers to define rules for exposing only the necessary sensitive data with minimal impact on the application layer. To implement this strategy in Amazon Redshift, follow the steps outlined below.

Masking rules

To implement masking rules in Amazon Redshift, tables can be configured to restrict user access to raw data, with users instead being granted access to a view of the data. Within this view, each field containing Personally Identifiable Information (PII) can be wrapped in a function that dynamically applies the appropriate masking privilege based on the user accessing the data. These functions can also include tags that specify the masking rules to be applied to specific fields.

Masking rules may be different by column based on type (i.e. Email, SSN, Generic, Date of Birth). Each column can be tagged with a type to drive the rule which should be used.

In the following example, I’ve created a sample dataset with Client_info data.

drop table if exists public.Client_info_raw;
create table public.Client_info_raw(id int, first_name varchar(100), last_name varchar(100), login varchar(100), email_address varchar(100));
insert into public.Client_info_raw values
(1,'Jane','Doe','jdoe','jdoe@org.com'),
(2,'John','Doe','jhndoe','jhndoe@org.com'),
(3,'Edward','Jones','ejones','ejones@org.com'),
(4,'Mary','Contrary','mcontrary','mcontrary@org.com');

Now I can create a view which wraps the PII fields in my masking function. This view also joins to the user_entitle table to determine the user's masking privilege. This view is dynamic becuase it leverages the current_user variable which will be different for each user logged into the system.

create or replace view public.Client_info as (
select c.id,
f_mask_varchar(c.first_name, 'name', e.priv) first_name,
f_mask_varchar(c.last_name, 'name', e.priv) last_name,
f_mask_varchar(c.login, 'login', e.priv) login,
f_mask_varchar(c.email_address,'email', e.priv) email
from public.Client_info_raw c
left join public.user_entitle e on (current_user = e.username)
) with no schema binding;

Finally, I can grant access to the client_info view. While in this example the grants to the client_info view have been done for the individual user, it can also be granted at user group level. Notice: I have not granted access to the raw dataset to these users, only the view.

grant select on Client_info to u_fullmask;
grant select on Client_info to u_partialmask;
grant select on Client_info to u_nomask;
grant select on Client_info to u_newuser;

For this view to work, we need to create the masking function f_mask_varchar. For masking rules against other datatypes (e.g. INT, DATE) use the function f_mask_int and f_mask_date. Notice: The function contains partial masking rules for ssn and email but we have also tagged our data with name and login. For those tags the function can use the default masking strategy.

create or replace function f_mask_varchar (varchar, varchar, varchar)
returns varchar
immutable
as $$
select case
when $3 is null then null
when $3 = 'N' then $1
when $3 = 'F' then md5($1)
else case $2
when 'ssn' then substring($1, 1, 7)||'xxxx'
when 'email' then substring(SPLIT_PART($1, '@', 1), 1, 3) + 'xxxx@' + SPLIT_PART($1, '@', 2)
else substring($1, 1, 3)||'xxxxx' end
end
$$ language sql;

Masking privileges can be assigned at the user level and are are applicable to any DB object the user has access to. However, access to the DB object is still controlled at the DB user/group level. If a field is tagged as PII, privileges include:

  1. FullMask — the data is returned obfuscated, but it is not possible to determine the original value.
  2. PartialMask — part of the input value is masked while part is not masked.
  3. NoMasking — the input value is returned.

Masking privileges can be managed through a user entitlement table. Similar to the following:

drop table if exists public.user_entitle;
create table public.user_entitle (username varchar(25), priv varchar(1));

Create some sample users and load them into the entitlement table. Notice: u_newuser has not been inserted to simulate what happens when a user has not been entitled. While in this example the grants to the user_entitle table have been done for the individual user, it can also be granted at user group level.

create user u_fullmask password disable;
create user u_partialmask password disable;
create user u_nomask password disable;
create user u_newuser password disable;

grant select on user_entitle to u_fullmask;
grant select on user_entitle to u_partialmask;
grant select on user_entitle to u_nomask;
grant select on user_entitle to u_newuser;
insert into public.user_entitle values
('u_fullmask', 'F'),
('u_partialmask', 'P'),
('u_nomask', 'N');

Now I can execute my select simulating the experience for different users. The performance for each user is fast and the results are dynamically returned.

SET SESSION AUTHORIZATION 'u_fullmask';
select * from Client_info;
SET SESSION AUTHORIZATION 'u_partialmask';
select * from Client_info;
SET SESSION AUTHORIZATION 'u_nomask';
select * from Client_info;

The users who query the data will use the client_info table, while the ETL code that loads the data will access the client_info_raw table. To implement this strategy, either the user-facing application or the ETL process needs to be modified. However, deploying the view object can minimize the modifications required for the application to work seamlessly.

AWS redshift also provides built-in dynamic data masking but it is in preview release. you use this feature only with test clusters, and not in production environments. The public preview will end on April 5, 2023. https://docs.aws.amazon.com/redshift/latest/dg/t_ddm.html

--

--

Huzefa Khan
Huzefa Khan

Written by Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/

No responses yet