A small improvement might be for the redacted value to take the form
`XXX1...1000`, meaning XXX followed by a random number from 1 to 1000:
XXX54, XXX998, XXX456, ... Some randomness would prevent some apps from
flattening all rows into a single XXX'ed one, giving a more realistic
redacted data distribution/structure.
I am not sure about its value either, as that would still break any key
or other cross-referencing.
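To make the idea concrete, here is a minimal Python sketch. The function name `mask_with_suffix` and the sample values are made up for illustration; nothing here comes from the CEP. It only shows that a random suffix keeps rows from collapsing into one value while still giving no stable mapping for keys:

```python
import random

def mask_with_suffix(value, rng, max_suffix=1000):
    """Hypothetical mask: 'XXX' plus a random number in [1, max_suffix].

    The original value is deliberately ignored, so the same input may get
    a different mask on every read.
    """
    return f"XXX{rng.randint(1, max_suffix)}"

rng = random.Random(42)  # seeded only to make the sketch reproducible
ssns = ["111-22-3333", "444-55-6666", "111-22-3333"]
masked = [mask_with_suffix(s, rng) for s in ssns]

# Rows are unlikely to collapse into a single "XXX" value...
print(masked)
# ...but equal inputs are not guaranteed to produce equal masks, so joins
# or key lookups on the masked column are still unreliable.
print(masked[0] == masked[2])
```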
My 2cts.
On 22/8/22 1:30, Andrés de la Peña wrote:
> If the column names are the same for masked and unmasked data, it would
> impact existing applications. I am curious what the transition plan looks
> like for applications that expect unmasked data?
> For example, let’s say you store SSNs and birth dates. Upon
> enabling this feature, let’s say the app user is not given the
> UNMASK permission. Now the app is receiving masked values for
> these columns. This is fine for most read-only applications.
> However, a lot of times these columns may be used as primary keys
> or part of primary keys in other tables. This would break existing
> applications.
> How would this work in mixed mode when new nodes in the cluster
> are masking data and others aren’t? How would it impact the driver?
> How would the application learn that the column values are masked?
> This is important in case a user has the UNMASK permission and it is
> later taken away. Again this would break a lot of applications.
Changing the masking of a column is a schema change, and as such it
can be risky for existing applications. However, unlike deleting a
column or revoking a SELECT permission, suddenly activating masking
might go undetected by existing applications.
Applications developed after the introduction of this feature can
check the table schema to know if a column is masked or not. We can
even add a specific system view to ease this, if we think it's worth
it. However, administrators should not activate masking when there
could be applications that are not aware of the feature. We should be
clear about this in the documentation.
This is the way data masking seems to work in the databases I've
checked. I also thought that we could just change the name of the
column when it's masked to something like "masked(column_name)", as
discussed in the CEP document. This would make it impossible to
miss that a column is masked. However, applications would have to be
prepared to use different column names when reading result sets,
depending on whether the data is masked for them or not. None of the
databases mentioned in the "other databases" section of the CEP do this
kind of column renaming, so it might be somewhat exotic behaviour. wdyt?
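As a rough illustration of what the renaming option would ask of applications, here is a small Python sketch. It assumes result rows arrive as dicts and that a masked column is reported as "masked(<column_name>)"; both the helper `read_column` and that exact naming are hypothetical:

```python
def read_column(row, column_name):
    """Return (value, is_masked) for a row given as a dict.

    Assumes the hypothetical renaming scheme where a masked column shows
    up in the result set as "masked(<column_name>)".
    """
    masked_name = f"masked({column_name})"
    if masked_name in row:
        return row[masked_name], True
    return row[column_name], False

# The same query could return either shape, depending on permissions.
unmasked_row = {"id": 1, "ssn": "111-22-3333"}
masked_row = {"id": 1, "masked(ssn)": "XXXX"}

print(read_column(unmasked_row, "ssn"))
print(read_column(masked_row, "ssn"))
```

Every read path would need a lookup like this, which is the extra burden on applications mentioned above.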
On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña <adelap...@apache.org>
wrote:
> This type of feature is very useful, but it may be easier to
> analyze this proposal if it’s compared with other DDM
> implementations from other databases? Would it be reasonable
> to add a table to the proposal comparing syntax and output
> from e.g. Azure SQL vs Cassandra vs whatever?
Good idea. I have added a section at the end of the document
briefly describing how some other databases deal with data
masking, with links to their documentation on the topic. I am
not an expert in any of those databases, so please take my
comments there with a grain of salt.
On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa <jji...@gmail.com> wrote:
This type of feature is very useful, but it may be easier to
analyze this proposal if it’s compared with other DDM
implementations from other databases? Would it be reasonable
to add a table to the proposal comparing syntax and output
from e.g. Azure SQL vs Cassandra vs whatever?
On Aug 19, 2022, at 4:50 AM, Andrés de la Peña
<adelap...@apache.org> wrote:
Hi everyone,
I'd like to start a discussion about this proposal for
dynamic data masking:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking
Dynamic data masking allows obscuring sensitive information
without changing the stored data. It would be based on a set
of native CQL functions providing different types of masking,
such as replacing the column value with "XXXX". These functions
could be used as regular functions or attached to table
columns with CREATE/ALTER TABLE. There would be a new UNMASK
permission, so only users with this permission would be
able to see the unmasked column values. It would be possible
to customize masking by using UDFs as masking functions.
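The intended behaviour can be sketched in a few lines of Python. This is only a toy model, not the proposed implementation: `TABLE`, `COLUMN_MASKS`, `mask_replace` and `select` are hypothetical names, and the dict stands in for whatever CREATE/ALTER TABLE would declare.

```python
# Stored data is never modified; masking happens only when a row is read.
TABLE = [{"id": 1, "name": "Alice", "ssn": "111-22-3333"}]

def mask_replace(value):
    """Hypothetical masking function: replace any value with "XXXX"."""
    return "XXXX"

# Hypothetical column -> masking function attachment, standing in for
# what CREATE/ALTER TABLE would declare.
COLUMN_MASKS = {"ssn": mask_replace}

def select(rows, permissions):
    """Return copies of the rows, masking columns unless UNMASK is held."""
    if "UNMASK" in permissions:
        return [dict(r) for r in rows]
    return [
        {col: COLUMN_MASKS[col](val) if col in COLUMN_MASKS else val
         for col, val in r.items()}
        for r in rows
    ]

print(select(TABLE, {"SELECT"}))            # ssn is masked
print(select(TABLE, {"SELECT", "UNMASK"}))  # ssn is returned as stored
```

A UDF used as a masking function would simply be another entry in the column-to-function mapping.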
Thanks,