Maybe a small improvement would be for the redacted value to take the form `XXX1...1000`, meaning XXX followed by a random number from 1 to 1000: XXX54, XXX998, XXX456, ... Some randomness would prevent some apps from flattening all rows into a single XXX'ed one, giving a more realistic redacted data distribution/structure.

I am not sure about its value either, as that would still break any key or other cross-referencing.
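
To make the trade-off concrete, here is a rough Python sketch of the idea. It is not CQL and not anything from the CEP: the function names `mask_constant` and `mask_with_random_suffix`, the seeded generator, and the made-up SSN values are all assumptions for illustration only. It compares how many distinct values survive a constant mask versus a random-suffix mask, and then masks the same stored value on two separate reads.

```python
import random

def mask_constant(value, prefix="XXX"):
    """Hypothetical constant mask: every value becomes the same string."""
    return prefix

def mask_with_random_suffix(value, rng, prefix="XXX", low=1, high=1000):
    """Hypothetical random-suffix mask: ignores the real value and returns
    the prefix followed by a random integer in [low, high]."""
    return f"{prefix}{rng.randint(low, high)}"

rng = random.Random(42)  # seeded only so the sketch is repeatable
ssns = [f"000-00-{i:04d}" for i in range(50)]  # made-up sample values

constant_masked = [mask_constant(s) for s in ssns]
random_masked = [mask_with_random_suffix(s, rng) for s in ssns]

# A constant mask collapses every row to one distinct value,
# while the random suffix keeps the rows spread over many values.
distinct_constant = len(set(constant_masked))
distinct_random = len(set(random_masked))
print(distinct_constant, distinct_random)

# The same stored value masked on two separate reads: nothing ties the
# two results together, so a masked value cannot be used as a key.
first_read = mask_with_random_suffix(ssns[0], rng)
second_read = mask_with_random_suffix(ssns[0], rng)
print(first_read, second_read)
```

The first print shows the distribution benefit; the second shows why cross-referencing still breaks, since the suffix is unrelated to the underlying value.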

My 2cts.

On 22/8/22 1:30, Andrés de la Peña wrote:

    > If the column names are the same for masked and unmasked data, it would
    impact existing applications. I am curious what the transition plan looks
    like for applications that expect unmasked data?

    For example, let’s say you store SSNs and Birth dates. Upon
    enabling this feature, let’s say the app user is not given the
    UNMASK permission. Now the app is receiving masked values for
    these columns. This is fine for most read only applications.
    However, a lot of times these columns may be used as primary keys
    or part of primary keys in other tables. This would break existing
    applications.
    How would this work in mixed mode when new nodes in the cluster
    are masking data and others aren’t? How would it impact the driver?
    How would the application learn that the column values are masked?
    This is important in case a user has the UNMASK permission and it
    is later taken away. Again this would break a lot of applications.


Changing the masking of a column is a schema change, and as such it can be risky for existing applications. However, unlike deleting a column or revoking a SELECT permission, suddenly activating masking might go undetected by existing applications.

Applications developed after the introduction of this feature can check the table schema to know if a column is masked or not. We can even add a specific system view to ease this, if we think it's worth it. However, administrators should not activate masking when there could be applications that are not aware of the feature. We should be clear about this in the documentation.

This is the way data masking seems to work in the databases I've checked. I also thought that we could just change the name of the column when it's masked to something like "masked(column_name)", as is discussed in the CEP document. This would make it impossible to miss that a column is masked. However, applications would then need to be prepared to use different column names when reading result sets, depending on whether the data is masked for them or not. None of the databases mentioned in the "other databases" section of the CEP does this kind of column renaming, so it might be a rather exotic behaviour. wdyt?

On Fri, 19 Aug 2022 at 19:17, Andrés de la Peña <adelap...@apache.org> wrote:

        > This type of feature is very useful, but it may be easier to
        analyze this proposal if it’s compared with other DDM
        implementations from other databases? Would it be reasonable
        to add a table to the proposal comparing syntax and output
        from eg Azure SQL vs Cassandra vs whatever ?

    Good idea. I have added a section at the end of the document
    briefly describing how some other databases deal with data
    masking, and with links to their documentation for the topic. I am
    not an expert in any of those databases, so please take my
    comments there with a grain of salt.

    On Fri, 19 Aug 2022 at 17:30, Jeff Jirsa <jji...@gmail.com> wrote:

        This type of feature is very useful, but it may be easier to
        analyze this proposal if it’s compared with other DDM
        implementations from other databases? Would it be reasonable
        to add a table to the proposal comparing syntax and output
        from eg Azure SQL vs Cassandra vs whatever ?


        On Aug 19, 2022, at 4:50 AM, Andrés de la Peña
        <adelap...@apache.org> wrote:

        
        Hi everyone,

        I'd like to start a discussion about this proposal for
        dynamic data masking:
        
        https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-20%3A+Dynamic+Data+Masking

        Dynamic data masking allows obscuring sensitive information
        without changing the stored data. It would be based on a set
        of native CQL functions providing different types of masking,
        such as replacing the column value with "XXXX". These functions
        could be used as regular functions or attached to table
        columns with CREATE/ALTER TABLE. There would be a new UNMASK
        permission, so only the users with this permission would be
        able to see the unmasked column values. It would be possible
        to customize masking by using UDFs as masking functions.

        Thanks,
