GitHub user houkawa2 edited a discussion: [Proposal] Data Clean Room Platform
Prototype Based On Cloudberry
### Proposers
houkawa2
### Proposal Status
Under Discussion
### Abstract
Build a data clean room (DCR) platform prototype. I would like to create a **new
DCR database** where all join and query operations will be performed. Processed
view data from participating users, along with the `cross-reference-table`
(generated by the MPC layer), will be stored in different schemas within this
database. We'll establish various permissioned roles to execute operations,
ensuring **minimal data access** for each role.
### Motivation
In today's data-driven landscape, enterprises frequently need to collaborate on
sensitive datasets to unlock mutual value. As a Massively Parallel Processing
(MPP) database, Cloudberry offers a robust SQL engine and the scalability
needed to support a DCR platform.
## Operating Mode
DCRs built on cloud platforms like Snowflake and AWS leverage the underlying
cloud service providers' distributed storage (e.g., AWS S3) and distributed
compute units (e.g., AWS EC2 instances). While Snowflake operates as a data
platform _running on_ these cloud infrastructures, AWS Clean Rooms is a
dedicated service _built upon_ them.
Both approaches facilitate similar user interaction models, predominantly the
PROVIDER-CONSUMER model and a third-party orchestration/governance model (where
a neutral party coordinates and defines data analysis rules).
The process for using a DCR in the **PROVIDER-CONSUMER model** typically
involves the following steps:
- **DCR Creation and Invitation:** One party initiates the creation of a DCR,
then invites other participating parties to join for secure data collaboration
and analysis.
- **Rule Definition and Data Contribution:** All participating users, acting as
data **PROVIDERs** for their own datasets, are responsible for defining
stringent privacy-enhancing rules. These rules cover data aggregation
constraints, allowed join keys, row-level filters, column-level permissions,
and often include parameters for differential privacy. These rules are applied
to their **contributed data views** to ensure that raw, sensitive data details
are not directly accessible to other participants.
- **Query Submission and Enforcement:** After all participants have configured
and submitted their rules (which are typically displayed for transparency on
the frontend), users (both **PROVIDERs** and **CONSUMERs**) can submit custom
SQL queries via the frontend interface. The DCR's backend engine strictly
_enforces_ all defined rules during query parsing and execution against
privacy-protected, aggregated views of the joined data.
- **Results Retrieval:** Upon completion of the query, all authorized parties
can retrieve their respective analysis results via the frontend interface.
### Implementation
## Building Proposal
This proposal is implemented using Cloudberry as the storage database and SQL
execution engine. Considering Spring Boot's advantages in concurrency for
enterprise-level platforms, I am using Spring Boot as the main framework, with
independent Python microservices providing MPC services. Given the
enterprise-level data volume, Kafka will be used for asynchronous communication
between Spring Boot and Python.
## Framework
**Spring Boot Logic Layer** -- Responsible for API entry, authentication and
authorization, request validation, notifications, and DCR backend logic
implementation. Spring Cloud Config Server or Vault will be used for managing
role login passwords.
**Python MPC Logic Layer** -- Responsible for implementing higher-order MPC
functionality for table joins.
**Kafka Message Queue** -- Responsible for secure communication between Spring
Boot and Python MPC services, and providing better concurrency performance.
**Cloudberry Data Layer** -- Responsible for data storage, database role
authorization, row and column-level access control, aggregation analysis, join,
and other DCR analysis functionalities.
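To make the Kafka handoff between the Spring Boot and Python MPC layers concrete, here is a minimal Python sketch of a hypothetical message format that Spring Boot could publish and the MPC service could consume. The field names (`dcr_id`, `user_id`, `hashed_keys`) are illustrative assumptions, not part of the proposal:

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical schema for the Spring Boot -> Python MPC handoff over Kafka.
# In practice the JSON string below would be the Kafka record value.
@dataclass
class MpcJoinRequest:
    dcr_id: str             # which clean room this batch belongs to
    user_id: str            # the contributing participant
    hashed_keys: List[str]  # salted hashes of the join-column values

    def to_json(self) -> str:
        # Serialize for the Kafka producer on the Spring Boot side.
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(payload: str) -> "MpcJoinRequest":
        # Deserialize in the Python MPC consumer.
        return MpcJoinRequest(**json.loads(payload))
```

Keeping the payload as plain JSON keeps the two services decoupled; only the record schema has to be agreed between them.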
### Cloudberry Data Layer
The data layer primarily consists of two parts: the user's own database and the
created DCR database.
**User's Own Database:** Used for storing uploaded raw data. Initial row and
column-level access restrictions, as well as hashing of sensitive join columns
and IDs, are also performed within the user's own database. Only encrypted
views are transferred from the user's database to the DCR database.
**DCR Database:** The DCR database has four layers (schemas):
data-staging-schema, data-matching-schema, data-analysis-schema,
data-result-schema.
- **data-staging-schema:** Used for storing views input from user databases.
In this layer, different users' views are logically isolated, preventing users
from seeing each other's data views.
- **data-matching-schema:** Used for storing the unified cross-reference-table
for DCR participants, output from the MPC layer.
- **data-analysis-schema:** Used for table joins and backend view generation
when users execute SQL queries in the frontend.
- **data-result-schema:** Used for storing users' SQL query results.
### Spring Boot Logic Layer
This layer acts as a gateway, responsible for API entry, authentication and
authorization, request validation, notifications, and the overall DCR backend
logic implementation. Spring Cloud Config Server or Vault will be used to
manage role login passwords. The specific implementation logic will be detailed
in the subsequent **Function Points** section.
#### Role Summary
In the Spring Boot Logic layer, database object permissions are primarily
managed by Cloudberry's role functionality, with data isolation achieved through
logical segregation. Below is a description of the roles required to implement
this DCR system and their functionalities.
**super_admin_role:** This is the highest management role in the system, used
to manage all users' data and the DCR database. It possesses SELECT, INSERT,
UPDATE, DELETE, CREATE, DROP, and other permissions on all databases.
**User's Personal DB Management Related Roles:**
**user_{userId}_data_extractor_role:** This is the individual user's management
role. The user has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE, DROP TABLE
permissions for the databases and tables they create. No one can view or modify
personal data without authorization.
**DCR DB Related Roles:**
**dcr_admin_role:** This is the DCR database management role, managed by the
Spring Boot backend. It has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE
permissions for all schemas within the DCR database.
**dcr_data_staging_{userId}_role:** This role is used by individual users to
upload views into the DCR database. Each user has their own specific role, and
users cannot access each other's data. It possesses INSERT and SELECT
permissions for their own data tables within the `data-staging-schema`.
**dcr_data_matching_role:** Primarily used by the Spring Boot backend to manage
and maintain the `cross-reference-table` within the `data-matching-schema`. It
is responsible for receiving the unified `cross-reference-table` data output
from the MPC layer and ensuring its correct storage and availability for
subsequent analysis. It has SELECT, INSERT, UPDATE, and DELETE permissions for
this schema, but cannot CREATE TABLE; tables must first be created by
`dcr_admin_role`.
**dcr_data_analysis_role:** Primarily used by the Spring Boot backend to manage
and maintain data within the `data-analysis-schema`. When joining tables, it
needs to fetch data files from the `data-staging-schema` and
`data-matching-schema` and generate relevant joined views. It also needs to use
this role to place SQL query results into the `data-result-schema`. Therefore,
this role holds:
- SELECT permissions on `data-staging-schema` and `data-matching-schema`.
- CREATE VIEW, ALTER VIEW, DROP VIEW, SELECT, INSERT, UPDATE, and DELETE
permissions on `data-analysis-schema`.
- CREATE VIEW, ALTER VIEW, DROP VIEW, INSERT, UPDATE, and DELETE permissions on
`data-result-schema`.
**dcr_data_analyst_role:** This is a role specifically used for executing user
SQL queries, used by the Spring Boot backend. It has SELECT permissions for
data within the `data-analysis-schema`.
**dcr_data_result_{userId}_role:** Users utilize this role to access their
respective results. It has SELECT permission for its own results within the
`data-result-schema`.
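As an illustration of the minimal-permission mapping above, the following Python sketch generates GRANT statements from a role/permission table. Cloudberry's GRANT syntax is PostgreSQL-derived, so `ON ALL TABLES IN SCHEMA` should apply, but verify against the Cloudberry docs; the hyphenated schema names above are written with underscores here so they remain valid unquoted SQL identifiers:

```python
# Illustrative role -> {schema: permissions} table, following the role summary
# above (only two roles shown; extend for the full set).
ROLE_GRANTS = {
    "dcr_data_matching_role": {
        "data_matching_schema": ["SELECT", "INSERT", "UPDATE", "DELETE"],
    },
    "dcr_data_analyst_role": {
        "data_analysis_schema": ["SELECT"],
    },
}

def grant_statements(role: str) -> list:
    """Build one GRANT per schema, covering all tables in that schema."""
    stmts = []
    for schema, perms in ROLE_GRANTS[role].items():
        stmts.append(
            f"GRANT {', '.join(perms)} ON ALL TABLES IN SCHEMA {schema} TO {role};"
        )
    return stmts
```

Generating the grants from one declarative table makes it easy to audit that every role really holds only the minimal permissions listed above.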
#### Spring Boot Internal Database
The Spring Boot internal database is used to store associated data required for
frontend display.
<img width="1451" height="1239" alt="image"
src="https://github.com/user-attachments/assets/433647ec-a5f8-41d7-8289-7f9f6f15a522"
/>
There is a many-to-many relationship between the `User` entity and the
`CleanRoom` entity, thus the `UserCleanRoom` entity is introduced as a junction
table. Within `UserCleanRoom`, the `userStatus` attribute indicates the role of
that specific `User` within that `CleanRoom` (e.g., `PROVIDER` or `CONSUMER`).
A `User` has their own `Database`, which contains `Table`s, and `Table`s
contain `Column`s. Since users need to manually select visible `Column`s when
configuring row and column-level access policies, a separate `Column` entity
has been established. There is a many-to-many relationship between `Column` and
`Policy` (a `Column` can be used in multiple `Policy` settings, and a single
`Policy`'s row/column access configuration can involve multiple `Column`s).
Therefore, a `PolicyColumn` junction table is created. In `PolicyColumn`, the
`isAllowed` attribute indicates whether a specific `Column` is permitted to be
shared and visible within that `Policy`; the `isJoinKey` attribute indicates
whether this `Column` is designated as a join column within that `Policy`. The
detailed policy content will be stored as a JSON-formatted `policyRules`
attribute within the `Policy` instance.
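For illustration, here is one possible shape for the JSON-formatted `policyRules` payload, together with a minimal Python loader. Every field name in this sketch is an assumption, not a committed design:

```python
import json

# Hypothetical contents of a Policy's `policyRules` attribute; all field
# names here are illustrative assumptions.
example_policy_rules = {
    "allowed_aggregates": ["SUM", "COUNT", "AVG"],
    "row_filter": "country = 'US'",
    "min_group_size": 25,
    "differential_privacy": {"epsilon": 1.0, "sensitivity": 1.0},
}

def load_policy_rules(raw: str) -> dict:
    """Parse a policyRules payload and check the one field every policy needs."""
    rules = json.loads(raw)
    if "allowed_aggregates" not in rules:
        raise ValueError("policyRules must list allowed_aggregates")
    return rules
```

Storing the rules as a JSON attribute keeps the `Policy` entity schema stable while the rule vocabulary evolves.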
In my design, `Policy`, `View`, and `Notification` are all dependent on the
existence of a DCR. If a DCR does not exist, then these three entities also do
not exist. Furthermore, within the same DCR, there can be different `User`s,
and each `User` can construct different `View`s, `Policy`s, and `Notification`s
(as the sender of the notification differs, and the DCR joined also differs).
Therefore, I have connected these three entities to the `UserCleanRoom` entity.
A `View` is constructed based on a `User`'s `Table` and according to the
`Policy` set by the `User`. `View` has a many-to-one relationship with both
`Table` and `Policy`.
The `Notification` entity needs to be connected to the `User` entity because it
requires a `User` to receive the notification; thus, the relationship is
many-to-one from `Notification` to `User`. The `status` attribute of
`Notification` represents the state of the notification, such as whether the
notification has been sent, whether the recipient has agreed to join the DCR,
and other enumeration states.
## Required Third-Party Libraries/Plugins
- JSqlParser
- Python MPC: PSI protocol
- Spring Cloud Config Server or Vault
- Firebase Cloud Messaging (FCM)
## Function Points
- **User Registration & Login -- setup & login**
- **User Database Creation**
- **User Uploads CSV/JSON Files:** Do we need a temporary storage path in the
cloud that is accessible to the backend server? Cloudberry would fetch CSV/JSON
files from this reachable address and load them into Cloudberry via `gpload`.
- **Create DCR**
- **Join DCR:** When the initiator of a DCR invites someone, FCM sends a
notification to that user; after accepting the invitation, the user joins the
DCR.
- **Row and Column-Level Access Control:** Use Cloudberry to apply the rules of
row and column-level access control.
- **Join**
- Users configure the join columns in the frontend. The backend, based on
these frontend parameters, will hash and encrypt the join columns' values
specified in the policy, along with the IDs of the respective tables.
- Subsequently, by switching to the `dcr_data_staging_{userId}_role`, the
hashed data will be ingested into the `data-staging-schema` of the DCR database
using the same `gpload` method as for user-uploaded CSV/JSON files.
- Spring Boot will then read this data (via the
`dcr_data_staging_{userId}_role`) and transfer it to the MPC layer through the
message queue.
- After the `cross-reference-table` is generated by the MPC module, Spring
Boot will switch to the `dcr_data_matching_role` to store the
`cross-reference-table` in the `data-matching-schema`.
- Upon receiving the user's query SQL from the frontend, a third-party SQL
parser plugin will be used to extract the table names involved in the user's
join query. Spring Boot will then switch to the `dcr_data_analysis_role` and
execute dynamically generated join SQL to join the target tables via the
`cross-reference-table` and views in `data-staging-schema`, storing the result
in the `data-analysis-schema`.
- Next, Spring Boot will switch to the `dcr_data_analyst_role` and perform
a `SELECT` query on the target view in `data-analysis-schema` based on the
dynamically generated SQL.
- Finally, Spring Boot will switch to the `dcr_data_analysis_role` to store
the generated result in the `data-result-schema`, where users can later read
it via their `dcr_data_result_{userId}_role`.
- **Aggregation:** The backend logic will check whether the user's SQL query
contains aggregation functions that are not allowed. Only compliant SQL
queries can be executed by the SQL engine.
- **User SQL Query**
- The DCR will provide users with pre-defined queryable tables, columns, and
aggregation rules. Users can then submit custom SQL query requests through the
DCR frontend based on these established rules.
- `JSqlParser` is used to parse the user's query SQL, verifying whether the
aggregation functions used in the SQL (such as SUM, COUNT, AVG) are within the
list permitted by the DCR's policies.
- Once query validation succeeds, the Spring Boot backend will use the
permissions of the **`dcr_data_analyst_role`** to execute the query within the
DCR database. User queries are performed against existing analytical views
within the `data-analysis-schema`. These views inherently encapsulate the
join logic with the `cross-reference-table` and the view data from the various
parties (originating from the `data-staging-schema`). The database then
returns the aggregated results to the Spring Boot backend. After the query
completes, the Spring Boot backend will apply differential privacy (adding
noise) to these aggregated results to further protect against the inference of
data details.
- The Spring Boot backend, by switching to the permissions of the
**`dcr_data_analysis_role`**, stores these results into the user's dedicated
result table (or view) within the `data-result-schema`, making them available
for subsequent on-demand access.
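The join-column hashing step described under **Join** could be sketched as follows, using a keyed hash (HMAC-SHA-256) with a per-DCR shared salt so that equal values from different participants hash identically. This is a simplification for illustration; the actual matching would rely on the PSI protocol in the MPC layer:

```python
import hashlib
import hmac

def hash_join_key(value: str, salt: bytes) -> str:
    """Keyed hash of one join-column value. The salt must be agreed per DCR
    so that equal values from different participants hash identically."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

def hash_column(values, salt: bytes):
    """Hash an entire join column before it is loaded into data-staging."""
    return [hash_join_key(v, salt) for v in values]
```

A keyed hash (rather than a bare SHA-256) prevents a participant from brute-forcing small identifier spaces such as phone numbers without knowing the salt.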
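The **Aggregation** check above can be illustrated in Python. This naive regex scan of function-call sites stands in for a real SQL parse (the proposal uses JSqlParser on the Java side), and the whitelist contents are illustrative assumptions:

```python
import re

# Aggregates the DCR policy permits (illustrative whitelist).
ALLOWED_AGGREGATES = {"SUM", "COUNT", "AVG"}
# Aggregates we recognise at all; anything here but not whitelisted is rejected.
KNOWN_AGGREGATES = ALLOWED_AGGREGATES | {"MIN", "MAX", "STDDEV", "VARIANCE"}
# Naive scan for `name(` call sites, standing in for a proper parser.
_FUNC_CALL = re.compile(r"\b([A-Za-z_]+)\s*\(")

def query_is_compliant(sql: str) -> bool:
    """Reject the query if it calls a known aggregate outside the whitelist."""
    for name in _FUNC_CALL.findall(sql):
        upper = name.upper()
        if upper in KNOWN_AGGREGATES and upper not in ALLOWED_AGGREGATES:
            return False
    return True
```

A production check should walk a real parse tree instead, since a regex cannot handle quoted identifiers, comments, or string literals that happen to contain `MIN(`.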
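The differential-privacy step under **User SQL Query** could use the classic Laplace mechanism, sketched below. `sensitivity` and `epsilon` are policy parameters the DCR would configure; the concrete values are illustrative:

```python
import math
import random

def add_laplace_noise(true_value: float, sensitivity: float, epsilon: float,
                      rng: random.Random = random) -> float:
    """Add Laplace(0, sensitivity/epsilon) noise to an aggregate result,
    sampled via the inverse CDF of the Laplace distribution."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                      # uniform on (-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```

Applying the noise in the Spring Boot backend (after the database returns the exact aggregate) keeps the SQL layer unchanged, at the cost of having to trust the backend with the exact value.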
### Rollout/Adoption Plan
_No response_
### Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
GitHub link: https://github.com/apache/cloudberry/discussions/1270