Re: [D] [Proposal] Data Clean Room Platform Prototype Based On Cloudberry [cloudberry]

via GitHub Wed, 30 Jul 2025 17:16:10 -0700


GitHub user houkawa2 edited a discussion: [Proposal] Data Clean Room Platform 
Prototype Based On Cloudberry


### Proposers

houkawa2

### Proposal Status

Under Discussion

### Abstract

This proposal presents the construction of a Data Clean Room (DCR) platform 
based on Cloudberry. The platform adopts Spring Boot as its core framework and 
integrates Kafka with Python-based Multi-Party Computation (MPC) modules to 
achieve high-concurrency and privacy-preserving collaborative data processing.

Each participant can upload their own raw data and define fine-grained access 
control policies — such as which columns can be shared, row-level filters, and 
join keys. These policies are applied to the user-generated encrypted views. 
The views are then imported into the DCR database, where data from different 
users is strictly isolated and inaccessible to others.

During collaborative analysis, the system uses MPC to generate a cross-party 
matching table, and then leverages the SQL engine to execute queries under the 
constraints of the privacy policies. Before results are returned, differential 
privacy noise is added automatically to prevent sensitive information from 
being inferred.

To ensure data security, the platform implements a fine-grained role-based 
access control mechanism: each user has a private database role for managing 
their own data (invisible to others), and a dedicated staging role to upload 
encrypted views to the DCR. Analytical and matching operations within the DCR 
are executed by backend services using specific analysis and matching roles. 
The final query results are stored under a result role dedicated to each user, 
ensuring that only the data owner can access them.

By combining a layered database schema with strict role control, the platform 
achieves its core goal: enabling joint data analysis without exposing raw data 
between parties — thereby meeting the demands of data collaboration with strong 
privacy protection, policy transparency, and regulatory compliance.

### Motivation

In today's data-driven landscape, enterprises frequently need to collaborate on 
sensitive datasets to unlock mutual value. As a Massively Parallel Processing 
(MPP) database, its robust SQL engine and scalability is suitable for 
supporting a DCR platform. 

#### Operating Mode
DCRs on platforms like Snowflake and AWS enable secure multi-party data 
collaboration by leveraging cloud-based distributed storage and compute 
resources. Typically following a **provider-consumer model**, one party creates 
the DCR and invites others to join. Each participant contributes data alongside 
strict privacy rules (e.g., allowed join keys, filters, differential privacy). 
Users can then submit SQL queries, which are executed against protected, 
policy-enforced views. Results are returned only to authorized users, ensuring 
collaborative analysis without exposing raw data.

### Implementation
## Framework
**Spring Boot Logic Layer** -- Responsible for API entry, authentication and 
authorization, request validation, notifications, and DCR backend logic 
implementation. Spring Cloud Config Server or Vault will be used for managing 
role login passwords.

**Python MPC Logic Layer** -- Responsible for implementation of higher-order 
MPC functionalities in tables join.

**Kafka Message Queue** -- Responsible for secure communication between Spring 
Boot and Python MPC services, and providing better concurrency performance.

**Cloudberry Data Layer** -- Responsible for data storage, database role 
authorization, row and column-level access control, aggregation analysis, join, 
and other DCR analysis functionalities.
### Cloudberry Data Layer
The data layer primarily consists of two parts: the user's own database and the 
created DCR database.

**User's Own Database:** Used for storing uploaded raw data. Initial row and 
column-level access restrictions, as well as hashing of sensitive join columns 
and IDs, are also performed within the user's own database. Only encrypted 
views are transferred from the user's database to the DCR database.

**DCR Database:** The DCR database has four layers (schemas): 
data-staging-schema, data-matching-schema, data-analysis-schema, 
data-result-schema.

-- **data-staging-schema:** Used for storing views input from user databases. 
In this layer, different users' views are logically isolated, preventing users 
from seeing each other's data views.

-- **data-matching-schema:** Used for storing the unified cross-reference-table 
for DCR participants, output from the MPC layer.

-- **data-analysis-schema:** Used for table joins and backend view generation 
when users execute SQL queries in the frontend.

-- **data-result-schema:** Used for storing users' SQL query results.

### Spring Boot Logic Layer
#### Role Summary
**super_admin_role:** This is the highest management role in the system, used 
to manage all users' data and the DCR database. It possesses SELECT, INSERT, 
UPDATE, DELETE, CREATE, DROP, and other permissions on all databases.

**User's Personal DB Management Related Roles:**

**user_{userId}_data_extractor_role:** This is the individual user's management 
role. The user has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE, DROP TABLE 
permissions for the databases and tables they create. No one can view or modify 
personal data without authorization.

**DCR DB Related Roles:**

**dcr_admin_role:** This is the DCR database management role, managed by the 
Spring Boot backend. It has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE 
permissions for all schemas within the DCR database.

**dcr_data_staging_{userId}_role:** This role is used by individual users to 
upload views into the DCR database. Each user has their own specific role, and 
users cannot access each other's data. It possesses INSERT and SELECT 
permissions for their own data tables within the `data-staging-schema`.

**dcr_data_matching_role:** Primarily used by the Spring Boot backend to manage 
and maintain the `cross-reference-table` within the `data-matching-schema`. It 
is responsible for receiving the unified `cross-reference-table` data output 
from the MPC layer and ensuring its correct storage and availability for 
subsequent analysis. It has SELECT, INSERT, UPDATE, and DELETE permissions for 
this schema, but cannot CREATE TABLE. Tables should be created by 
`dcr_admin_role` at first.

**dcr_data_analysis_role:** Primarily used by the Spring Boot backend to manage 
and maintain data within the `data-analysis-schema`. When joining tables, it 
needs to fetch data files from the `data-staging-schema` and 
`data-matching-schema` and generate relevant joined views. It also needs to use 
this role to place SQL query results into the `data-result-schema`. Therefore, 
this role holds:

- SELECT permissions on `data-staging-schema` and `data-matching-schema`.
    
- CREATE VIEW, ALTER VIEW, DROP VIEW, SELECT, INSERT, UPDATE, and DELETE 
permissions on `data-analysis-schema`.
    
- CREATE VIEW, ALTER VIEW, DROP VIEW, INSERT, UPDATE, and DELETE permissions on 
`data-result-schema`.
    

**dcr_data_analyst_role:** This is a role specifically used for executing user 
SQL queries, used by the Spring Boot backend. It has SELECT permissions for 
data within the `data-analysis-schema`.

**dcr_data_result_{userId}_role:** Users utilize this role to access their 
respective results. It has SELECT permission for its own results within the 
`data-result-schema`.

#### Spring Boot Internal Database
The Spring Boot internal database is used to store associated data required for 
frontend display.
<img width="1451" height="1239" alt="image" 
src="https://github.com/user-attachments/assets/433647ec-a5f8-41d7-8289-7f9f6f15a522";
 />

There is a many-to-many relationship between the `User` entity and the 
`CleanRoom` entity, thus the `UserCleanRoom` entity is introduced as a junction 
table. Within `UserCleanRoom`, the `userStatus` attribute indicates the role of 
that specific `User` within that `CleanRoom` (e.g., `PROVIDER` or `CONSUMER`).

A `User` has their own `Database`, which contains `Table`s, and `Table`s 
contain `Column`s. Since users need to manually select visible `Column`s when 
configuring row and column-level access policies, a separate `Column` entity 
has been established. There is a many-to-many relationship between `Column` and 
`Policy` (a `Column` can be used in multiple `Policy` settings, and a single 
`Policy`'s row/column access configuration can involve multiple `Column`s). 
Therefore, a `PolicyColumn` junction table is created. In `PolicyColumn`, the 
`isAllowed` attribute indicates whether a specific `Column` is permitted to be 
shared and visible within that `Policy`; the `isJoinKey` attribute indicates 
whether this `Column` is designated as a join column within that `Policy`. The 
detailed policy content will be stored as a JSON-formatted `policyRules` 
attribute within the `Policy` instance.

In my design, `Policy`, `View`, and `Notification` are all dependent on the 
existence of a DCR. If a DCR does not exist, then these three entities also do 
not exist. Furthermore, within the same DCR, there can be different `User`s, 
and each `User` can construct different `View`s, `Policy`s, and `Notification`s 
(as the sender of the notification differs, and the DCR joined also differs). 
Therefore, I have connected these three entities to the `UserCleanRoom` entity.

A `View` is constructed based on a `User`'s `Table` and according to the 
`Policy` set by the `User`. `View` has a many-to-one relationship with both 
`Table` and `Policy`.

The `Notification` entity needs to be connected to the `User` entity because it 
requires a `User` to receive the notification; thus, the relationship is 
many-to-one from `Notification` to `User`. The `status` attribute of 
`Notification` represents the state of the notification, such as whether the 
notification has been sent, whether the recipient has agreed to join the DCR, 
and other enumeration states.

## Required Third-Party Libraries/Plugins
- JSqlParser - Python MPC: PSI Protocol- Spring Cloud Config Server or Vault -  
Firebase Cloud Messaging (FCM)
## Function Points
- **User Registration & Login -- setup & login**
- **User Database Creation**
- **User Uploads CSV/JSON Files:** Do we need a temporary storage path 
accessible by the backend server in cloud? Cloudberry will fecth csv/json files 
from this reachable address and load it into Cloudberry by `gpload`.    
- **Create DCR**
- **Join DCR:** When he initiator of DCR invites someone, the FCM will help 
send notification to that user, after accepting the invitation, user can join 
this DCR successfully.
- **Row and Column-Level Access Control:** Use Cloudberry to apply the rules of 
row and column-level access control.
- **Join**
    - Users configure the join columns in the frontend. The backend, based on 
these frontend parameters, will hash and encrypt the join columns' s value from 
the policy and the IDs of the respective tables.
        
    - Subsequently, by switching to the `dcr_data_staging_{userId}_role`, the 
hashed data will be ingested into the `data-staging-schema` of the DCR database 
using the same `gpload` method as for user-uploaded CSV/JSON files.
        
    - Spring Boot will then transfer this data from 
`dcr_data_staging_{userId}_role` to the MPC layer through message queue.
        
    - After the `cross-reference-table` is generated by MPC module, Spring Boot 
will switch to the `dcr_data_matching_role` to store the 
`cross-reference-table` into the `data-matching-schema`.
        
    - Upon receiving the user's query SQL from the frontend, a third-party SQL 
parser plugin will be used to extract the table names involved in the user's 
join query. Spring Boot will then switch to the `dcr_data_analysis_role` and 
execute dynamically generated join SQL to join the target tables via the 
`cross-reference-table` and views in `data-staging-schema`, storing the result 
in the `data-analysis-schema`.
        
    - Next, Spring Boot will switch to the `dcr_data_analyst_role` and perform 
a `SELECT` query on the target view in `data-analysis-schema` based on the 
dynamically generated SQL.
        
    - Finally, Spring Boot will switch to the `dcr_result_{userId}_role` to 
store the generated result in `data-result-schema`.
        
- **Aggregation:**  The  backend logics will check if user's sql query contains 
aggregation functions which is not allowed. Only the compliant sql query can be 
execute by sql engine. 
-  **User SQL Query**
   - The DCR will provide users with pre-defined queryable tables, columns, and 
aggregation rules. Users can then submit custom SQL query requests through the 
DCR frontend based on these established rules.
   
   - `JSqlParser` is used to parse the user's query SQL, verifying whether the 
aggregation functions used in the SQL (such as SUM, COUNT, AVG) are within the 
list permitted by the DCR's policies.
   
   - Once the query validation is successful, the Spring Boot backend will use 
the permissions of the **`dcr_analyst_role`** to execute the query within the 
DCR database. User queries are performed against existing analytical views 
within the `data-analysis-schema`. These views have inherently encapsulated the 
join logic with the `cross-reference-table` and view data from various parties 
(originating from the `data-staging-schema`) . The database then returns the 
aggregated results to the Spring Boot backend. After the query is completed, 
the Spring Boot backend will apply differential privacy (adding noise) to these 
aggregated results to further protect against the inference of data details.
   
   - The Spring Boot backend, by switching to the permissions of the 
**`dcr_data_analysis_role`**, stores these results into the user's dedicated 
result table (or view) within the `data-result-schema`, making them available 
for subsequent on-demand access.

### Rollout/Adoption Plan

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1270

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [D] [Proposal] Data Clean Room Platform Prototype Based On Cloudberry [cloudberry]

Reply via email to