Re: [D] [Proposal] Data Clean Room Platform Prototype Based On Cloudberry [cloudberry]

via GitHub Wed, 30 Jul 2025 16:26:29 -0700


GitHub user houkawa2 edited a discussion: [Proposal] Data Clean Room Platform 
Prototype Based On Cloudberry


### Proposers

houkawa2

### Proposal Status

Under Discussion

### Abstract

Build a data clean room(DCR) platform prototype. I would like to create a **new 
DCR database** where all join and query operations will be performed. Processed 
view data from participating users, along with the `cross-reference-table` 
(generated by the MPC layer), will be stored in different schemas within this 
database. We'll establish various permissioned roles to execute operations, 
ensuring **minimal data access** for each role.

### Motivation

In today's data-driven landscape, enterprises frequently need to collaborate on 
sensitive datasets to unlock mutual value. As a Massively Parallel Processing 
(MPP) database, its robust SQL engine and scalability is suitable for 
supporting a DCR platform. 

## Operating Mode
DCRs built on cloud platforms like Snowflake and AWS leverage the underlying 
cloud service providers' distributed storage (e.g., AWS S3) and distributed 
compute units (e.g., AWS EC2 instances). While Snowflake operates as a data 
platform _running on_ these cloud infrastructures, AWS Clean Rooms is a 
dedicated service _built upon_ them.

Both approaches facilitate similar user interaction models, predominantly the 
PROVIDER-CONSUMER model and a third-party orchestration/governance model (where 
a neutral party coordinates and defines data analysis rules).

The process for using a DCR in the **PROVIDER-CONSUMER model** typically 
involves the following steps:
- **DCR Creation and Invitation:** One party initiates the creation of a DCR, 
then invites other participating parties to join for secure data collaboration 
and analysis.
    
- **Rule Definition and Data Contribution:** All participating users, acting as 
data **PROVIDERs** for their own datasets, are responsible for defining 
stringent privacy-enhancing rules. These rules cover data aggregation 
constraints, allowed join keys, row-level filters, column-level permissions, 
and often include parameters for differential privacy. These rules are applied 
to their **contributed data views** to ensure that raw, sensitive data details 
are not directly accessible to other participants.
    
- **Query Submission and Enforcement:** After all participants have configured 
and submitted their rules (which are typically displayed for transparency on 
the frontend), users (both **PROVIDERs** and **CONSUMERs**) can submit custom 
SQL queries via the frontend interface. The DCR's backend engine strictly 
_enforces_ all defined rules during query parsing and execution against 
privacy-protected, aggregated views of the joined data.

- **Results Retrieval:** Upon completion of the query, all authorized parties 
can retrieve their respective analysis results via the frontend interface. 

### Implementation

# Building Proposal
This proposal is implemented using Cloudberry as the storage database and SQL 
execution engine. Considering Spring Boot's advantages in concurrency for 
enterprise-level platforms, I am using Spring Boot as the main framework, with 
independent Python microservices providing MPC services. Given the 
enterprise-level data volume, Kafka will be used for asynchronous communication 
between Spring Boot and Python.
## Framework
**Spring Boot Logic Layer** -- Responsible for API entry, authentication and 
authorization, request validation, notifications, and DCR backend logic 
implementation. Spring Cloud Config Server or Vault will be used for managing 
role login passwords.

**Python MPC Logic Layer** -- Responsible for implementation of higher-order 
MPC functionalities in tables join.

**Kafka Message Queue** -- Responsible for secure communication between Spring 
Boot and Python MPC services, and providing better concurrency performance.

**Cloudberry Data Layer** -- Responsible for data storage, database role 
authorization, row and column-level access control, aggregation analysis, join, 
and other DCR analysis functionalities.
### Cloudberry Data Layer
The data layer primarily consists of two parts: the user's own database and the 
created DCR database.

**User's Own Database:** Used for storing uploaded raw data. Initial row and 
column-level access restrictions, as well as hashing of sensitive join columns 
and IDs, are also performed within the user's own database. Only encrypted 
views are transferred from the user's database to the DCR database.

**DCR Database:** The DCR database has four layers (schemas): 
data-staging-schema, data-matching-schema, data-analysis-schema, 
data-result-schema.

-- **data-staging-schema:** Used for storing views input from user databases. 
In this layer, different users' views are logically isolated, preventing users 
from seeing each other's data views.

-- **data-matching-schema:** Used for storing the unified cross-reference-table 
for DCR participants, output from the MPC layer.

-- **data-analysis-schema:** Used for table joins and backend view generation 
when users execute SQL queries in the frontend.

-- **data-result-schema:** Used for storing users' SQL query results.


### Spring Boot Logic Layer
This layer acts as a gateway, responsible for API entry, authentication and 
authorization, request validation, notifications, and the overall DCR backend 
logic implementation. Spring Cloud Config Server or Vault will be used to 
manage role login passwords. The specific implementation logic will be detailed 
in the subsequent **Function Points** section.

#### Role Summary
In the Spring Boot Logic layer, database file permissions are primarily managed 
by Cloudberry's Role functionality, with data isolation achieved through 
logical segregation. Below is a description of the roles required to implement 
this DCR system and their functionalities.

**super_admin_role:** This is the highest management role in the system, used 
to manage all users' data and the DCR database. It possesses SELECT, INSERT, 
UPDATE, DELETE, CREATE, DROP, and other permissions on all databases.

**User's Personal DB Management Related Roles:**

**user_{userId}_data_extractor_role:** This is the individual user's management 
role. The user has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE, DROP TABLE 
permissions for the databases and tables they create. No one can view or modify 
personal data without authorization.

**DCR DB Related Roles:**

**dcr_admin_role:** This is the DCR database management role, managed by the 
Spring Boot backend. It has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE 
permissions for all schemas within the DCR database.

**dcr_data_staging_{userId}_role:** This role is used by individual users to 
upload views into the DCR database. Each user has their own specific role, and 
users cannot access each other's data. It possesses INSERT and SELECT 
permissions for their own data tables within the `data-staging-schema`.

**dcr_data_matching_role:** Primarily used by the Spring Boot backend to manage 
and maintain the `cross-reference-table` within the `data-matching-schema`. It 
is responsible for receiving the unified `cross-reference-table` data output 
from the MPC layer and ensuring its correct storage and availability for 
subsequent analysis. It has SELECT, INSERT, UPDATE, and DELETE permissions for 
this schema, but cannot CREATE TABLE. Tables should be created by 
`dcr_admin_role` at first.

**dcr_data_analysis_role:** Primarily used by the Spring Boot backend to manage 
and maintain data within the `data-analysis-schema`. When joining tables, it 
needs to fetch data files from the `data-staging-schema` and 
`data-matching-schema` and generate relevant joined views. It also needs to use 
this role to place SQL query results into the `data-result-schema`. Therefore, 
this role holds:

- SELECT permissions on `data-staging-schema` and `data-matching-schema`.
    
- CREATE VIEW, ALTER VIEW, DROP VIEW, SELECT, INSERT, UPDATE, and DELETE 
permissions on `data-analysis-schema`.
    
- CREATE VIEW, ALTER VIEW, DROP VIEW, INSERT, UPDATE, and DELETE permissions on 
`data-result-schema`.
    

**dcr_data_analyst_role:** This is a role specifically used for executing user 
SQL queries, used by the Spring Boot backend. It has SELECT permissions for 
data within the `data-analysis-schema`.

**dcr_data_result_{userId}_role:** Users utilize this role to access their 
respective results. It has SELECT permission for its own results within the 
`data-result-schema`.

#### Spring Boot Internal Database
The Spring Boot internal database is used to store associated data required for 
frontend display.
<img width="1451" height="1239" alt="image" 
src="https://github.com/user-attachments/assets/433647ec-a5f8-41d7-8289-7f9f6f15a522";
 />

There is a many-to-many relationship between the `User` entity and the 
`CleanRoom` entity, thus the `UserCleanRoom` entity is introduced as a junction 
table. Within `UserCleanRoom`, the `userStatus` attribute indicates the role of 
that specific `User` within that `CleanRoom` (e.g., `PROVIDER` or `CONSUMER`).

A `User` has their own `Database`, which contains `Table`s, and `Table`s 
contain `Column`s. Since users need to manually select visible `Column`s when 
configuring row and column-level access policies, a separate `Column` entity 
has been established. There is a many-to-many relationship between `Column` and 
`Policy` (a `Column` can be used in multiple `Policy` settings, and a single 
`Policy`'s row/column access configuration can involve multiple `Column`s). 
Therefore, a `PolicyColumn` junction table is created. In `PolicyColumn`, the 
`isAllowed` attribute indicates whether a specific `Column` is permitted to be 
shared and visible within that `Policy`; the `isJoinKey` attribute indicates 
whether this `Column` is designated as a join column within that `Policy`. The 
detailed policy content will be stored as a JSON-formatted `policyRules` 
attribute within the `Policy` instance.

In my design, `Policy`, `View`, and `Notification` are all dependent on the 
existence of a DCR. If a DCR does not exist, then these three entities also do 
not exist. Furthermore, within the same DCR, there can be different `User`s, 
and each `User` can construct different `View`s, `Policy`s, and `Notification`s 
(as the sender of the notification differs, and the DCR joined also differs). 
Therefore, I have connected these three entities to the `UserCleanRoom` entity.

A `View` is constructed based on a `User`'s `Table` and according to the 
`Policy` set by the `User`. `View` has a many-to-one relationship with both 
`Table` and `Policy`.

The `Notification` entity needs to be connected to the `User` entity because it 
requires a `User` to receive the notification; thus, the relationship is 
many-to-one from `Notification` to `User`. The `status` attribute of 
`Notification` represents the state of the notification, such as whether the 
notification has been sent, whether the recipient has agreed to join the DCR, 
and other enumeration states.

## Required Third-Party Libraries/Plugins
- JSqlParser
    
- Python MPC: PSI Protocol
    
- Spring Cloud Config Server or Vault
    
- Firebase Cloud Messaging (FCM)
## Function Points
- **User Registration & Login -- setup & login**
    
    - **Setup:** Users register through the frontend by entering a username, 
login email, company name, and password. The email address must be unique. 
After all user inputs are logically valid and submitted, the backend logic will 
dynamically generate SQL to create the user in Cloudberry and simultaneously 
create a `User` table. At the same time, a dedicated role for this user will be 
generated and bound to user in Cloudberry. The management of passwords for all 
subsequent active roles will be handled by Spring Cloud Config Server or Vault.
        
    - **Login:** The backend retrieves the email and password entered by the 
user in the frontend and performs backend login verification. Upon successful 
login, the backend generates a secure JWT session token for subsequent API 
request authentication and authorization.
        
- **User Database Creation**
    
    - When a frontend user initiates a request to create a database, the 
backend receives the relevant parameters. The `super_admin_role` will 
dynamically generate an independent database (db) within Cloudberry for this 
user. Full control permissions (SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, 
etc.) for this database will be GRANTED to the user's dedicated role. From this 
point, the user can securely manage and operate their own raw data.
        
- **User Uploads CSV/JSON Files**
    
    - Users upload CSV or JSON formatted data files through the frontend 
interface. The files will first be securely uploaded to a temporary storage 
path accessible by the backend server (e.g., a local Linux virtual machine 
address or cloud based storing address). The backend will then dynamically 
generate `CREATE TABLE` SQL and use the current user's dedicated role to create 
the target table under their database.
        
    - Subsequently, by dynamically generating `gpload` YAML content, the 
uploaded files will be loaded into the target table within the target database 
using `gpload`.
        
- **Create DCR**
    
    - When the current user clicks the "create data clean room" button in the 
frontend, the backend will correspondingly create the relevant DCR database, 
its schemas, and roles within Cloudberry. These roles will then be assigned to 
the relevant users and the backend application.
        
    - Furthermore, the backend will create `CleanRoom` and `UserCleanRoom` 
entities, setting the `userStatus` for the latter to `PROVIDER`.
- **Send Invitation Notification** 
        - After a user creates and configures a DCR, they can invite other 
users to join for collaborative data analysis. When the frontend user selects 
and invites another user, the invited user's parameters and the associated 
`UserCleanRoom` attributes details will be transmitted to the backend. Firebase 
Cloud Messaging (FCM) will then send a notification based on these parameters. 
Simultaneously, to ensure that the notification is received even if the user is 
offline, the notification will be saved as a `Notification` entity in the 
database with a "unreceived" (or "pending") status.
    
- **Join DCR** 
        - When the DCR creator adds another user as a DCR participant via the 
frontend, the backend receives the frontend parameters and creates a new 
`UserCleanRoom` object. And the `userState` for the new `UserCleanRoom` entity 
will be set as `CONSUMER`. Before entering the DCR, there will be backend logic 
for access review: if the user is not present in the `User` list of that 
`CleanRoom` entity, entry will be denied.
    
- **Row and Column-Level Access Control** 
        - Users select the tables they wish to share in the frontend, then 
specify the columns allowed for querying and define row-level filter 
conditions. These parameters are transmitted to the backend for dynamic SQL 
generation, which Cloudberry's SQL engine then executes to create a view. 
Simultaneously, a new `View` object representing this view is created and 
stored in its owner's database.
    
- **Join**
    
    - Users configure the join columns in the frontend. The backend, based on 
these frontend parameters, will hash and encrypt the join columns' s value from 
the policy and the IDs of the respective tables.
        
    - Subsequently, by switching to the `dcr_data_staging_{userId}_role`, the 
hashed data will be ingested into the `data-staging-schema` of the DCR database 
using the same `gpload` method as for user-uploaded CSV/JSON files.
        
    - Spring Boot will then transfer this data from 
`dcr_data_staging_{userId}_role` to the MPC layer through message queue.
        
    - After the `cross-reference-table` is generated by MPC module, Spring Boot 
will switch to the `dcr_data_matching_role` to store the 
`cross-reference-table` into the `data-matching-schema`.
        
    - Upon receiving the user's query SQL from the frontend, a third-party SQL 
parser plugin will be used to extract the table names involved in the user's 
join query. Spring Boot will then switch to the `dcr_data_analysis_role` and 
execute dynamically generated join SQL to join the target tables via the 
`cross-reference-table` and views in `data-staging-schema`, storing the result 
in the `data-analysis-schema`.
        
    - Next, Spring Boot will switch to the `dcr_data_analyst_role` and perform 
a `SELECT` query on the target view in `data-analysis-schema` based on the 
dynamically generated SQL.
        
    - Finally, Spring Boot will switch to the `dcr_result_{userId}_role` to 
store the generated result in `data-result-schema`.
        
- **Aggregation** 
        - The backend first retrieves the SQL query submitted by the user and, 
using `JSqlParser`, parses it to extract all aggregation functions used. The 
system will strictly validate whether all aggregation functions used by the 
user are within the allowed aggregation function whitelist defined in the 
`Policy` entity's `policyRules` (JSON) (e.g., `SUM`, `COUNT`, `AVG`, `MIN`, 
`MAX`, `COUNT(DISTINCT)`). Only queries that pass this validation will be 
executed.

### Rollout/Adoption Plan

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1270

----
This is an automatically sent email for dev@cloudberry.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@cloudberry.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: dev-h...@cloudberry.apache.org

Re: [D] [Proposal] Data Clean Room Platform Prototype Based On Cloudberry [cloudberry]

Reply via email to