GitHub user houkawa2 created a discussion: [Proposal] Data Clean Room Platform Prototype Based On Cloudberry
### Proposers

houkawa2

### Proposal Status

Under Discussion

### Abstract

Build a data clean room (DCR) platform prototype. I would like to create a **new DCR database** where all join and query operations will be performed. Processed view data from participating users, along with the `cross-reference-table` (generated by the MPC layer), will be stored in different schemas within this database. We'll establish various permissioned roles to execute operations, ensuring **minimal data access** for each role.

### Motivation

In today's data-driven landscape, enterprises frequently need to collaborate on sensitive datasets to unlock mutual value. As a Massively Parallel Processing (MPP) database, Cloudberry has a robust SQL engine and the scalability needed to support a DCR platform.

# Data Clean Room (DCR) Overview

A DCR is a privacy-preserving space where multiple enterprises can selectively share their data. By analyzing this shared data and building models on it, all parties can achieve more precise ad targeting and other mutually beneficial outcomes.

# Precedent Analysis: Snowflake & AWS

## Operating Mode

DCRs built on cloud platforms like Snowflake and AWS leverage the underlying cloud service providers' distributed storage (e.g., AWS S3) and distributed compute units (e.g., AWS EC2 instances). While Snowflake operates as a data platform _running on_ these cloud infrastructures, AWS Clean Rooms is a dedicated service _built upon_ them. Both approaches facilitate similar user interaction models, predominantly the PROVIDER-CONSUMER model and a third-party orchestration/governance model (where a neutral party coordinates and defines data analysis rules).

The process for using a DCR in the **PROVIDER-CONSUMER model** typically involves the following steps:

- **DCR Creation and Invitation:** One party initiates the creation of a DCR, then invites other participating parties to join for secure data collaboration and analysis.
- **Rule Definition and Data Contribution:** All participating users, acting as data **PROVIDERs** for their own datasets, are responsible for defining stringent privacy-enhancing rules. These rules cover data aggregation constraints, allowed join keys, row-level filters, column-level permissions, and often include parameters for differential privacy. They are applied to each party's **contributed data views** to ensure that raw, sensitive data details are not directly accessible to other participants.
- **Query Submission and Enforcement:** After all participants have configured and submitted their rules (which are typically displayed for transparency on the frontend), users (both **PROVIDERs** and **CONSUMERs**) can submit custom SQL queries via the frontend interface. The DCR's backend engine strictly _enforces_ all defined rules during query parsing and execution against privacy-protected, aggregated views of the joined data.
- **Results Retrieval:** Upon completion of the query, all authorized parties can retrieve their respective analysis results via the frontend interface.

## Feasibility Analysis

I've outlined three potential implementation approaches:

### 1. Centralized DCR Database

This approach involves creating a **new DCR database** where all join and query operations will be performed. Processed view data from participating users, along with the `cross-reference-table` (generated by the MPC layer), will be stored in different schemas within this database. We'll establish various permissioned roles to execute operations, ensuring **minimal data access** for each role.
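For illustration, a minimal sketch of this layout, assuming PostgreSQL-style DDL (Cloudberry descends from Greenplum/PostgreSQL). The schema names are underscored SQL-identifier forms of the four schemas detailed later under the Cloudberry Data Layer, and the permissioned roles are sketched under Role Summary below:

```sql
-- One central DCR database; each concern lives in its own schema.
CREATE DATABASE dcr_db;
-- (connect to dcr_db, then:)
CREATE SCHEMA data_staging;   -- processed views contributed by each participant
CREATE SCHEMA data_matching;  -- cross-reference-table produced by the MPC layer
CREATE SCHEMA data_analysis;  -- joined views that user queries run against
CREATE SCHEMA data_result;    -- per-user query results
```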
### 2. Cloudberry Federated Query + Secure View + Cross-Reference-Table

In this option, each participant creates a **Secure View** using row and column-level permissions and by hashing sensitive data. The consumer then uses a **federated query** to join the `cross-reference-table` (from the MPC layer) with the Secure View and perform queries. However, a significant drawback here is that during the federated query process, **un-noised, temporary results stored in system memory can be accessed by the consumer**, which compromises data privacy.

### 3. Custom Secure Share Platform with Spring Boot (referencing Snowflake's Secure Share)

This approach, inspired by **Snowflake's Secure Share** (as detailed in US Patent 11768953B2), uses **Spring Boot as a central coordination service** to implement the logic and management functions. Spring Boot would manage its own metadata store (e.g., a separate database) to register all shared entity information. This includes:

- Logical references to shared data objects (pointing to Secure Views in Cloudberry).
- Row-level filtering expressions, column whitelists, and anonymization/transformation functions.
- Mappings to consumer accounts or roles.

When a consumer submits a query, it first goes to the Spring Boot service. Spring Boot parses the original SQL into an Abstract Syntax Tree (AST) and, based on the stored shared metadata, automatically injects privacy and access control logic into the query. This includes **column and row-level filtering conditions** (e.g., `WHERE customer_id IN (SELECT match_uuid FROM cross_reference_table)`) and **encryption functions** (e.g., `SELECT HASH(email) ...`). The processed SQL is then sent to a **central DCR Cloudberry database** for execution. This central DCR database queries the pre-processed views and the `cross-reference-table` stored within it. Finally, before returning the results, Spring Boot **uniformly applies differential privacy policies** and ensures all access operations, privacy rule applications, and execution results are meticulously recorded in a **centralized audit log**.

However, Cloudberry is a traditional database with **tightly coupled compute and storage**, lacking Snowflake's decoupled architecture and unified global metadata service. Since all of Snowflake's data objects reside within a single metadata service, queries don't need to traverse independent compute instances. In contrast, using Cloudberry as the underlying implementation for Secure Share necessitates **data movement**, leading to data copying and network overhead. Therefore, this third option is not the most suitable solution.

**Consequently, I've decided on the first approach as the final implementation framework.**

### Implementation

# Building Proposal

This proposal uses Cloudberry as the storage database and SQL execution engine. Considering Spring Boot's advantages in concurrency for enterprise-level platforms, I am using Spring Boot as the main framework, with independent Python microservices providing MPC services. Given the enterprise-level data volume, Kafka will be used for asynchronous communication between Spring Boot and Python.

## Framework

**Spring Boot Logic Layer** -- Responsible for API entry, authentication and authorization, request validation, notifications, and DCR backend logic implementation. Spring Cloud Config Server or Vault will be used for managing role login passwords.

**Python MPC Logic Layer** -- Responsible for implementing higher-order MPC functionality for table joins.

**Kafka Message Queue** -- Responsible for secure communication between Spring Boot and the Python MPC services, and for providing better concurrency performance.

**Cloudberry Data Layer** -- Responsible for data storage, database role authorization, row and column-level access control, aggregation analysis, joins, and other DCR analysis functionality.
### Cloudberry Data Layer

The data layer primarily consists of two parts: the user's own database and the created DCR database.

**User's Own Database:** Used for storing uploaded raw data. Initial row and column-level access restrictions, as well as hashing of sensitive join columns and IDs, are also performed within the user's own database. Only encrypted views are transferred from the user's database to the DCR database.

**DCR Database:** The DCR database has four layers (schemas): data-staging-schema, data-matching-schema, data-analysis-schema, and data-result-schema.

-- **data-staging-schema:** Used for storing views input from user databases. In this layer, different users' views are logically isolated, preventing users from seeing each other's data views.

-- **data-matching-schema:** Used for storing the unified cross-reference-table for DCR participants, output from the MPC layer.

-- **data-analysis-schema:** Used for table joins and backend view generation when users execute SQL queries in the frontend.

-- **data-result-schema:** Used for storing users' SQL query results.
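As a sketch of the per-user preprocessing described above (table and view names are illustrative; `md5` stands in for whatever hash function is ultimately chosen, and a salted or keyed hash would be preferable in practice):

```sql
-- Runs in the user's own database: restrict columns, filter rows,
-- and hash the sensitive join column before anything leaves the database.
CREATE VIEW shared_purchases_view AS
SELECT
    md5(email_address) AS hashed_email,  -- hashed join key
    purchase_value,                      -- whitelisted columns only
    purchase_date
FROM purchases
WHERE purchase_date >= DATE '2024-01-01';  -- example row-level filter
```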
### Python MPC Logic Layer

This layer leverages the Private Set Intersection (PSI) protocol of MPC to generate the `cross-reference-table`. MPC can operate on the hashed column values of the views passed to it. The PSI protocol matches based on the hash values of the join columns and, for unmatched values, generates forged (fake) hash values, ultimately producing the `cross-reference-table`.

### Spring Boot Logic Layer

This layer acts as a gateway, responsible for API entry, authentication and authorization, request validation, notifications, and the overall DCR backend logic implementation. Spring Cloud Config Server or Vault will be used to manage role login passwords. The specific implementation logic will be detailed in the subsequent **Function Points** section.

#### Role Summary

In the Spring Boot Logic Layer, database object permissions are primarily managed through Cloudberry's role functionality, with data isolation achieved through logical segregation. Below is a description of the roles required to implement this DCR system and their functionality.

**super_admin_role:** This is the highest management role in the system, used to manage all users' data and the DCR database. It possesses SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, and other permissions on all databases.

**User's Personal DB Management Related Roles:**

**user_{userId}_data_extractor_role:** This is the individual user's management role. The user has SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, and DROP TABLE permissions for the databases and tables they create. No one can view or modify personal data without authorization.

**DCR DB Related Roles:**

**dcr_admin_role:** This is the DCR database management role, managed by the Spring Boot backend. It has SELECT, INSERT, UPDATE, DELETE, and CREATE TABLE permissions for all schemas within the DCR database.

**dcr_data_staging_{userId}_role:** This role is used by individual users to upload views into the DCR database. Each user has their own specific role, and users cannot access each other's data. It possesses INSERT and SELECT permissions on their own data tables within the `data-staging-schema`.

**dcr_data_matching_role:** Primarily used by the Spring Boot backend to manage and maintain the `cross-reference-table` within the `data-matching-schema`. It is responsible for receiving the unified `cross-reference-table` data output from the MPC layer and ensuring its correct storage and availability for subsequent analysis. It has SELECT, INSERT, UPDATE, and DELETE permissions for this schema, but cannot CREATE TABLE; tables must be created beforehand by `dcr_admin_role`.

**dcr_data_analysis_role:** Primarily used by the Spring Boot backend to manage and maintain data within the `data-analysis-schema`. When joining tables, it needs to fetch data from the `data-staging-schema` and `data-matching-schema` and generate the relevant joined views. It also needs to use this role to place SQL query results into the `data-result-schema`. Therefore, this role holds:

- SELECT permissions on `data-staging-schema` and `data-matching-schema`.
- CREATE VIEW, ALTER VIEW, DROP VIEW, SELECT, INSERT, UPDATE, and DELETE permissions on `data-analysis-schema`.
- CREATE VIEW, ALTER VIEW, DROP VIEW, INSERT, UPDATE, and DELETE permissions on `data-result-schema`.

**dcr_data_analyst_role:** This is a role specifically used for executing user SQL queries, used by the Spring Boot backend. It has SELECT permissions for data within the `data-analysis-schema`.

**dcr_data_result_{userId}_role:** Users utilize this role to access their respective results. It has SELECT permission on their own results within the `data-result-schema`.
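A sketch of how a few representative grants could look, assuming PostgreSQL-style role semantics (which Cloudberry inherits from Greenplum); `{userId}` is expanded by the backend (42 here), schema names use underscored identifiers, and the staged object names are illustrative:

```sql
-- Per-user staging role: can only touch its own staged data.
CREATE ROLE dcr_data_staging_42_role LOGIN;
GRANT USAGE ON SCHEMA data_staging TO dcr_data_staging_42_role;
GRANT SELECT, INSERT ON data_staging.user_42_staged_view TO dcr_data_staging_42_role;

-- Matching role: full DML on the cross-reference-table, but no DDL;
-- the table itself is created up front by dcr_admin_role.
CREATE ROLE dcr_data_matching_role LOGIN;
GRANT USAGE ON SCHEMA data_matching TO dcr_data_matching_role;
GRANT SELECT, INSERT, UPDATE, DELETE
    ON data_matching.cross_reference_table TO dcr_data_matching_role;

-- Analyst role: read-only access to the analysis views.
CREATE ROLE dcr_data_analyst_role LOGIN;
GRANT USAGE ON SCHEMA data_analysis TO dcr_data_analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA data_analysis TO dcr_data_analyst_role;
```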
#### Spring Boot Internal Database

The Spring Boot internal database is used to store associated data required for frontend display.

<img width="1451" height="1239" alt="image" src="https://github.com/user-attachments/assets/433647ec-a5f8-41d7-8289-7f9f6f15a522" />

There is a many-to-many relationship between the `User` entity and the `CleanRoom` entity, thus the `UserCleanRoom` entity is introduced as a junction table. Within `UserCleanRoom`, the `userStatus` attribute indicates the role of that specific `User` within that `CleanRoom` (e.g., `PROVIDER` or `CONSUMER`).

A `User` has their own `Database`, which contains `Table`s, and `Table`s contain `Column`s. Since users need to manually select visible `Column`s when configuring row and column-level access policies, a separate `Column` entity has been established. There is a many-to-many relationship between `Column` and `Policy` (a `Column` can be used in multiple `Policy` settings, and a single `Policy`'s row/column access configuration can involve multiple `Column`s). Therefore, a `PolicyColumn` junction table is created. In `PolicyColumn`, the `isAllowed` attribute indicates whether a specific `Column` is permitted to be shared and visible within that `Policy`; the `isJoinKey` attribute indicates whether this `Column` is designated as a join column within that `Policy`. The detailed policy content will be stored as a JSON-formatted `policyRules` attribute within the `Policy` instance.

In my design, `Policy`, `View`, and `Notification` all depend on the existence of a DCR: if a DCR does not exist, these three entities do not exist either. Furthermore, within the same DCR there can be different `User`s, and each `User` can construct different `View`s, `Policy`s, and `Notification`s (as the sender of the notification differs, and the DCR joined also differs). Therefore, I have connected these three entities to the `UserCleanRoom` entity.

A `View` is constructed from a `User`'s `Table` according to the `Policy` set by that `User`. `View` has a many-to-one relationship with both `Table` and `Policy`. The `Notification` entity needs to be connected to the `User` entity because a `User` must receive the notification; thus, the relationship from `Notification` to `User` is many-to-one. The `status` attribute of `Notification` represents the state of the notification, such as whether it has been sent, whether the recipient has agreed to join the DCR, and other enumerated states.

## DCR Backend Logic Description

### Join

During the DCR creation process, users must specify the columns for joining. I have chosen to implement table-to-table joins for users by constructing a `cross-reference-table`.

1. The user's view, which has been processed for row and column-level restrictions and has had its sensitive columns hashed, is ingested into the `data-staging-schema` of the DCR database. Spring Boot then transmits the processed views in the `data-staging-schema` to the Python MPC layer for matching computations.
2. The MPC layer operates on the hashed join columns, attempting to match them with the join columns of other users participating in the DCR. If a participating party's data does not find a corresponding match during this process, the MPC module generates a forged (fake) hash value for it. This fake value participates in subsequent computations, preventing other parties from inferring which records in a dataset did not match based on the quantity or pattern of matching results, thereby avoiding information leakage and data inference.
3. Upon completion of the matching by the MPC module, the participating parties receive a unified `cross-reference-table`. This table contains an anonymous and unique `match_uuid` (serving as the identifier for each row), as well as the respective hashed join column values for all participating parties. For example, in a two-party match, it might include `match_uuid`, `party_A_hashed_join_key`, and `party_B_hashed_join_key`. This design ensures anonymous association of data within the platform and does not contain any identifiable original user IDs.
4. After the join is completed, the `cross-reference-table` is ingested into the `data-matching-schema` of the DCR database, facilitating subsequent DCR operations. A sketch of the resulting table, and of the analysis view built on top of it, is shown below.

P.S. When column names differ, standardized column name mappings will be provided to normalize the join.
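A hedged sketch of the two-party shape described in step 3, plus the kind of joined view the `data-analysis-schema` would hold (all table, view, and column names are illustrative):

```sql
-- Created up front by dcr_admin_role; populated by dcr_data_matching_role.
CREATE TABLE data_matching.cross_reference_table (
    match_uuid               uuid PRIMARY KEY,
    party_a_hashed_join_key  text NOT NULL,
    party_b_hashed_join_key  text NOT NULL
);

-- Built by dcr_data_analysis_role: encapsulates the join between the two
-- parties' staged views through the cross-reference-table, so analyst
-- queries never touch the staged data directly.
CREATE VIEW data_analysis.joined_view AS
SELECT crt.match_uuid, a.purchase_value, a.purchase_date, b.ad_campaign_id
FROM data_matching.cross_reference_table crt
JOIN data_staging.party_a_view a ON crt.party_a_hashed_join_key = a.hashed_email
JOIN data_staging.party_b_view b ON crt.party_b_hashed_join_key = b.hashed_email;
```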
### Aggregation & Row and Column-Level Access Control

These functionalities are driven by frontend user input, with Spring Boot dynamically generating the corresponding SQL statements, which are then executed by Cloudberry's SQL engine.

### User SQL Query

- The DCR will provide users with pre-defined queryable tables, columns, and aggregation rules. Users can then submit custom SQL query requests through the DCR frontend based on these established rules.
- `JSqlParser` is used to parse the user's query SQL, verifying whether the aggregation functions used in the SQL (such as SUM, COUNT, AVG) are within the list permitted by the DCR's policies.
- Once the query validation is successful, the Spring Boot backend will use the permissions of the **`dcr_data_analyst_role`** to execute the query within the DCR database. User queries are performed against existing analytical views within the `data-analysis-schema`. These views inherently encapsulate the join logic with the `cross-reference-table` and the view data from the various parties (originating from the `data-staging-schema`). The database then returns the aggregated results to the Spring Boot backend. After the query is completed, the Spring Boot backend will apply differential privacy (adding noise) to these aggregated results to further protect against the inference of data details, as sketched after this list.
- The Spring Boot backend, by switching to the permissions of the **`dcr_data_analysis_role`**, stores these results into the user's dedicated result table (or view) within the `data-result-schema`, making them available for subsequent on-demand access.
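In this proposal the noise is added in the Spring Boot layer; purely to illustrate the mechanism, here is the classic Laplace perturbation expressed in SQL over a hypothetical result table (`campaign_counts`, and the sensitivity/epsilon values, are placeholders, not tuned recommendations):

```sql
-- Laplace mechanism sketch: noisy = value + Lap(sensitivity / epsilon).
-- A Laplace sample is a random sign times an Exp(1) magnitude, scaled by b.
WITH params AS (
    SELECT 1.0 AS sensitivity, 0.5 AS epsilon   -- placeholder values
)
SELECT
    r.ad_campaign_id,
    r.unique_matched_customers
      + (p.sensitivity / p.epsilon)
        * sign(random() - 0.5)        -- random sign (+1 or -1)
        * (- ln(1 - random()))        -- Exp(1)-distributed magnitude
      AS noisy_unique_matched_customers
FROM data_result.campaign_counts r
CROSS JOIN params p;
```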
### Case

A large retail chain (Party A: e.g., "Retail Giant") wants to understand how many of their existing customers have also seen an advertising company's (Party B: e.g., "Digital Marketing") latest ad campaign, and what the average purchase value of these common customers is at the Retail Giant's stores.

1. **Participating Parties and Their Data**

   **Party A: Retail Giant**
   Raw Data: `customer_id` (internal ID), `email_address` (customer email), `purchase_value` (purchase amount), `purchase_date` (purchase date).
   Data provided to DCR: Hashed `email_address`, `purchase_value`, and `purchase_date`.

   **Party B: Digital Marketing**
   Raw Data: `user_id` (internal ID), `email_address` (user email), `ad_campaign_id` (ad campaign ID), `ad_impression_timestamp` (ad impression timestamp).
   Data provided to DCR: Hashed `email_address`, `ad_campaign_id`, and `ad_impression_timestamp`.

2. **Generating the Cross-Reference Table**

   Both parties perform a join via `hashed_email`. If a hashed email from Party A is not found in Party B's data, or vice versa, the MPC module will generate a forged (fake) hash value. This ensures that neither party can infer the number of unmatched customers on the other side by analyzing the size or absence of matching results, thereby preventing information leakage. Ultimately, the MPC will output the join results, generating a unified `cross-reference-table`. An example is shown below (`fake_hash_B_1` and `fake_hash_A_1` are the forged hashes):

   | `match_uuid` | `retail_giant_hashed_email` | `digital_marketing_hashed_email` |
   | --- | --- | --- |
   | `uuid_001` | `hash_A_xyz` | `hash_B_xyz` |
   | `uuid_002` | `hash_A_abc` | `hash_B_abc` |
   | `uuid_003` | `hash_A_def` | `fake_hash_B_1` |
   | `uuid_004` | `fake_hash_A_1` | `hash_B_uvw` |

3. **User Inputs SQL Query in the Frontend Interface**

   Both users' selected tables and queryable columns will be displayed in the frontend. Users can perform join queries or single-table queries according to their needs.

   <img width="744" height="288" alt="image" src="https://github.com/user-attachments/assets/8ca6d2f5-a559-4449-82b2-d33e82b32d27" />

   Backend Logic:
   (1) Upon receiving and validating the user's SQL query via an SQL parser, the backend can obtain the table names referenced in the query.
   (2) The backend, after retrieving the relevant views based on the table names, joins them with the other tables using the `cross-reference-table`. It then dynamically generates the user's requested SQL query using a defined SQL template and executes it.

   The example backend SQL:

   ```sql
   SELECT
       dma.ad_campaign_id,
       COUNT(DISTINCT crt.match_uuid) AS unique_matched_customers,
       COUNT(dma.ad_impression_timestamp) AS total_matched_impressions
   FROM dcr_matches.cross_reference_table crt
   JOIN dcr_parties.retail_giant_purchases rp
       ON crt.retail_giant_hashed_email = rp.hashed_email
   JOIN dcr_parties.digital_marketing_ads dma
       ON crt.digital_marketing_hashed_email = dma.hashed_email
   WHERE rp.purchase_date >= CURRENT_DATE - INTERVAL '{{ NUM_DAYS_RECENT_PURCHASES }} days'
     AND dma.ad_impression_timestamp IS NOT NULL
   GROUP BY dma.ad_campaign_id
   ORDER BY unique_matched_customers DESC;
   ```

4. **Spring Boot Switches Roles, Returns Results to the Frontend, and Stores Them in** `data-result-schema`.

## Required Third-Party Libraries/Plugins

- JSqlParser
- Python MPC: PSI Protocol
- Spring Cloud Config Server or Vault
- Firebase Cloud Messaging (FCM)

## Function Points

- **User Registration & Login -- setup & login**
  - **Setup:** Users register through the frontend by entering a username, login email, company name, and password. The email address must be unique. After all user inputs are logically valid and submitted, the backend logic will dynamically generate SQL to create the user in Cloudberry and simultaneously create a `User` entity. At the same time, a dedicated role for this user will be generated and bound to the user in Cloudberry. The management of passwords for all subsequent active roles will be handled by Spring Cloud Config Server or Vault.
  - **Login:** The backend retrieves the email and password entered by the user in the frontend and performs backend login verification. Upon successful login, the backend generates a secure JWT session token for subsequent API request authentication and authorization.
- **User Database Creation**
  - When a frontend user initiates a request to create a database, the backend receives the relevant parameters. The `super_admin_role` will dynamically generate an independent database (db) within Cloudberry for this user. Full control permissions (SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, etc.) for this database will be GRANTed to the user's dedicated role. From this point, the user can securely manage and operate their own raw data.
- **User Uploads CSV/JSON Files**
  - Users upload CSV or JSON formatted data files through the frontend interface. The files will first be securely uploaded to a temporary storage path accessible by the backend server (e.g., a local Linux virtual machine address or a cloud-based storage address). The backend will then dynamically generate `CREATE TABLE` SQL and use the current user's dedicated role to create the target table under their database.
  - Subsequently, by dynamically generating `gpload` YAML content, the uploaded files will be loaded into the target table within the target database using `gpload`.
- **Create DCR**
  - When the current user clicks the "create data clean room" button in the frontend, the backend will correspondingly create the relevant DCR database, its schemas, and roles within Cloudberry. These roles will then be assigned to the relevant users and the backend application.
  - Furthermore, the backend will create `CleanRoom` and `UserCleanRoom` entities, setting the `userStatus` for the latter to `PROVIDER`.
- **Send Invitation Notification**
  - After a user creates and configures a DCR, they can invite other users to join for collaborative data analysis. When the frontend user selects and invites another user, the invited user's parameters and the associated `UserCleanRoom` attribute details will be transmitted to the backend. Firebase Cloud Messaging (FCM) will then send a notification based on these parameters. Simultaneously, to ensure that the notification is received even if the user is offline, the notification will be saved as a `Notification` entity in the database with an "unreceived" (or "pending") status.
- **Join DCR**
  - When the DCR creator adds another user as a DCR participant via the frontend, the backend receives the frontend parameters and creates a new `UserCleanRoom` object, with its `userStatus` set to `CONSUMER`. Before entering the DCR, there will be backend logic for access review: if the user is not present in the `User` list of that `CleanRoom` entity, entry will be denied.
- **Row and Column-Level Access Control**
  - Users select the tables they wish to share in the frontend, then specify the columns allowed for querying and define row-level filter conditions. These parameters are transmitted to the backend for dynamic SQL generation, which Cloudberry's SQL engine then executes to create a view. Simultaneously, a new `View` object representing this view is created and stored in its owner's database.
- **Join**
  - Users configure the join columns in the frontend. The backend, based on these frontend parameters, will hash and encrypt the join columns' values specified by the policy, as well as the IDs of the respective tables.
  - Subsequently, by switching to the `dcr_data_staging_{userId}_role`, the hashed data will be ingested into the `data-staging-schema` of the DCR database using the same `gpload` method as for user-uploaded CSV/JSON files.
  - Spring Boot will then transfer this data from the `data-staging-schema` to the MPC layer through the message queue.
  - After the `cross-reference-table` is generated by the MPC module, Spring Boot will switch to the `dcr_data_matching_role` to store the `cross-reference-table` in the `data-matching-schema`.
  - Upon receiving the user's query SQL from the frontend, a third-party SQL parser plugin will be used to extract the table names involved in the user's join query. Spring Boot will then switch to the `dcr_data_analysis_role` and execute dynamically generated join SQL to join the target tables via the `cross-reference-table` and the views in the `data-staging-schema`, storing the result in the `data-analysis-schema`.
  - Next, Spring Boot will switch to the `dcr_data_analyst_role` and perform a `SELECT` query on the target view in the `data-analysis-schema` based on the dynamically generated SQL.
  - Finally, Spring Boot will switch to the `dcr_data_result_{userId}_role` to store the generated result in the `data-result-schema`.
- **Aggregation**
  - The backend first retrieves the SQL query submitted by the user and, using `JSqlParser`, parses it to extract all aggregation functions used. The system will strictly validate that all aggregation functions used by the user are within the allowed aggregation function whitelist defined in the `Policy` entity's `policyRules` (JSON) (e.g., `SUM`, `COUNT`, `AVG`, `MIN`, `MAX`, `COUNT(DISTINCT)`). Only queries that pass this validation will be executed; an illustrative `policyRules` document is sketched after this list.
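For concreteness, a hypothetical `policyRules` JSON document of the shape this validation could check against (the field names are illustrative, not a fixed schema):

```json
{
  "allowedAggregations": ["SUM", "COUNT", "AVG", "MIN", "MAX", "COUNT(DISTINCT)"],
  "allowedColumns": ["hashed_email", "purchase_value", "purchase_date"],
  "joinKeys": ["hashed_email"],
  "rowFilter": "purchase_date >= CURRENT_DATE - INTERVAL '90 days'",
  "differentialPrivacy": { "epsilon": 0.5, "sensitivity": 1.0 }
}
```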
### Rollout/Adoption Plan

_No response_

### Are you willing to submit a PR?

- [X] Yes I am willing to submit a PR!

GitHub link: https://github.com/apache/cloudberry/discussions/1270