Thespica commented on issue #738:
URL: https://github.com/apache/incubator-graphar/issues/738#issuecomment-3244749174

   > Can we somehow separate the java-kernel from the IO? For me, if we include anything for Parquet (or Hadoop) in the core, users will face dependency hell. Shading does not sound like a good option to me, because parquet-java is about 3 MB, and if we add more we will end up with a JAR of 100 MB.
   > 
   > Like:
   > 
   > * `org.apache.graphar.java`
   > * `org.apache.graphar.java-io`
   > 
   > Or something like this. For example, I want to use the future Java lib for creating a GraphFrames integration, but I only need the schema and path-resolution logic; the actual reading will be done by Spark.
   
   
   @SemyonSinchenko Thanks a lot for your valuable input. I completely agree 
that we need to separate the core logic from the I/O implementation to avoid 
the "dependency hell".
   
   Building on that idea, my thought is to structure the overall architecture 
into four abstraction layers, from the highest to the lowest:
   
   *   **Layer 1: Schema Level (Metadata Layer)**: This layer deals only with the `.yml` configuration files and is completely independent of the actual data files. It corresponds to the metadata-only (`java-info`) module you suggested.
   *   **Layer 2: Graph Level (High-Level Graph API)**: The counterpart of the "high level" in the C++ library; it lets users read and write from a graph-centric perspective.
   *   **Layer 3: Table Level (Logical Table API)**: The counterpart of the "mid level" in the C++ library; it lets users read and write from the perspective of a logical table (e.g., a table composed of all `name` properties of a vertex type).
   *   **Layer 4: Chunk Level (Physical Chunk API)**: The lowest level of abstraction, where users operate on a single physical file/chunk.
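   To make Layer 1 concrete, here is a minimal, self-contained sketch of what schema-level path resolution could look like. All class and member names (`VertexInfoSketch`, `chunkPath`, `chunkSize`) are illustrative assumptions, not the actual GraphAr Java API:

   ```java
   // Minimal sketch of a Layer 1 (Schema Level) helper: metadata and path
   // resolution only, with no I/O dependencies. All names are illustrative,
   // not the actual GraphAr Java API.
   import java.util.LinkedHashMap;
   import java.util.Map;

   public final class VertexInfoSketch {
       final String label;                // e.g. "person"
       final String prefix;               // base directory for vertex data
       final long chunkSize;              // rows per physical chunk
       final Map<String, String> properties = new LinkedHashMap<>();

       VertexInfoSketch(String label, String prefix, long chunkSize) {
           this.label = label;
           this.prefix = prefix;
           this.chunkSize = chunkSize;
       }

       /** Resolve the physical path of one property chunk from its index. */
       String chunkPath(String property, long chunkIndex) {
           return prefix + "/" + label + "/" + property + "/chunk" + chunkIndex;
       }

       /** Map a vertex id to the index of the chunk that contains it. */
       long chunkIndexOf(long vertexId) {
           return vertexId / chunkSize;
       }

       public static void main(String[] args) {
           VertexInfoSketch person = new VertexInfoSketch("person", "/path/to/vertex", 1024);
           person.properties.put("id", "int64");
           person.properties.put("name", "string");
           // Layer 1 answers "where is the data?" without ever opening a file:
           System.out.println(person.chunkPath("name", person.chunkIndexOf(0)));
           // prints /path/to/vertex/person/name/chunk0
       }
   }
   ```

   Because nothing here touches a file, this layer can be published with zero heavy dependencies.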
   
   The key point here is that **only the third layer (Table Level) is directly 
related to specific data file formats** like Parquet.
   
   To implement this layered concept and thoroughly address the dependency 
concerns you raised, we can design the following modular structure. This 
structure completely separates the API definitions from their concrete 
implementations:
   
   ```
       - Layer 1: Schema Level API (Metadata Layer, Core API module, with no 
heavy I/O dependencies)
           - Responsibility: Parse YAML files, provide schema access, and 
generate file paths.
           - Example:
               // Input: "person.vertex.yml"
               // Output: Get properties of "person" -> {id: int64, name: 
string, age: int32}
               // Output: Get path for the first chunk of the "name" property 
for "person" -> "/path/to/vertex/person/name/chunk0"
   
       - Layer 2: Graph Level API (High-Level Graph API)
           - Responsibility: Provide a graph-oriented read/write API, 
translating graph operations into underlying table operations.
           - Example:
               // Input: getVertices("person")
               // Output: An iterator of vertex objects -> [Vertex{id:0, 
name:"Alice", age:30}, Vertex{id:1, name:"Bob", age:25}, ...]
   
       - Layer 3: Table Level API (Logical Table API)
           - Responsibility: Define abstract interfaces for reading/writing 
logical tables (e.g., a property column), with pluggable I/O implementations.
           - Example:
               // Input: readTable("person", "name")
               // Output: An Arrow Table or a similar structure containing all 
data for that property
               // | id | name  |
               // |----|-------|
               // | 0  | Alice |
               // | 1  | Bob   |
   
           - Pluggable Implementations (Users include them as needed):
               - `graphar-java-parquet`: Provides the concrete implementation 
for the Parquet format.
               - `graphar-java-orc`:    Provides the concrete implementation 
for the ORC format.
               - ... (More can be added in the future)
   
       - Layer 4: Chunk Level API (Physical Chunk API)
        - Responsibility: Read and write a single physical chunk file.
           - Example:
               // Input: readFile("/path/to/vertex/person/name/chunk0")
               // Output: The raw byte stream of the file content, or a 
Parquet/ORC Reader object.
   ```
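   As a rough illustration of the pluggable Layer 3 contract, the sketch below keeps a `TableReader` interface free of format dependencies and lets implementations (such as a hypothetical `graphar-java-parquet` module) be discovered via `ServiceLoader`. All names and the wiring are assumptions; an in-memory reader stands in for a real format module so the example is self-contained:

   ```java
   // Sketch of the pluggable Layer 3 (Table Level) contract. The interface
   // would live in the dependency-free core; graphar-java-parquet or
   // graphar-java-orc would each ship an implementation. All names and the
   // ServiceLoader wiring are assumptions, not the actual GraphAr API.
   import java.util.List;
   import java.util.Map;
   import java.util.ServiceLoader;

   interface TableReader {
       String format();                                        // e.g. "parquet"
       List<Map<String, Object>> readTable(String label, String property);
   }

   // Stand-in for what a format module (e.g. graphar-java-parquet) might provide.
   final class InMemoryTableReader implements TableReader {
       public String format() { return "inmemory"; }
       public List<Map<String, Object>> readTable(String label, String property) {
           return List.of(Map.of("id", 0L, property, "Alice"),
                          Map.of("id", 1L, property, "Bob"));
       }
   }

   public final class TableReaderDemo {
       /** Pick an implementation; real modules would be discovered on the classpath. */
       static TableReader forFormat(String format) {
           for (TableReader r : ServiceLoader.load(TableReader.class)) {
               if (r.format().equals(format)) return r;
           }
           return new InMemoryTableReader(); // fallback keeps this sketch self-contained
       }

       public static void main(String[] args) {
           TableReader reader = forFormat("inmemory");
           System.out.println(reader.readTable("person", "name")); // two logical rows
       }
   }
   ```

   The core stays format-agnostic: adding ORC support would mean shipping one new module that implements `TableReader`, with no change to the interface itself.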
   
   **The advantages of this architecture are very clear:**
   
   1.  **For your Spark integration scenario**: You would **only need to depend 
on `graphar-java-info`**. This gives you all the necessary metadata handling, 
path generation, and high-level API definitions while completely avoiding any 
unnecessary Parquet or Hadoop dependencies.
   
   2.  **For general users**: If they want to read/write Parquet files 
directly, they would simply add an additional dependency on the 
`graphar-java-parquet` module. This module would then inject the concrete 
implementation for the Layer 3 (Table Level) API.
   
   3.  **For us as developers**: This design makes the codebase's 
responsibilities clear and highly extensible. Supporting new formats in the 
future would only require adding a new module, with zero intrusion into the 
core code.
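   Concretely, the "zero intrusion" extension point could reuse Java's standard `ServiceLoader` mechanism: a new format module would only ship a provider-configuration file such as `META-INF/services/org.apache.graphar.table.TableReader` (both the interface and the implementation names here are hypothetical), containing:

   ```
   org.apache.graphar.parquet.ParquetTableReader
   ```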
   
   I believe this approach can thoroughly resolve the dependency concerns you 
raised and provide a clear, flexible, and extensible future for the project. I 
look forward to your thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

