Thespica commented on issue #738:
URL:
https://github.com/apache/incubator-graphar/issues/738#issuecomment-3244749174
> Can we somehow separate the java-kernel from the I/O? For me, if we include
> anything for Parquet (or Hadoop) in the core, users will face dependency
> hell. Shading does not sound like a good option to me, because parquet-java
> is about 3 MB, and if we add more we will end up with a 100 MB JAR.
>
> Like:
>
> * `org.apache.graphar.java`
> * `org.apache.graphar.java-io`
>
> Or something like this. For example, I want to use the future Java lib for
> creating a GraphFrames integration, but I only need the logic of schema and
> path resolving; the actual reading will be done by Spark.

@SemyonSinchenko Thanks a lot for your valuable input. I completely agree
that we need to separate the core logic from the I/O implementation to avoid
dependency hell.
Building on that idea, my thought is to structure the overall architecture
into four abstraction layers, from the highest to the lowest:
* **Layer 1: Schema Level (Metadata Layer)**: This layer deals only with
the `.yml` configuration files and is completely independent of the actual data
files. This corresponds to the metadata-only core module you described.
* **Layer 2: Graph Level (High-Level Graph API)**: Also known as the "high
level" in the C++ library, it allows users to read and write from a
graph-centric perspective.
* **Layer 3: Table Level (Logical Table API)**: Corresponding to the "mid
level" in the C++ library, it allows users to read and write from the
perspective of a logical table (e.g., a table composed of all `name` properties
for a vertex).
* **Layer 4: Chunk Level (Physical Chunk API)**: This is the lowest level
of abstraction, where users operate from the perspective of a single physical
file/chunk.
The key point is that **only Layer 3 (the Table Level) depends on specific
data file formats** such as Parquet.
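To make Layer 1's format independence concrete, here is a minimal sketch of what a metadata-only class could look like. The names (`VertexInfo`, `chunkPath`, `propertyType`) and the path convention are illustrative assumptions, not the actual GraphAr API; the point is that both schema lookup and chunk-path resolution need nothing but the parsed `.yml` metadata:

```java
import java.util.Map;

// Hypothetical Layer 1 sketch: resolves schema and chunk paths from
// metadata alone, with no file I/O and no Parquet/Hadoop dependency.
class VertexInfo {
    private final String prefix;                    // e.g. "vertex/person/"
    private final long chunkSize;                   // rows per chunk, from the .yml
    private final Map<String, String> propertyTypes;

    VertexInfo(String prefix, long chunkSize, Map<String, String> propertyTypes) {
        this.prefix = prefix;
        this.chunkSize = chunkSize;
        this.propertyTypes = propertyTypes;
    }

    /** Type of a property as declared in the schema, e.g. "string". */
    String propertyType(String property) {
        return propertyTypes.get(property);
    }

    /** Path of the chunk holding the given vertex index, derived purely from metadata. */
    String chunkPath(String property, long vertexIndex) {
        long chunkIndex = vertexIndex / chunkSize;
        return prefix + property + "/chunk" + chunkIndex;
    }

    public static void main(String[] args) {
        VertexInfo person = new VertexInfo(
                "vertex/person/", 1024,
                Map.of("id", "int64", "name", "string", "age", "int32"));
        System.out.println(person.propertyType("name"));    // string
        System.out.println(person.chunkPath("name", 0));    // vertex/person/name/chunk0
        System.out.println(person.chunkPath("name", 2048)); // vertex/person/name/chunk2
    }
}
```

This is exactly the subset a Spark integration would use: resolve the schema and the chunk paths here, then hand the paths to Spark's own readers.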
To implement this layered concept and thoroughly address the dependency
concerns you raised, we can design the following modular structure. This
structure completely separates the API definitions from their concrete
implementations:
```
- Layer 1: Schema Level API (Metadata Layer; core API module with no heavy
  I/O dependencies)
    - Responsibility: Parse YAML files, provide schema access, and generate
      file paths.
    - Example:
        // Input:  "person.vertex.yml"
        // Output: properties of "person"
        //         -> {id: int64, name: string, age: int32}
        // Output: path of the first chunk of the "name" property of "person"
        //         -> "/path/to/vertex/person/name/chunk0"

- Layer 2: Graph Level API (High-Level Graph API)
    - Responsibility: Provide a graph-oriented read/write API, translating
      graph operations into underlying table operations.
    - Example:
        // Input:  getVertices("person")
        // Output: an iterator of vertex objects
        //         -> [Vertex{id:0, name:"Alice", age:30},
        //             Vertex{id:1, name:"Bob", age:25}, ...]

- Layer 3: Table Level API (Logical Table API)
    - Responsibility: Define abstract interfaces for reading/writing logical
      tables (e.g., a property column), with pluggable I/O implementations.
    - Example:
        // Input:  readTable("person", "name")
        // Output: an Arrow Table (or similar structure) containing all data
        //         for that property
        //         | id | name  |
        //         |----|-------|
        //         | 0  | Alice |
        //         | 1  | Bob   |
    - Pluggable implementations (users include them as needed):
        - `graphar-java-parquet`: concrete implementation for the Parquet format
        - `graphar-java-orc`: concrete implementation for the ORC format
        - ... (more can be added in the future)

- Layer 4: Chunk Level API (Physical Chunk API)
    - Responsibility: Read and write a single physical chunk file.
    - Example:
        // Input:  readFile("/path/to/vertex/person/name/chunk0")
        // Output: the raw byte stream of the file content, or a
        //         Parquet/ORC reader object
```
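To sketch how the pluggable Layer 3 implementations could be wired in without the core ever touching Parquet: one option is the standard `java.util.ServiceLoader` SPI mechanism, so that adding `graphar-java-parquet` to the classpath is all a user needs to do. The interface and class names below are assumptions for illustration, not a committed design:

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical Layer 3 contract: a logical-table reader, independent of format.
interface TableReader {
    long rowCount();
}

// SPI that format modules (e.g. a hypothetical graphar-java-parquet) would
// implement and register under META-INF/services in their own JAR.
interface TableReaderFactory {
    String format();              // e.g. "parquet", "orc"
    TableReader open(String path);
}

class TableReaders {
    /** Look up a factory for the given format among the modules on the classpath. */
    static Optional<TableReaderFactory> forFormat(String format) {
        for (TableReaderFactory factory : ServiceLoader.load(TableReaderFactory.class)) {
            if (factory.format().equals(format)) {
                return Optional.of(factory);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // With no format module on the classpath, the lookup is simply empty;
        // the core module itself never depends on Parquet or Hadoop.
        System.out.println(TableReaders.forFormat("parquet").isPresent());
    }
}
```

The core module would only define the two interfaces and the lookup; each format module ships its factory plus the heavy dependencies.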
**The advantages of this architecture are clear:**
1. **For your Spark integration scenario**: You would **only need to depend
on `graphar-java-info`**. This gives you all the necessary metadata handling,
path generation, and high-level API definitions while completely avoiding any
unnecessary Parquet or Hadoop dependencies.
2. **For general users**: If they want to read/write Parquet files
directly, they would simply add an additional dependency on the
`graphar-java-parquet` module. This module would then inject the concrete
implementation for the Layer 3 (Table Level) API.
3. **For us as developers**: This design makes the codebase's
responsibilities clear and highly extensible. Supporting new formats in the
future would only require adding a new module, with zero intrusion into the
core code.
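On the build side, the split could look like the following Maven fragment. The coordinates are illustrative, derived from the module names proposed above, and the version property is a placeholder:

```xml
<!-- Spark/GraphFrames integration: metadata and path resolution only -->
<dependency>
  <groupId>org.apache.graphar</groupId>
  <artifactId>graphar-java-info</artifactId>
  <version>${graphar.version}</version>
</dependency>

<!-- General users who also want direct Parquet I/O additionally add: -->
<dependency>
  <groupId>org.apache.graphar</groupId>
  <artifactId>graphar-java-parquet</artifactId>
  <version>${graphar.version}</version>
</dependency>
```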
I believe this approach can thoroughly resolve the dependency concerns you
raised and provide a clear, flexible, and extensible future for the project. I
look forward to your thoughts.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]