Thespica opened a new issue, #680:
URL: https://github.com/apache/incubator-graphar/issues/680

   ### Describe the enhancement requested
   
   # Description
   
   This project aims to enhance Apache GraphAr (incubating)'s data loading 
capabilities within the Spark environment, enabling it to directly export data 
in GraphAr format from the LDBC SNB Datagen Spark interface.
   Apache GraphAr is dedicated to providing a standard file format framework for 
graph data storage, with underlying support for formats such as CSV, Apache 
Parquet, and Apache ORC, enabling efficient exchange and interoperability of 
graph data across different systems. GraphAr also provides Spark APIs for 
reading and writing data in this format; a brief usage sketch is shown below.
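
   For context, a minimal, illustrative sketch of the existing GraphAr Spark 
write path. It assumes the GraphWriter helper under org.apache.graphar.graph 
and the convention of "src"/"dst" columns in edge DataFrames; both should be 
checked against the GraphAr Spark release actually in use.

```scala
// Minimal sketch of writing Spark DataFrames as GraphAr data.
// Assumes the GraphWriter helper from the GraphAr Spark library; method
// names and edge column conventions may differ between releases.
import org.apache.spark.sql.SparkSession
import org.apache.graphar.graph.GraphWriter

object GraphArWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("graphar-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in DataFrames; in this project they would come from LDBC SNB
    // Datagen's in-memory output rather than being built by hand.
    val persons = Seq((0L, "Alice"), (1L, "Bob")).toDF("id", "firstName")
    val knows   = Seq((0L, 1L)).toDF("src", "dst")

    val writer = new GraphWriter()
    writer.PutVertexData("Person", persons)
    writer.PutEdgeData(("Person", "knows", "Person"), knows)
    // Writes vertex/edge chunks plus GraphAr YAML metadata under the path.
    writer.write("/tmp/ldbc_graphar", spark, "ldbc_snb")

    spark.stop()
  }
}
```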
   LDBC SNB Datagen is a Spark-based tool designed to generate the large-scale, 
directed, labeled graph datasets with realistic social network characteristics 
required by the LDBC Social Network Benchmark. The standard output of this tool 
is a collection of structured CSV files describing vertices, edges, and their 
properties; the goal of this project is instead to consume the data structures 
(such as RDDs or DataFrames) that Datagen generates in Spark memory, rather 
than reading those generated output files back in.
   
   - Code repository: https://github.com/ldbc/ldbc_snb_datagen_spark 
   - Documentation: 
https://github.com/ldbc/ldbc_snb_docs#how-to-cite-ldbc-benchmarks
   
   Although data generated by LDBC Datagen is widely used to test graph 
systems, importing its standard output files into graph databases or graph 
computing systems in a Spark environment typically requires users to write 
additional custom code or conversion scripts to read the CSV files. GraphAr 
already eases data transfer in this context, and this project goes a step 
further by removing the intermediate step of writing and re-reading CSV files 
altogether.
   The core objective of this project is to develop one or more functional 
modules within the Apache GraphAr project, built on its existing or extended 
Spark API, that directly receive the data structures LDBC SNB Datagen generates 
in Spark memory and convert them into GraphAr in-memory structures or file 
storage. In other words, it creates a direct pipeline from Datagen's internal 
data to the GraphAr format, as sketched below.
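
   One possible shape for such a pipeline, shown only to make the idea 
concrete: a helper that accepts the vertex and edge DataFrames Datagen holds in 
memory and forwards them to the GraphAr Spark writer. The Ldbc2GraphAr name and 
its parameters are hypothetical; defining the real hook into Datagen is the 
work of this project.

```scala
// Hypothetical pipeline sketch (names and signatures are illustrative only):
// take Datagen's in-memory vertex/edge DataFrames and write them as GraphAr,
// with no intermediate CSV files.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.graphar.graph.GraphWriter

object Ldbc2GraphAr {
  def write(
      vertices: Map[String, DataFrame],                // e.g. "Person" -> person DataFrame
      edges: Map[(String, String, String), DataFrame], // e.g. ("Person", "knows", "Person") -> edge DataFrame
      outputPath: String,
      graphName: String,
      spark: SparkSession): Unit = {
    val writer = new GraphWriter()
    vertices.foreach { case (label, df) => writer.PutVertexData(label, df) }
    edges.foreach { case (relation, df) => writer.PutEdgeData(relation, df) }
    writer.write(outputPath, spark, graphName)
  }
}
```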
   
   Through this work, users will be able to apply the Spark write/conversion 
interfaces provided by GraphAr directly to data generated by LDBC Datagen 
within the Spark environment, without first writing the data to standard output 
files and reading them back. The resulting GraphAr data can then be used for 
subsequent graph analysis and processing in Spark, for example by reading it 
back as in the sketch below. This greatly simplifies connecting LDBC datasets 
with the GraphAr ecosystem, lowers the barrier to entry for users, and further 
promotes GraphAr as a standard format for graph data loading and processing.
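
   For illustration, a rough sketch of reading the written graph back for 
analysis in Spark, assuming the GraphReader helper from the GraphAr Spark 
library and that GraphWriter names the graph metadata file 
"<graph name>.graph.yml"; the exact entry point and return types should be 
verified against the GraphAr Spark API documentation.

```scala
// Rough read-back sketch. Assumes GraphReader.read(graphInfoPath, spark)
// returns vertex and edge DataFrame collections keyed by label; verify the
// actual signature and the metadata file name against the release in use.
import org.apache.spark.sql.SparkSession
import org.apache.graphar.graph.GraphReader

object GraphArReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("graphar-read-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed location of the graph-level YAML written by the sketch above.
    val graphInfoPath = "/tmp/ldbc_graphar/ldbc_snb.graph.yml"
    val (vertexDataFrames, edgeDataFrames) = GraphReader.read(graphInfoPath, spark)

    // Downstream Spark analysis can start directly from these DataFrames.
    vertexDataFrames("Person").show(10)

    spark.stop()
  }
}
```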
   
   # Output Requirements
   
   1. Core Loading Functionality Implementation: Within the GraphAr project, 
implement functionality based on the GraphAr Spark API that directly exports 
GraphAr-formatted data from LDBC SNB Datagen.
   2. Test Cases: Write sufficient unit and integration tests to verify the 
correctness and performance of the loading functionality, covering at least the 
major entity and relationship types.
   3. User Documentation: Provide clear and detailed documentation in Chinese 
or English explaining how to use the new Spark loading interface.
   4. Example Code: Provide at least one runnable example demonstrating how to 
load data generated by LDBC SNB Datagen.
   
   # Technical Requirements
   
   1. Familiarity with Java/Scala programming languages.
   2. Familiarity with Apache Spark or related big data frameworks.
   3. Understanding of graph database related background/concepts.
   
   # More Details
   Apply at OSPP: https://summer-ospp.ac.cn/org/prodetail/25e7a0292
   Expected Completion Hours: 180 Hours
   
   ### Component(s)
   
   Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
