Thespica opened a new issue, #680: URL: https://github.com/apache/incubator-graphar/issues/680
### Describe the enhancement requested

# Description

This project aims to enhance Apache GraphAr (incubating)'s data loading capabilities within the Spark environment, enabling it to export data in GraphAr format directly from the LDBC SNB Datagen Spark interface.

Apache GraphAr is dedicated to providing a standard file format framework for graph data storage. It supports several underlying formats, such as CSV, Apache Parquet, and Apache ORC, so that graph data can move efficiently and interoperably between systems. GraphAr provides corresponding Spark APIs for reading and writing data in its format.

LDBC SNB Datagen is a Spark-based tool designed to generate the large-scale, directed, labeled graph datasets with realistic social network characteristics required by the LDBC Social Network Benchmark. The tool's standard output is typically a collection of structured CSV files describing vertices, edges, and their properties. This project instead aims to use the data structures Datagen generates internally within Spark (such as RDDs or DataFrames) directly, rather than reading those already generated output files.

- Code repository: https://github.com/ldbc/ldbc_snb_datagen_spark
- Documentation: https://github.com/ldbc/ldbc_snb_docs#how-to-cite-ldbc-benchmarks

Data generated by LDBC Datagen is often used for testing graph systems. Currently, however, importing its standard output files into graph databases or graph computing systems for processing in a Spark environment typically requires users to write additional custom code or conversion scripts to read the CSV files. GraphAr already eases data transfer in this context, and this project goes further by avoiding the intermediate step of writing and re-reading CSV files.

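
The difference between the two paths described above can be sketched without Spark. This is only an illustration of the data flow, under loud assumptions: plain lists of dicts stand in for the DataFrames Datagen holds in memory, the `sink` callable stands in for whatever GraphAr Spark writer the project ends up using, and every name here (`generate_tables`, `export_via_csv`, `export_direct`) is hypothetical and not part of Datagen or GraphAr.

```python
import csv
import io
from typing import Callable, Dict, List

# A "table" is a plain list of row dicts; it stands in for a Spark DataFrame.
Table = List[Dict[str, object]]
# Hypothetical sink: receives an entity name and its in-memory table.
Sink = Callable[[str, Table], None]


def generate_tables() -> Dict[str, Table]:
    """Stand-in for Datagen: returns tiny in-memory vertex and edge tables."""
    return {
        "Person": [
            {"id": 1, "firstName": "Alice"},
            {"id": 2, "firstName": "Bob"},
        ],
        "Person_knows_Person": [{"src": 1, "dst": 2}],
    }


def export_via_csv(tables: Dict[str, Table], sink: Sink) -> None:
    """Detour path: serialize each table to CSV text, parse it back, hand it on."""
    for name, rows in tables.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        buffer.seek(0)
        # Re-parsing is the extra step; csv yields every value as a string.
        sink(name, [dict(row) for row in csv.DictReader(buffer)])


def export_direct(tables: Dict[str, Table], sink: Sink) -> None:
    """Direct path: hand each in-memory table to the sink unchanged."""
    for name, rows in tables.items():
        sink(name, rows)


if __name__ == "__main__":
    direct: Dict[str, Table] = {}
    detour: Dict[str, Table] = {}
    export_direct(generate_tables(), lambda name, rows: direct.update({name: rows}))
    export_via_csv(generate_tables(), lambda name, rows: detour.update({name: rows}))
    print(type(direct["Person"][0]["id"]).__name__)  # int
    print(type(detour["Person"][0]["id"]).__name__)  # str
```

In this toy version the detour both adds a serialize-and-parse step and loses the column types, which then have to be reapplied by hand. In Spark the details differ (a schema can be supplied when reading CSV), but the extra write and read remain.
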
The core objective of this project is to develop one or more functional modules within the Apache GraphAr project that use its existing or extended Spark API to directly receive and process the data structures LDBC SNB Datagen generates in Spark's memory, and convert them into GraphAr-formatted in-memory structures or file storage. In other words, the project creates a direct pipeline from the internal data of LDBC Datagen to the GraphAr format.

Through this work, users will be able to use the Spark writing and conversion interfaces provided by GraphAr to process data generated by LDBC Datagen within the Spark environment, without first writing the data to standard output files and then reading it back. Subsequent graph analysis and processing in the Spark environment can then use the GraphAr format. This will greatly simplify connecting LDBC datasets with the GraphAr ecosystem, lower the barrier to entry for users, and further promote GraphAr as a standard format for graph data loading and processing.

# Output Requirements

1. Core Loading Functionality Implementation: Within the GraphAr project, implement the functional code, based on the GraphAr Spark API, that exports GraphAr-formatted data directly from LDBC SNB Datagen.
2. Test Cases: Write sufficient unit tests and integration tests to verify the correctness and performance of the loading functionality (covering at least the major entity types and relationship types).
3. User Documentation: Provide clear and detailed documentation, in Chinese or English, explaining how to use the new Spark loading interface.
4. Example Code: Provide at least one runnable example demonstrating how to load data generated by LDBC SNB Datagen.

# Technical Requirements

1. Familiarity with the Java/Scala programming languages.
2. Familiarity with Apache Spark or related big data frameworks.
3. Understanding of graph database background and concepts.

# More Details

Apply at OSPP: https://summer-ospp.ac.cn/org/prodetail/25e7a0292

Expected Completion Hours: 180 Hours

### Component(s)

Spark

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@graphar.apache.org
For additional commands, e-mail: commits-h...@graphar.apache.org