Weichen Xu created SPARK-46361:
----------------------------------

             Summary: Add spark dataset chunk read API (python only)
                 Key: SPARK-46361
                 URL: https://issues.apache.org/jira/browse/SPARK-46361
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, Spark Core
    Affects Versions: 4.0.0
            Reporter: Weichen Xu
*Proposed API:*
{code:python}
def persist_dataframe_as_chunks(dataframe: DataFrame) -> list[str]:
    """
    Persist the Spark dataframe as chunks, each chunk being an arrow batch.
    Returns the list of chunk ids.
    This function is only available when called from the Spark driver process.
    """

def read_chunk(chunk_id: str):
    """
    Read a chunk by id, returning the arrow batch data of this chunk.
    This function can be called from the Spark driver, a Spark Python UDF
    worker, a descendant process of the Spark driver, or a descendant
    process of a Spark Python UDF worker.
    """

def unpersist_chunks(chunk_ids: list[str]) -> None:
    """
    Remove chunks by chunk ids.
    This function is only available when called from the Spark driver process.
    """
{code}
*Motivation:*

In Ray on Spark, we want to support loading a Ray dataset from an arbitrary Spark DataFrame via in-memory conversion. In Ray on Spark, a Ray datasource read task runs as a child process of a Ray worker node, and the Ray worker node itself is launched as a child process of a PySpark UDF worker. The proposed API therefore allows a descendant Python process of a PySpark UDF worker to read chunk data of a given Spark dataframe.
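*Example usage (illustrative):*

The sketch below shows how the three proposed functions might fit together. It assumes the functions are importable in the calling process (this ticket does not specify a module path, so they are referenced directly) and that read_chunk returns a pyarrow RecordBatch.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# Driver process: persist the dataframe as arrow-batch chunks.
chunk_ids = persist_dataframe_as_chunks(df)

# Driver, UDF worker, or a descendant process of either:
# fetch one chunk back as an arrow batch.
arrow_batch = read_chunk(chunk_ids[0])
print(arrow_batch.num_rows)  # assuming a pyarrow RecordBatch is returned

# Driver process: release the persisted chunks when done.
unpersist_chunks(chunk_ids)
{code}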