[ 
https://issues.apache.org/jira/browse/KUDU-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke reassigned KUDU-2490:
---------------------------------

    Assignee: Grant Henke

> implement Kudu DataSourceV2 and related classes
> -----------------------------------------------
>
>                 Key: KUDU-2490
>                 URL: https://issues.apache.org/jira/browse/KUDU-2490
>             Project: Kudu
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Andrew Wong
>            Assignee: Grant Henke
>            Priority: Major
>              Labels: roadmap-candidate
>
> The current Kudu-Spark bindings implement a DefaultSource that extends 
> RelationProvider and supplies BaseRelations to Spark. As I understand it, 
> BaseRelations are physical units of query execution that represent sets of 
> rows. The Kudu BaseRelation (the KuduRelation) implements a couple of traits 
> to fit into Spark: PrunedFilteredScan, which allows predicates to be pushed 
> into Kudu, and InsertableRelation, which allows writes to be pushed into 
> Kudu. An issue with these bindings is that, while they provide interfaces to 
> insert/get data, they do not provide interfaces to push details to Spark that 
> might be useful for optimizing a Kudu query.
> Among other things, this is inconvenient for any datasource that might want 
> to take such optimizations into its own hands. The Spark community appears 
> to be revamping its DataSource APIs in the form of DataSourceV2 and, for 
> read support, the v2 DataSourceReader. This new world order provides a clear 
> path towards implementing various optimizations that are unavailable with 
> the current Spark bindings, without pushing changes to Spark itself.
> Of note, the v2 DataSourceReader can be extended with 
> SupportsReportStatistics, which could allow Kudu to expose statistics to 
> Spark without having to rely on HMS (although pushing stats to HMS isn't an 
> unreasonable approach either). More traits and details about the API can be 
> found 
> [here|https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/sources/v2/reader/DataSourceReader.html].
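
To make the statistics idea above concrete, here is a minimal, self-contained sketch of a reader that reports a row count and size estimate which a planner could then consult. Everything in it is an assumption for illustration: the Statistics and ReportsStatistics interfaces are simplified stand-ins loosely modeled on the v2 reader traits, not Spark's real types, and KuduReaderSketch, shouldBroadcast, and the threshold value are hypothetical names that do not come from Kudu or Spark.

```java
import java.util.OptionalLong;

class KuduStatsSketch {

    // Simplified stand-in for a statistics holder (NOT Spark's real interface).
    interface Statistics {
        OptionalLong sizeInBytes();
        OptionalLong numRows();
    }

    // Simplified stand-in for a reader mix-in that can report statistics.
    interface ReportsStatistics {
        Statistics getStatistics();
    }

    // Hypothetical Kudu-side reader: derives estimates from numbers that the
    // data source would have to obtain from Kudu itself (assumed inputs here).
    static final class KuduReaderSketch implements ReportsStatistics {
        private final long rowCount;
        private final long avgRowBytes;

        KuduReaderSketch(long rowCount, long avgRowBytes) {
            this.rowCount = rowCount;
            this.avgRowBytes = avgRowBytes;
        }

        @Override
        public Statistics getStatistics() {
            return new Statistics() {
                @Override
                public OptionalLong sizeInBytes() {
                    // Rough estimate: rows times average row width.
                    return OptionalLong.of(rowCount * avgRowBytes);
                }

                @Override
                public OptionalLong numRows() {
                    return OptionalLong.of(rowCount);
                }
            };
        }
    }

    // Hypothetical planner-side check: prefer a broadcast join only when the
    // source reports a size and that size is at or below the threshold.
    static boolean shouldBroadcast(ReportsStatistics reader, long thresholdBytes) {
        OptionalLong size = reader.getStatistics().sizeInBytes();
        return size.isPresent() && size.getAsLong() <= thresholdBytes;
    }

    public static void main(String[] args) {
        KuduReaderSketch small = new KuduReaderSketch(1_000L, 100L);
        long threshold = 10L * 1024 * 1024; // illustrative threshold only
        System.out.println("estimated bytes: "
                + small.getStatistics().sizeInBytes().getAsLong());
        System.out.println("broadcast candidate: "
                + shouldBroadcast(small, threshold));
    }
}
```

The point of the sketch is only the direction of information flow: the data source computes the estimates and the planner reads them, so no metastore lookup is needed for this decision.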



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
