Hi Spark users,
I'm currently investigating spark's bucketing and partitioning
capabilities and I have some questions:
Let /T/ be a table that is bucketed and sorted by /T.id/ and partitioned
by /T.date/. Before persisting, /T/ has been repartitioned by /T.id/ to
get only one file per bucket.
I want to group by /T.id/ over a subset of /T.date/'s values.
It seems to me that the best execution plan in this scenario would be
the following:
- Schedule one stage (no exchange) with as many tasks as we have
bucket-ids, so that there is a mapping from each task to a bucket-id
- Each tasks opens all bucket-files belonging to "it's" bucket-id
simultaneously, which is one per affected partition /T.date/
- Since the data inside the buckets are sorted, we can perform the
second phase of "two-phase-multiway-merge-sort" to get our groups, which
can be "pipelined" into the next operator
From what I understand after scanning through the code, however, it
appears to me that each bucket-file is read completely before the
record-iterator is advanced to the next bucket file (see FileScanRDD ,
same applies to Hive). So a groupBy would require to sort the partitions
of the resulting RDD before the groups can be emitted, which results in
a blocking operation.
Could anyone confirm that I'm assessing the situation correctly here, or
correct me if not?
Followup questions:
1. Is there a way to get the "sql" groups into the RDD API, like the RDD
groupBy would return them? I fail to formulate a job like this, because
a query with groupBy, that misses an aggregation function, is invalid.
2. I haven't simply testet this, because I fail to load a table with the
specified properties like above:
After writing a table like this:
.write().partitionBy("date").bucketBy(4,"id").sortBy("id").format("json").saveAsTable("table");
I fail to read it again, with the partitioning and bucketing being
recognized.
Is a functioning Hive-Metastore required for this to work, or is there a
workaround?
I hope someone can spare the time to help me out here.
All the best,
Fridtjof