Michael, can you please explain why bucketBy is supported when writing to parquet with saveAsTable() but not with parquet()? Is this the only difference between the table API and the DataFrame/Dataset API, or are there others?
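For reference, a minimal sketch of the two write paths (assuming Spark 2.x; the input/output paths and table name are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bucketing-example").getOrCreate()
    val df = spark.read.parquet("/path/to/input")  // hypothetical input

    // Writing to a table supports bucketing:
    df.write.bucketBy(8, "docId").saveAsTable("docs_bucketed")

    // Writing through the file-based save path does not, and fails with
    // the exception below:
    df.write.bucketBy(8, "docId").parquet("/path/to/output")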
org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
    at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:310)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:203)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:478)

Thanks in advance.

On 28 September 2016 at 21:26, Michael Armbrust <mich...@databricks.com> wrote:

> Hi Darin,
>
> In SQL we have finer-grained information about partitioning, so we don't
> use the RDD Partitioner. Here's a notebook
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3633335638369146/2840265927289860/latest.html>
> that walks through what we do expose and how it is used by the query planner.
>
> Michael
>
> On Tue, Sep 20, 2016 at 11:22 AM, McBeath, Darin W (ELS-STL) <
> d.mcbe...@elsevier.com> wrote:
>
>> I'm using Spark 2.0.
>>
>> I've created a dataset from a parquet file, repartitioned it on one of
>> the columns (docId), and persisted the repartitioned dataset:
>>
>> val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)
>>
>> When I try to confirm the partitioner with
>>
>> om.rdd.partitioner
>>
>> I get
>>
>> Option[org.apache.spark.Partitioner] = None
>>
>> I would have thought it would be HashPartitioner.
>>
>> Does anyone know why this would be None and not HashPartitioner?
>>
>> Thanks.
>>
>> Darin.
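As a footnote to Michael's reply above: the partitioning Spark SQL does track shows up in the physical plan rather than in rdd.partitioner. A minimal sketch, assuming Spark 2.x, a SparkSession named spark, and input data with a docId column; the input path is a placeholder:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._

    val ds = spark.read.parquet("/path/to/input")  // hypothetical input
    val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)

    // The hash partitioning on docId is visible in the physical plan;
    // explain() prints something like:
    //   Exchange hashpartitioning(docId#0, 200)
    om.explain()

    // But the RDD view of a Dataset does not carry a SQL Partitioner:
    om.rdd.partitioner  // Option[org.apache.spark.Partitioner] = None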