Michael, can you please explain why bucketBy is supported when writing to parquet with saveAsTable() but not with parquet()? Is this the only difference between the table API and the DataFrame/Dataset API, or are there others?
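For reference, a minimal sketch of the two write paths (assuming Spark 2.x; the input/output paths and table name are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("bucketing-example").getOrCreate()
    val df = spark.read.parquet("/path/to/input")  // hypothetical input

    // Writing to a table supports bucketing:
    df.write.bucketBy(8, "docId").saveAsTable("docs_bucketed")

    // Writing through the file-based save path does not, and fails with
    // the exception below:
    df.write.bucketBy(8, "docId").parquet("/path/to/output")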
org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
    at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:310)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:203)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:478)

Thanks in advance.

On 28 September 2016 at 21:26, Michael Armbrust <mich...@databricks.com> wrote:

> Hi Darin,
>
> In SQL we have finer-grained information about partitioning, so we don't
> use the RDD Partitioner. Here's a notebook
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3633335638369146/2840265927289860/latest.html>
> that walks through what we do expose and how it is used by the query planner.
>
> Michael
>
> On Tue, Sep 20, 2016 at 11:22 AM, McBeath, Darin W (ELS-STL) <
> d.mcbe...@elsevier.com> wrote:
>
>> I'm using Spark 2.0.
>>
>> I've created a dataset from a parquet file, repartitioned it on one of
>> the columns (docId), and persisted the repartitioned dataset:
>>
>> val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)
>>
>> When I try to confirm the partitioner with
>>
>> om.rdd.partitioner
>>
>> I get
>>
>> Option[org.apache.spark.Partitioner] = None
>>
>> I would have thought it would be HashPartitioner.
>>
>> Does anyone know why this would be None and not HashPartitioner?
>>
>> Thanks.
>>
>> Darin.
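As a footnote to Michael's reply above: the partitioning Spark SQL does track shows up in the physical plan rather than in rdd.partitioner. A minimal sketch, assuming Spark 2.x, a SparkSession named spark, and input data with a docId column; the input path is a placeholder:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._

    val ds = spark.read.parquet("/path/to/input")  // hypothetical input
    val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)

    // The hash partitioning on docId is visible in the physical plan;
    // explain() prints something like:
    //   Exchange hashpartitioning(docId#0, 200)
    om.explain()

    // But the RDD view of a Dataset does not carry a SQL Partitioner:
    om.rdd.partitioner  // Option[org.apache.spark.Partitioner] = None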