Re: SparkSQL AVRO

2015-12-07 Thread Deenar Toraskar
By default Spark writes one output file per partition, and Spark SQL defaults to 200 partitions after a shuffle (spark.sql.shuffle.partitions). If you want fewer files written out, repartition your DataFrame to the desired number of partitions before writing:
originalDF.repartition(10).write.avro("masterNew.avro")
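A fuller sketch of the above, in case it helps. It assumes the Spark 1.x SQLContext API and the Databricks spark-avro package on the classpath (that package is what provides the .avro methods on the reader and writer). The input path "master.avro", the app name, and the local master are placeholders of mine, not taken from the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // adds .avro(...) to DataFrameReader/DataFrameWriter

object WriteFewerAvroFiles {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative only.
    val conf = new SparkConf().setAppName("avro-repartition").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input path; substitute your own source DataFrame.
    val originalDF = sqlContext.read.avro("master.avro")

    // The number of partitions is the number of part- files a write produces.
    println(s"partitions before: ${originalDF.rdd.partitions.length}")

    // Shuffle into 10 partitions so the write emits 10 part- files.
    val fewer = originalDF.repartition(10)
    println(s"partitions after: ${fewer.rdd.partitions.length}")

    fewer.write.avro("masterNew.avro")

    sc.stop()
  }
}
```

Alternatively, if the 200 files come from a shuffle in a Spark SQL query, lowering the setting with sqlContext.setConf("spark.sql.shuffle.partitions", "10") before running the query should have a similar effect without the extra repartition step.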

Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
How many reducers did you have when those Avro files were created? Each reducer very likely writes its own Avro part- file. We normally use Parquet, but the behavior should be the same for Avro, so this might be relevant: http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-ta