An interesting approach to compacting small files was discussed recently: http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
AFAIK Spark supports views too.

--
Ruslan Dautkhanov

On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
nyigitb...@netflix.com.invalid> wrote:

> Hi Spark people,
> I have a Hive table that has a lot of small parquet files, and I am
> creating a data frame out of it to do some processing, but since I have a
> large number of splits/files my job creates a lot of tasks, which I don't
> want. Basically what I want is the same functionality that Hive provides,
> that is, to combine these small input splits into larger ones by
> specifying a max split size setting. Is this currently possible with
> Spark?
>
> I looked at coalesce(), but with coalesce I can only control the number
> of output files, not their sizes. And since the total input dataset size
> can vary significantly in my case, I cannot just use a fixed partition
> count, as the size of each output file can get very large. I then looked
> for a way to get the total input size from an RDD to come up with some
> heuristic to set the partition count, but I couldn't find any way to do
> it (without modifying the Spark source).
>
> Any help is appreciated.
>
> Thanks,
> Nezih
>
> PS: this email is the same as my previous email, as I learned that my
> previous email ended up as spam for many people since I sent it through
> Nabble. Sorry for the double post.
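One heuristic that may work here: since the RDD API doesn't expose the total
input size, you could list the table's files with the Hadoop FileSystem API
and derive a partition count from a target size per partition. A minimal
sketch in Scala; the table path and the 256 MB target are my assumptions,
not anything Spark provides:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Assumption: all of the table's parquet files sit directly under this
    // path (a partitioned table would need a recursive listing).
    val tablePath = new Path("/user/hive/warehouse/my_table")
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Sum the file lengths to get the total input size in bytes.
    val totalBytes = fs.listStatus(tablePath).map(_.getLen).sum

    // Assumption: aim for roughly 256 MB per output partition.
    val targetBytes = 256L * 1024 * 1024
    val numPartitions =
      math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

    // coalesce() to the computed count instead of a fixed one.
    val df = sqlContext.table("my_table").coalesce(numPartitions)

This doesn't combine the underlying input splits the way Hive's max split
size setting does, but it keeps the partition count proportional to the data
size, which should avoid both the task explosion and the oversized-output
problem you described.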