Dear Sparkers,

A while back, I asked how to process non-splittable files in parallel, one file 
per executor. Vadim's suggested "scheduling within an application" approach 
worked out beautifully.

I am now facing the 'opposite' problem:

 - I have a bunch of parquet files to process
 - Once processed I need to output a /single/ file for each input file
 - When I read a parquet file, it gets partitioned over several executors
 - If I want a single output file, I would need to coalesce(1), which can cause
   performance issues.

Since my files are relatively small, each one could be handled entirely by a 
single executor, and several files could be read in parallel, one per executor.
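
For illustration, here is a plain-Python sketch of the pattern I'm after (one
whole file per worker, exactly one output file per input). It is only a
stand-in for per-file Spark jobs submitted from a thread pool, in the spirit
of the earlier "scheduling within an application" approach; the file names
and the processing step are hypothetical:

```python
# Sketch of the desired pattern: each worker owns one whole input file
# and writes exactly one output file. In Spark this would be one job per
# parquet file submitted from a thread pool; here plain files stand in
# so the shape of the pattern is clear. Names and "processing" are
# placeholders, not real code from my application.
from concurrent.futures import ThreadPoolExecutor
import os
import tempfile

def process_one(in_path, out_dir):
    # Stand-in for: read one parquet file, transform it, write ONE output.
    with open(in_path) as f:
        data = f.read()
    out_path = os.path.join(out_dir, os.path.basename(in_path) + ".out")
    with open(out_path, "w") as f:
        f.write(data.upper())  # placeholder "processing" step
    return out_path

def process_all(in_paths, out_dir, max_workers=4):
    # One task per file; files run in parallel, each on a single worker,
    # so no per-file repartitioning or coalescing is ever needed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: process_one(p, out_dir), in_paths))
```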

My question is: how can I force each parquet file to be read by a single 
executor, without repartitioning or coalescing, of course?

Regards,

Jeroen

