You can union all the DFs together, then call repartition().

On Sun, May 10, 2015 at 8:34 AM, Peter Aberline <peter.aberl...@gmail.com> wrote:
> Hi
>
> Thanks for the quick response.
>
> No, I'm not using Streaming. Each DataFrame represents tabular data read
> from a CSV file. They all have the same schema.
>
> There is also the option of appending each DF to the Parquet file, but then
> I can't maintain them as separate DFs when reading back in without filtering.
>
> I'll rethink maintaining each CSV file as a single DF.
>
> Thanks,
> Peter
>
>
> On 10 May 2015 at 15:51, ayan guha <guha.a...@gmail.com> wrote:
>>
>> How did you end up with thousands of DFs? Are you using Streaming? In that
>> case you can do foreachRDD and keep merging the incoming RDDs into a single
>> RDD, then save it through your own checkpoint mechanism.
>>
>> If not, please share your use case.
>>
>> On 11 May 2015 00:38, "Peter Aberline" <peter.aberl...@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> I have many thousands of small DataFrames that I would like to save to
>>> one Parquet file to avoid the HDFS 'small files' problem. My
>>> understanding is that there is a 1:1 relationship between DataFrames and
>>> Parquet files if a single partition is used.
>>>
>>> Is it possible to have multiple DataFrames within one Parquet file
>>> using PySpark?
>>> Or is the only way to achieve this to union the DataFrames into one?
>>>
>>> Thanks,
>>> Peter
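A minimal PySpark sketch of the suggestion above: read each small CSV, union them into one DataFrame, repartition, and write a single Parquet file. The file paths, output location, and the "source_file" tag column (added so each original DataFrame can still be recovered later by filtering, per Peter's concern) are hypothetical; this also uses the newer SparkSession/union API rather than the 2015-era SQLContext/unionAll.

    from functools import reduce

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("merge-small-csvs").getOrCreate()

    # Hypothetical list of small CSV files, all sharing the same schema.
    csv_paths = ["data/file_0001.csv", "data/file_0002.csv"]

    # Read each CSV and tag its rows with the source path so the original
    # "logical DataFrames" can be told apart after the union.
    dfs = [
        spark.read.csv(p, header=True, inferSchema=True)
             .withColumn("source_file", F.lit(p))
        for p in csv_paths
    ]

    # Union everything into one DataFrame, then collapse to a single
    # partition so the write produces one Parquet file.
    merged = reduce(lambda a, b: a.union(b), dfs)
    merged.repartition(1).write.mode("overwrite").parquet("out/merged.parquet")

    # Reading one of the original DataFrames back requires a filter:
    first = spark.read.parquet("out/merged.parquet").where(
        F.col("source_file") == "data/file_0001.csv"
    )

Note that repartition(1) funnels all data through one task, which is fine for many small files but can become a bottleneck if the combined data is large; a higher partition count still avoids the small-files problem while keeping the write parallel.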