Hi

Thanks for the quick response.

No, I'm not using Streaming. Each DataFrame holds tabular data read from a
single CSV file, and they all have the same schema.

There is also the option of appending each DF to the Parquet file, but then I
can't keep them as separate DFs when reading back in without filtering.
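
For reference, this is roughly what I meant by the appending option. It's only
a sketch, assuming a recent PySpark (spark.read.csv, partitionBy) and a made-up
"source_file" column to tag each CSV's rows:

    import os
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("append-sketch").getOrCreate()

    csv_paths = ["a.csv", "b.csv"]  # placeholder paths

    for path in csv_paths:
        df = spark.read.csv(path, header=True, inferSchema=True)
        tag = os.path.splitext(os.path.basename(path))[0]
        # Tag the rows so each original DataFrame stays identifiable, and
        # partition on the tag so reading one back only touches its own
        # directory instead of scanning the whole dataset.
        (df.withColumn("source_file", F.lit(tag))
           .write.mode("append")
           .partitionBy("source_file")
           .parquet("all_tables.parquet"))

    # Getting one of the original DataFrames back still means a filter,
    # but partition pruning keeps it to a single directory:
    one = (spark.read.parquet("all_tables.parquet")
                .where(F.col("source_file") == "a"))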

I'll rethink maintaining each CSV file as a single DF.
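
If I do end up unioning them instead, I suppose it's just a fold over union,
something like this (again only a sketch, assuming a recent PySpark and
placeholder paths):

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.appName("union-sketch").getOrCreate()

    csv_paths = ["a.csv", "b.csv"]  # placeholder paths
    dfs = [spark.read.csv(p, header=True, inferSchema=True) for p in csv_paths]

    # All the frames share a schema, so folding union over the list gives
    # one combined DataFrame, which can then be written out as a single
    # Parquet dataset (repartition first to control the output file count).
    combined = reduce(DataFrame.union, dfs)
    combined.repartition(8).write.mode("overwrite").parquet("all_tables.parquet")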

Thanks,
Peter


On 10 May 2015 at 15:51, ayan guha <guha.a...@gmail.com> wrote:

> How did you end up with thousands of DFs? Are you using streaming? In that
> case you can do foreachRDD and keep merging incoming RDDs into a single RDD,
> then save it through your own checkpoint mechanism.
>
> If not, please share your use case.
> On 11 May 2015 00:38, "Peter Aberline" <peter.aberl...@gmail.com> wrote:
>
>> Hi
>>
>> I have many thousands of small DataFrames that I would like to save to a
>> single Parquet file to avoid the HDFS 'small files' problem. My
>> understanding is that there is a 1:1 relationship between DataFrames and
>> Parquet files if a single partition is used.
>>
>> Is it possible to have multiple DataFrames within a single Parquet file
>> using PySpark?
>> Or is the only way to achieve this to union the DataFrames into one?
>>
>> Thanks,
>> Peter
>>
>>
>>
