You should check out parquet.

If you want to avoid 5-minute log files, you can have an hourly (or
daily!) MR job that compacts them. Another nice thing about Parquet is
filter pushdown: if you only want a smaller range of time, you can avoid
deserializing most of the other data.
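
A rough, untested sketch of both in the spark-shell (Spark 1.3-style
API; the paths and the "ts" timestamp column are placeholders, not your
actual schema):

// Compaction: read a day's worth of 5-minute files and rewrite them
// as a handful of larger Parquet files.
val day = sqlContext.parquetFile("/analytics/2015/05/02")
day.repartition(24).saveAsParquetFile("/analytics/compacted/2015/05/02")

// Filter pushdown: Parquet keeps min/max stats per row group, so a
// predicate on the time column lets it skip most of the data.
// Depending on your Spark version you may have to enable
// spark.sql.parquet.filterPushdown for this to kick in.
val hour = sqlContext
  .parquetFile("/analytics/compacted/2015/05/02")
  .filter("ts >= '2015-05-02 13:00:00' and ts < '2015-05-02 14:00:00'")

This assumes the 5-minute files are already Parquet; if they are raw
logs you would parse them into a DataFrame first and save that out as
Parquet.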

On Tuesday, May 5, 2015, Rendy Bambang Junior <rendy.b.jun...@gmail.com>
wrote:

> Thanks, I wasn't aware of splittable file formats.
>
> If that is the case, does the number of files affect Spark performance?
> Maybe because of the overhead of opening files? And is that problem solved
> by having big files in a splittable file format?
>
> Any suggestions from your experience on how to organize data in a
> splittable file format on HDFS for Spark?
>
> Rendy
> On May 6, 2015 1:03 AM, "Jonathan Coveney" <jcove...@gmail.com> wrote:
>
>> "As per my understanding, storing 5minutes file means we could not
>> create RDD more granular than 5minutes."
>>
>> This depends on the file format. Many file formats are splittable (like
>> Parquet), meaning that you can seek to various points in the file.
>>
>> 2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior <rendy.b.jun...@gmail.com>:
>>
>>> Let's say I am storing my data in HDFS with the folder structure and
>>> file partitioning below:
>>> /analytics/2015/05/02/partition-2015-05-02-13-50-0000
>>> Note that a new file is created every 5 minutes.
>>>
>>> As per my understanding, storing 5-minute files means we could not
>>> create an RDD more granular than 5 minutes.
>>>
>>> On the other hand, when we want to aggregate monthly data, the number
>>> of files will be enormous (around 84000 files).
>>>
>>> My question is, what are the considerations for saying that the number
>>> of files to be loaded into an RDD is just 'too many'? Is 84000 'too
>>> many' files?
>>>
>>> One thing that comes to my mind is the overhead when Spark tries to
>>> open files, but I'm not sure whether that is a valid concern.
>>>
>>> Rendy
>>>
>>
>>
