Hi,

What is the schema that Spark infers for your data? The compression logic of Spark's in-memory caching depends on the column types.
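For example, something like this in spark-shell prints the inferred schema (just a sketch; it assumes the Databricks spark-avro reader mentioned further down the thread, and the path is a placeholder):

    // Sketch only: check what schema Spark infers for the Avro file.
    val df = spark.read.format("com.databricks.spark.avro").load("/path/to/file.avro")
    df.printSchema()
    // Flat primitive columns (int, long, string, ...) can be compressed in the
    // columnar cache; deeply nested struct/array columns generally cannot.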
// maropu

On Wed, Nov 16, 2016 at 5:26 PM, Prithish <prith...@gmail.com> wrote:

> Thanks for your response.
>
> I did some more tests and I am seeing that when I have a flatter structure
> for my AVRO, the cache memory use is close to the CSV. But when I use a few
> levels of nesting, the cache memory usage blows up. This is really critical
> for planning the cluster we will be using. To avoid using a larger cluster,
> it looks like we will have to consider keeping the structure as flat as
> possible.
>
> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>
>> (Adding user@spark back to the discussion)
>>
>> Well, the CSV vs. AVRO difference might be simpler to explain. CSV has a lot
>> of scope for compression. On the other hand, Avro and Parquet are already
>> compressed and just store extra schema info, AFAIK. Avro and Parquet are both
>> going to make your data smaller, Parquet through compressed columnar storage,
>> and Avro through its binary data format.
>>
>> Next, talking about the 62kb becoming 1224kb: I actually do not see such a
>> massive blow-up. The Avro file you shared is 28kb on my system and becomes
>> 53.7kb when cached in memory deserialized and 52.9kb when cached in memory
>> serialized. Exact same numbers with Parquet as well. This is expected
>> behavior, if I am not wrong.
>>
>> In fact, now that I think about it, even larger blow-ups might be valid,
>> since your data must have been deserialized from the compressed Avro format,
>> making it bigger. The order of magnitude of the difference in size would
>> depend on the type of data you have and how well it was compressible.
>>
>> The purpose of these formats is to store data to persistent storage in a way
>> that's faster to read from, not to reduce cache-memory usage.
>>
>> Maybe others here have more info to share.
>>
>> Regards,
>> Shreya
>>
>> Sent from my Windows 10 phone
>>
>> *From:* Prithish <prith...@gmail.com>
>> *Sent:* Tuesday, November 15, 2016 11:04 PM
>> *To:* Shreya Agarwal <shrey...@microsoft.com>
>> *Subject:* Re: AVRO File size when caching in-memory
>>
>> I did another test and am noting my observations here. These were done with
>> the same data in AVRO and CSV formats.
>>
>> In AVRO, the file size on disk was 62kb and, after caching, the in-memory
>> size is 1224kb.
>> In CSV, the file size on disk was 690kb and, after caching, the in-memory
>> size is 290kb.
>>
>> I'm guessing that the Spark caching is not able to compress when the source
>> is Avro. Not sure if this is just my immature conclusion. Waiting to hear
>> your observations.
>>
>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>>
>>> Thanks for your response.
>>>
>>> I have attached the code (that I ran using the spark-shell) as well as a
>>> sample Avro file. After you run this code, the data is cached in memory and
>>> you can go to the "Storage" tab in the Spark UI (localhost:4040) and see the
>>> size it uses. In this example the size is small, but in my actual scenario,
>>> the source file size is 30GB and the in-memory size comes to around 800GB.
>>> I am trying to understand whether this is expected when using Avro or not.
>>>
>>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>>
>>>> I haven't used Avro ever. But if you can send over a quick sample code,
>>>> I can run and see if I repro it and maybe debug.
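A minimal spark-shell sketch of the kind of repro being discussed might look like the following (the original attachment is not reproduced here, so the path and table name are placeholders):

    // Hypothetical repro sketch, not the attachment from this thread.
    val df = spark.read.format("com.databricks.spark.avro").load("/path/to/sample.avro")
    df.createOrReplaceTempView("avro_table")
    spark.catalog.cacheTable("avro_table")
    spark.table("avro_table").count()  // force materialization of the cache
    // Then compare "Size in Memory" on the Storage tab of the Spark UI
    // (localhost:4040) with the on-disk size of the Avro file.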
>>>>
>>>> *From:* Prithish [mailto:prith...@gmail.com]
>>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>>> *Cc:* User <user@spark.apache.org>
>>>> *Subject:* Re: AVRO File size when caching in-memory
>>>>
>>>> Anyone?
>>>>
>>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>>
>>>> I am using 2.0.1 and the Databricks Avro library 3.0.1. I am running this
>>>> on the latest AWS EMR release.
>>>>
>>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Spark version? Are you using Tungsten?
>>>>
>>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>>> >
>>>> > Can someone please explain why this happens?
>>>> >
>>>> > When I read a 600kb AVRO file and cache it in memory (using cacheTable),
>>>> > it shows up as 11mb (Storage tab in the Spark UI). I have tried this with
>>>> > different file sizes, and the in-memory size is always proportionate.
>>>> > I thought Spark compresses when using cacheTable.

--
---
Takeshi Yamamuro