Hi,

What is the schema that Spark infers for your data? The compression logic of Spark's in-memory caching depends on the column types.
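For example, something like this in spark-shell prints the inferred schema (just a sketch; it assumes the Databricks spark-avro reader mentioned further down the thread, and the path is a placeholder):

    // Sketch only: check what schema Spark infers for the Avro file.
    val df = spark.read.format("com.databricks.spark.avro").load("/path/to/file.avro")
    df.printSchema()
    // Flat primitive columns (int, long, string, ...) can be compressed in the
    // columnar cache; deeply nested struct/array columns generally cannot.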
// maropu

On Wed, Nov 16, 2016 at 5:26 PM, Prithish <prith...@gmail.com> wrote:

> Thanks for your response.
>
> I did some more tests and I am seeing that when I have a flatter structure
> for my AVRO, the cache memory use is close to the CSV. But when I use a few
> levels of nesting, the cache memory usage blows up. This is really critical
> for planning the cluster we will be using. To avoid using a larger cluster,
> it looks like we will have to consider keeping the structure as flat as
> possible.
>
> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>
>> (Adding user@spark back to the discussion)
>>
>> Well, the CSV vs. AVRO difference might be simpler to explain. CSV has a lot
>> of scope for compression. On the other hand, Avro and Parquet are already
>> compressed and just store extra schema info, AFAIK. Avro and Parquet are both
>> going to make your data smaller, Parquet through compressed columnar storage,
>> and Avro through its binary data format.
>>
>> Next, talking about the 62kb becoming 1224kb: I actually do not see such a
>> massive blow-up. The Avro file you shared is 28kb on my system and becomes
>> 53.7kb when cached in memory deserialized and 52.9kb when cached in memory
>> serialized. Exact same numbers with Parquet as well. This is expected
>> behavior, if I am not wrong.
>>
>> In fact, now that I think about it, even larger blow-ups might be valid,
>> since your data must have been deserialized from the compressed Avro format,
>> making it bigger. The order of magnitude of the difference in size would
>> depend on the type of data you have and how well it was compressible.
>>
>> The purpose of these formats is to store data to persistent storage in a way
>> that's faster to read from, not to reduce cache-memory usage.
>>
>> Maybe others here have more info to share.
>>
>> Regards,
>> Shreya
>>
>> Sent from my Windows 10 phone
>>
>> *From:* Prithish <prith...@gmail.com>
>> *Sent:* Tuesday, November 15, 2016 11:04 PM
>> *To:* Shreya Agarwal <shrey...@microsoft.com>
>> *Subject:* Re: AVRO File size when caching in-memory
>>
>> I did another test and am noting my observations here. These were done with
>> the same data in AVRO and CSV formats.
>>
>> In AVRO, the file size on disk was 62kb and, after caching, the in-memory
>> size is 1224kb.
>> In CSV, the file size on disk was 690kb and, after caching, the in-memory
>> size is 290kb.
>>
>> I'm guessing that the Spark caching is not able to compress when the source
>> is Avro. Not sure if this is just my immature conclusion. Waiting to hear
>> your observations.
>>
>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>>
>>> Thanks for your response.
>>>
>>> I have attached the code (that I ran using the spark-shell) as well as a
>>> sample Avro file. After you run this code, the data is cached in memory and
>>> you can go to the "Storage" tab in the Spark UI (localhost:4040) and see the
>>> size it uses. In this example the size is small, but in my actual scenario,
>>> the source file size is 30GB and the in-memory size comes to around 800GB.
>>> I am trying to understand whether this is expected when using Avro or not.
>>>
>>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>>
>>>> I haven't used Avro ever. But if you can send over a quick sample code,
>>>> I can run and see if I repro it and maybe debug.
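A minimal spark-shell sketch of the kind of repro being discussed might look like the following (the original attachment is not reproduced here, so the path and table name are placeholders):

    // Hypothetical repro sketch, not the attachment from this thread.
    val df = spark.read.format("com.databricks.spark.avro").load("/path/to/sample.avro")
    df.createOrReplaceTempView("avro_table")
    spark.catalog.cacheTable("avro_table")
    spark.table("avro_table").count()  // force materialization of the cache
    // Then compare "Size in Memory" on the Storage tab of the Spark UI
    // (localhost:4040) with the on-disk size of the Avro file.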
>>>>
>>>> *From:* Prithish [mailto:prith...@gmail.com]
>>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>>> *Cc:* User <user@spark.apache.org>
>>>> *Subject:* Re: AVRO File size when caching in-memory
>>>>
>>>> Anyone?
>>>>
>>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>>
>>>> I am using 2.0.1 and the Databricks Avro library 3.0.1. I am running this
>>>> on the latest AWS EMR release.
>>>>
>>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Spark version? Are you using Tungsten?
>>>>
>>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>>> >
>>>> > Can someone please explain why this happens?
>>>> >
>>>> > When I read a 600kb AVRO file and cache it in memory (using cacheTable),
>>>> > it shows up as 11mb (Storage tab in the Spark UI). I have tried this with
>>>> > different file sizes, and the in-memory size is always proportionate.
>>>> > I thought Spark compresses when using cacheTable.

--
---
Takeshi Yamamuro