Michael,

There is only one schema: both versions have 200 string columns in one file.
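To make that concrete, here is a minimal sketch of the shape of the data
against the 1.3 DataFrame API (the column names and values below are made up
for illustration):

    // Hypothetical reconstruction: one flat schema of 200 string columns.
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    val schema = StructType((1 to 200).map(i => StructField(s"c$i", StringType)))
    val rows = sc.parallelize(Seq(Row.fromSeq(Seq.fill(200)("x"))))
    val df = sqlContext.createDataFrame(rows, schema)
    df.registerTempTable("wide")
    sqlContext.cacheTable("wide") // all-STRING columns, which ColumnBuilder specializes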
On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> Now this is very important:
>
> "Normal RDDs" refers to "batch RDDs". However, the default in-memory
> storage of the RDDs that make up a DStream is "serialized" rather than
> actual (hydrated) objects. The Spark documentation states that
> serialization is required for space and garbage-collection efficiency
> (but creates higher CPU load), which makes sense considering the large
> number of RDDs that get discarded in a streaming app.
>
> So what does Databricks actually recommend as an object-oriented model
> for RDD elements used in Spark Streaming apps, flat or not, and can you
> provide a detailed description / spec of both?
>
> From: Michael Armbrust [mailto:mich...@databricks.com]
> Sent: Thursday, April 16, 2015 7:23 PM
> To: Evo Eftimov
> Cc: Christian Perez; user
> Subject: Re: Super slow caching in 1.3?
>
> Here are the types that we specialize; other types will be much slower.
> This is only for Spark SQL; normal RDDs do not serialize data that is
> cached. I'll also note that until yesterday we were missing FloatType:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154
>
> Christian, can you provide the schema of the fast and slow datasets?
>
> On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> Michael, what exactly do you mean by a "flattened" version/structure
> here, e.g.:
>
> 1. An object with only primitive data types as attributes
> 2. An object with no more than one level of other objects as attributes
> 3. An array/list of primitive types
> 4. An array/list of objects
>
> This question is in general about RDDs, not necessarily RDDs in the
> context of Spark SQL.
>
> When answering, can you also score how bad the performance of each of
> the above options is?
>
> -----Original Message-----
> From: Christian Perez [mailto:christ...@svds.com]
> Sent: Thursday, April 16, 2015 6:09 PM
> To: Michael Armbrust
> Cc: user
> Subject: Re: Super slow caching in 1.3?
>
> Hi Michael,
>
> Good question! We checked 1.2 and found that it is also slow caching the
> same flat Parquet file. Caching other file formats of the same data was
> faster by up to a factor of ~2. Note that the Parquet file was created
> in Impala but the other formats were written by Spark SQL.
>
> Cheers,
>
> Christian
>
> On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>> Do you think you are seeing a regression from 1.2? Also, are you
>> caching nested data or flat rows? The in-memory caching is not really
>> designed for nested data and so performs pretty slowly here (it's just
>> falling back to Kryo, and even then there are some locking issues).
>>
>> If so, would it be possible to try caching a flattened version?
>>
>> CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
>>
>> On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez <christ...@svds.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> Has anyone else noticed very slow time to cache a Parquet file? It
>>> takes 14 s per 235 MB (one block) uncompressed, node-local Parquet
>>> file on m2 EC2 instances. Or are my expectations way off...
>>>
>>> Cheers,
>>>
>>> Christian
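For anyone who wants to reproduce the timing, a rough sketch against the 1.3
API in spark-shell (the file path and table name below are hypothetical):

    // Rough repro sketch; the path and table name are made up.
    val df = sqlContext.parquetFile("/local/data/flat.parquet")
    df.registerTempTable("parquetTable")
    sqlContext.cacheTable("parquetTable")

    val start = System.nanoTime()
    df.count() // the first action materializes the in-memory columnar cache
    println(s"cache build: ${(System.nanoTime() - start) / 1e9} s")

On Evo's serialization question: for plain RDDs that tradeoff is chosen per
RDD, e.g. rdd.persist(StorageLevel.MEMORY_ONLY) for hydrated objects versus
rdd.persist(StorageLevel.MEMORY_ONLY_SER) for the serialized form that
DStreams default to.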
--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org