Michael,

There is only one schema: both versions have 200 string columns in one file.
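The "flattened version" Michael suggests caching (see the `CACHE TABLE ... AS SELECT` example below) can be illustrated outside Spark. A minimal sketch in plain Python, with hypothetical field names, of turning a nested record into a flat row of primitive values, which is the shape the Spark SQL in-memory columnar cache specializes on:

```python
# Hedged sketch (plain Python, not Spark; field names are hypothetical):
# flatten one level of nesting so every column holds a primitive value.

def flatten(row):
    """Turn a dict with one level of nested dicts into a flat dict."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, dict):
            # Promote nested fields to top-level columns, e.g. address.city
            # becomes address_city, mirroring SELECT address.city AS address_city.
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

nested = {"id": 1, "name": "a", "address": {"city": "SF", "zip": "94103"}}
print(flatten(nested))
# {'id': 1, 'name': 'a', 'address_city': 'SF', 'address_zip': '94103'}
```

In SQL terms this corresponds to selecting each nested field out as its own top-level column before caching, so the columnar builders for primitive types apply instead of the generic (Kryo) fallback.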

On Mon, Apr 20, 2015 at 9:08 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> Now this is very important:
>
>
>
> “Normal RDDs” refers to “batch RDDs”. However, the default in-memory
> serialization of RDDs which are part of a DStream is “serialized” rather than
> actual (hydrated) objects. The Spark documentation states that
> serialization is required for space and garbage-collection efficiency (but
> creates higher CPU load) – which makes sense considering the large number of
> RDDs which get discarded in a streaming app.
>
>
>
> So what does Databricks actually recommend as an object-oriented model for RDD
> elements used in Spark Streaming apps – flat or not? And can you provide a
> detailed description/spec of both?
>
>
>
> From: Michael Armbrust [mailto:mich...@databricks.com]
> Sent: Thursday, April 16, 2015 7:23 PM
> To: Evo Eftimov
> Cc: Christian Perez; user
>
>
> Subject: Re: Super slow caching in 1.3?
>
>
>
> Here are the types that we specialize; other types will be much slower.
> This is only for Spark SQL; normal RDDs do not serialize data that is
> cached.  I'll also note that until yesterday we were missing FloatType:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala#L154
>
>
>
> Christian, can you provide the schema of the fast and slow datasets?
>
>
>
> On Thu, Apr 16, 2015 at 10:14 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> Michael what exactly do you mean by "flattened" version/structure here e.g.:
>
> 1. An Object with only primitive data types as attributes
> 2. An Object with  no more than one level of other Objects as attributes
> 3. An Array/List of primitive types
> 4. An Array/List of Objects
>
> This question is in general about RDDs not necessarily RDDs in the context
> of SparkSQL
>
> When answering, can you also score how bad the performance of each of the
> above options is?
>
>
> -----Original Message-----
> From: Christian Perez [mailto:christ...@svds.com]
> Sent: Thursday, April 16, 2015 6:09 PM
> To: Michael Armbrust
> Cc: user
> Subject: Re: Super slow caching in 1.3?
>
> Hi Michael,
>
> Good question! We checked 1.2 and found that it is also slow caching the
> same flat Parquet file. Caching other file formats of the same data was
> faster by up to a factor of ~2. Note that the Parquet file was created in
> Impala but the other formats were written by Spark SQL.
>
> Cheers,
>
> Christian
>
> On Mon, Apr 6, 2015 at 6:17 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>> Do you think you are seeing a regression from 1.2?  Also, are you
>> caching nested data or flat rows?  The in-memory caching is not really
>> designed for nested data and so performs pretty slowly here (it's just
>> falling back to Kryo, and even then there are some locking issues).
>>
>> If so, would it be possible to try caching a flattened version?
>>
>> CACHE TABLE flattenedTable AS SELECT ... FROM parquetTable
>>
>> On Mon, Apr 6, 2015 at 5:00 PM, Christian Perez <christ...@svds.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> Has anyone else noticed a very slow time to cache a Parquet file? It
>>> takes 14 s per 235 MB (one block) uncompressed node-local Parquet file
>>> on M2 EC2 instances. Or are my expectations way off?
>>>
>>> Cheers,
>>>
>>> Christian
>>>
>>> --
>>> Christian Perez
>>> Silicon Valley Data Science
>>> Data Analyst
>>> christ...@svds.com
>>> @cp_phd
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>>> additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>
>
>
>



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
