I have not profiled this part, but I think one possible cause is that we
allocate an array for every inner struct in every row (every struct
value is represented by a Spark SQL row). I will play with it later and see
what I find.
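
To make the guess above concrete, here is a minimal plain-Scala sketch of
why a nested layout might allocate more per row. This is not Spark SQL's
actual code: Geo, Event, NestedVsFlat, toNested and toFlat are hypothetical
names, and the case classes are trimmed stand-ins for the ones in the quoted
message. The counter only tracks arrays the conversion creates explicitly.

```scala
// Hypothetical, trimmed stand-ins for GeoInfo and GDeltRow.
case class Geo(fullName: String, lat: Float, lon: Float)
case class Event(id: Int, actor1Geo: Geo, actor2Geo: Geo)

object NestedVsFlat {
  // Number of arrays created explicitly by the conversions below.
  var allocations = 0L

  // Nested layout: one array for the row plus one for each inner struct.
  def toNested(p: Product): Array[Any] = {
    allocations += 1
    p.productIterator.map {
      case inner: Product => toNested(inner) // recurse into the inner struct
      case leaf           => leaf
    }.toArray
  }

  // Flat layout: a single array holding every leaf value in field order.
  def toFlat(p: Product): Array[Any] = {
    allocations += 1
    def leaves(x: Any): Iterator[Any] = x match {
      case inner: Product => inner.productIterator.flatMap(leaves)
      case leaf           => Iterator(leaf)
    }
    leaves(p).toArray
  }
}
```

For an Event with two Geo fields, toNested builds three arrays per row while
toFlat builds one. With the five inner structs in GDeltRow that would be six
versus one per row, if the guess about the cause is right.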


On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan <[email protected]> wrote:

> Hey guys,
>
> I'm using Spark 1.0.2 in AWS with 8 x c3.xlarge machines.   I am
> working with a subset of the GDELT dataset (57 columns, > 250 million
> rows, but my subset is only 4 million) and trying to query it with
> Spark SQL.
>
> Since a CSV importer isn't available, my first thought was to use
> nested case classes (since Scala case classes are limited to 22 fields,
> plus there are lots of repeated fields in GDELT). The case classes look
> like this:
>
> case class ActorInfo(Code: String,
>                      Name: String,
>                      CountryCode: String,
>                      KnownGroupCode: String,
>                      EthnicCode: String, Religion1Code: String,
>                      Religion2Code: String,
>                      Type1Code: String, Type2Code: String,
>                      Type3Code: String)
>
> case class GeoInfo(`Type`: Int, FullName: String, CountryCode: String,
>                    ADM1Code: String, Lat: Float,
>                    `Long`: Float, FeatureID: Int)
>
> case class GDeltRow(EventId: Int, Day: Int, MonthYear: Int, Year: Int,
>                     FractionDate: Float,
>                     Actor1: ActorInfo, Actor2: ActorInfo,
>                     IsRootEvent: Byte, EventCode: String,
>                     EventBaseCode: String,
>                     EventRootCode: String, QuadClass: Int,
>                     GoldsteinScale: Float,
>                     NumMentions: Int, NumSources: Int, NumArticles: Int,
>                     AvgTone: Float,
>                     Actor1Geo: GeoInfo, Actor2Geo: GeoInfo,
>                     ActionGeo: GeoInfo, DateAdded: String)
>
> Then I use sc.textFile(...) to parse the CSV into an RDD[GDeltRow].
>
> I can query these records without caching. However, if I attempt to
> cache using registerAsTable() and then sqlContext.cacheTable(...), it
> is extremely slow (takes 1 hour!!).
>
> Any queries using them are also extremely slow.
>
> I had tested Spark SQL using a flat structure (no nesting) on a
> different dataset and the caching and queries were both extremely
> fast.
>
> Thinking that this is an issue with the case classes, I saved them to
> parquet files and used sqlContext.parquetFile(...), but the slowness
> is the same. This makes sense, since the internal structure of
> SchemaRDDs is basically the same. In both cases, for both parquet and
> case classes, the schema is the same.
>
> Has anybody else experienced this slowness with nested structures?  Is
> this a known problem and being worked on?
>
> The only way to work around this issue I can think of is to convert to
> JSON, which is tedious, or to construct Parquet files manually (also
> tedious).
>
> thanks,
> Evan
