Hey guys,

I'm using Spark 1.0.2 on AWS with 8 x c3.xlarge machines.  I'm working
with a subset of the GDELT dataset (57 columns, over 250 million rows
in full, though my subset is only 4 million rows) and trying to query
it with Spark SQL.

Since a CSV importer isn't available, my first thought was to use
nested case classes (Scala 2.10 case classes are limited to 22 fields,
and GDELT has several repeated groups of fields).  The case classes
look like this:

case class ActorInfo(Code: String,
                     Name: String,
                     CountryCode: String,
                     KnownGroupCode: String,
                     EthnicCode: String,
                     Religion1Code: String,
                     Religion2Code: String,
                     Type1Code: String,
                     Type2Code: String,
                     Type3Code: String)

case class GeoInfo(`Type`: Int,
                   FullName: String,
                   CountryCode: String,
                   ADM1Code: String,
                   Lat: Float,
                   `Long`: Float,
                   FeatureID: Int)

case class GDeltRow(EventId: Int, Day: Int, MonthYear: Int, Year: Int,
                    FractionDate: Float,
                    Actor1: ActorInfo, Actor2: ActorInfo,
                    IsRootEvent: Byte, EventCode: String, EventBaseCode: String,
                    EventRootCode: String, QuadClass: Int, GoldsteinScale: Float,
                    NumMentions: Int, NumSources: Int, NumArticles: Int,
                    AvgTone: Float,
                    Actor1Geo: GeoInfo, Actor2Geo: GeoInfo, ActionGeo: GeoInfo,
                    DateAdded: String)

Then I use sc.textFile(...) to parse the CSV into an RDD[GDeltRow].
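
For reference, the parsing step looks roughly like this (a sketch: the
S3 path and the exact column indices are illustrative, not the real
ones I use):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[GDeltRow] -> SchemaRDD

// Helpers to build the nested pieces from a split line
def actor(f: Array[String], i: Int) =
  ActorInfo(f(i), f(i+1), f(i+2), f(i+3), f(i+4), f(i+5), f(i+6), f(i+7), f(i+8), f(i+9))
def geo(f: Array[String], i: Int) =
  GeoInfo(f(i).toInt, f(i+1), f(i+2), f(i+3), f(i+4).toFloat, f(i+5).toFloat, f(i+6).toInt)

def parseLine(line: String): GDeltRow = {
  val f = line.split("\t", -1)      // split on the delimiter (GDELT exports are tab-separated)
  GDeltRow(f(0).toInt, f(1).toInt, f(2).toInt, f(3).toInt, f(4).toFloat,
           actor(f, 5), actor(f, 15),
           f(25).toByte, f(26), f(27), f(28), f(29).toInt, f(30).toFloat,
           f(31).toInt, f(32).toInt, f(33).toInt, f(34).toFloat,
           geo(f, 35), geo(f, 42), geo(f, 49), f(56))
}

val gdelt = sc.textFile("s3n://my-bucket/gdelt-subset/").map(parseLine)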

I can query these records fine without caching.  However, if I register
the RDD with registerAsTable() and then call sqlContext.cacheTable(...),
the caching step is extremely slow (it takes about an hour!), and any
queries against the cached table are also extremely slow.
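
Concretely, the caching step is just the standard pattern (the table
name and the query below are placeholders):

gdelt.registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")      // this is the step that takes ~1 hour

// Even a simple aggregation against the cached table crawls
sqlContext.sql("SELECT EventRootCode, COUNT(*) FROM gdelt GROUP BY EventRootCode").collect()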

I had tested Spark SQL using a flat structure (no nesting) on a
different dataset and the caching and queries were both extremely
fast.

Thinking this might be an issue with the case classes themselves, I
saved the data to Parquet files and loaded it with
sqlContext.parquetFile(...), but it is just as slow.  That makes sense,
since the internal structure of a SchemaRDD is basically the same
either way: the schema is identical whether the rows come from Parquet
or from case classes.
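
For completeness, the Parquet round trip I tried looks like this (the
paths are again placeholders):

gdelt.saveAsParquetFile("s3n://my-bucket/gdelt.parquet")

val gdeltParquet = sqlContext.parquetFile("s3n://my-bucket/gdelt.parquet")
gdeltParquet.registerAsTable("gdelt_parquet")
sqlContext.cacheTable("gdelt_parquet")   // just as slow as caching the case-class RDD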

Has anybody else experienced this slowness with nested structures?  Is
this a known problem that is being worked on?

The only workarounds I can think of are to convert the data to JSON
(tedious) or to construct the Parquet files manually (also tedious).

thanks,
Evan
