I have not profiled this part, but I think one possible cause is that we allocate an array for every inner struct in every row (each struct value is represented by a Spark SQL row). I will play with it later and see what I find.
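
To make the hypothesis concrete, here is a minimal sketch in plain Scala. It is not Spark SQL's actual code: `RowFactory`, `Geo`, `Event`, `toNestedRow`, `toFlatRow`, and `allocations` are hypothetical names, and a row is modeled as a bare `Array[Any]`. It only counts how many backing arrays get built when each inner struct becomes its own row, compared with a flat layout.

```scala
object NestedRowAllocation {
  // Hypothetical, cut-down stand-ins for the GDELT case classes.
  case class Geo(fullName: String, lat: Float, lon: Float)
  case class Event(id: Int, actor1Geo: Geo, actor2Geo: Geo, actionGeo: Geo)

  // Builds rows (modeled as Array[Any]) and counts how many it has created.
  final class RowFactory {
    var arraysAllocated: Long = 0L
    def newRow(values: Any*): Array[Any] = {
      arraysAllocated += 1
      values.toArray
    }
  }

  // Nested layout: every inner struct is wrapped in its own row.
  def toNestedRow(f: RowFactory, e: Event): Array[Any] = {
    def geoRow(g: Geo): Array[Any] = f.newRow(g.fullName, g.lat, g.lon)
    f.newRow(e.id, geoRow(e.actor1Geo), geoRow(e.actor2Geo), geoRow(e.actionGeo))
  }

  // Flat layout: all leaf fields go into a single row.
  def toFlatRow(f: RowFactory, e: Event): Array[Any] = {
    val geoFields: Seq[Any] =
      Seq(e.actor1Geo, e.actor2Geo, e.actionGeo)
        .flatMap(g => Seq[Any](g.fullName, g.lat, g.lon))
    f.newRow((e.id +: geoFields): _*)
  }

  // Converts all events with a fresh factory and reports the array count.
  def allocations(events: Seq[Event],
                  convert: (RowFactory, Event) => Array[Any]): Long = {
    val f = new RowFactory
    events.foreach(e => convert(f, e))
    f.arraysAllocated
  }
}
```

Under this model, a row with k inner structs costs k + 1 arrays instead of 1. GDeltRow has five inner structs (two ActorInfo, three GeoInfo), so 4 million rows would mean roughly 24 million arrays rather than 4 million. Whether that accounts for the slowdown is exactly what still needs profiling.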

On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan <[email protected]> wrote:
> Hey guys,
>
> I'm using Spark 1.0.2 in AWS with 8 x c3.xlarge machines. I am
> working with a subset of the GDELT dataset (57 columns, > 250 million
> rows, but my subset is only 4 million) and trying to query it with
> Spark SQL.
>
> Since a CSV importer isn't available, my first thought was to use
> nested case classes (since Scala has a limit of 22 fields, plus there
> are lots of repeated fields in GDELT). The case classes look like
> this:
>
> case class ActorInfo(Code: String,
>                      Name: String,
>                      CountryCode: String,
>                      KnownGroupCode: String,
>                      EthnicCode: String, Religion1Code: String,
>                      Religion2Code: String,
>                      Type1Code: String, Type2Code: String,
>                      Type3Code: String)
>
> case class GeoInfo(`Type`: Int, FullName: String, CountryCode: String,
>                    ADM1Code: String, Lat: Float,
>                    `Long`: Float, FeatureID: Int)
>
> case class GDeltRow(EventId: Int, Day: Int, MonthYear: Int, Year: Int,
>                     FractionDate: Float,
>                     Actor1: ActorInfo, Actor2: ActorInfo,
>                     IsRootEvent: Byte, EventCode: String,
>                     EventBaseCode: String,
>                     EventRootCode: String, QuadClass: Int,
>                     GoldsteinScale: Float,
>                     NumMentions: Int, NumSources: Int, NumArticles: Int,
>                     AvgTone: Float,
>                     Actor1Geo: GeoInfo, Actor2Geo: GeoInfo,
>                     ActionGeo: GeoInfo, DateAdded: String)
>
> Then I use sc.textFile(...) to parse the CSV into an RDD[GDeltRow].
>
> I can query these records without caching. However, if I attempt to
> cache using registerAsTable() and then sqlContext.cacheTable(...), it
> is extremely slow (takes 1 hour !!).
>
> Any queries using them are also extremely slow.
>
> I had tested Spark SQL using a flat structure (no nesting) on a
> different dataset and the caching and queries were both extremely
> fast.
>
> Thinking that this is an issue with the case classes, I saved them to
> parquet files and used sqlContext.parquetFile(....), but the slowness
> is the same. This makes sense, since the internal structure of
> SchemaRdds is basically the same. In both cases, for both parquet and
> case classes, the schema is the same.
>
> Has anybody else experienced this slowness with nested structures? Is
> this a known problem and being worked on?
>
> The only way to work around this issue I can think of is to convert to
> JSON, which is tedious, or to construct Parquet files manually (also
> tedious).
>
> thanks,
> Evan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
