Hey guys,

I'm running Spark 1.0.2 on AWS with 8 x c3.xlarge machines. I'm working with a subset of the GDELT dataset (57 columns; the full dataset is over 250 million rows, but my subset is only about 4 million) and trying to query it with Spark SQL.

Since a CSV importer isn't available, my first thought was to use nested case classes (Scala case classes are limited to 22 fields, and GDELT repeats the same groups of fields several times). The case classes look like this:

case class ActorInfo(
  Code: String, Name: String, CountryCode: String, KnownGroupCode: String,
  EthnicCode: String, Religion1Code: String, Religion2Code: String,
  Type1Code: String, Type2Code: String, Type3Code: String)

case class GeoInfo(
  `Type`: Int, FullName: String, CountryCode: String, ADM1Code: String,
  Lat: Float, `Long`: Float, FeatureID: Int)

case class GDeltRow(
  EventId: Int, Day: Int, MonthYear: Int, Year: Int, FractionDate: Float,
  Actor1: ActorInfo, Actor2: ActorInfo,
  IsRootEvent: Byte, EventCode: String, EventBaseCode: String,
  EventRootCode: String, QuadClass: Int, GoldsteinScale: Float,
  NumMentions: Int, NumSources: Int, NumArticles: Int, AvgTone: Float,
  Actor1Geo: GeoInfo, Actor2Geo: GeoInfo, ActionGeo: GeoInfo,
  DateAdded: String)

Then I use sc.textFile(...) to parse the CSV into an RDD[GDeltRow].
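The parsing itself is just a split and a positional mapping into the case classes, roughly like this (a simplified sketch: the path, delimiter, and column indices are illustrative, and error handling for malformed rows is elided):

// Sketch only: real code needs to handle missing/empty numeric fields.
val rows = sc.textFile("s3n://my-bucket/gdelt-subset/").map { line =>
  val f = line.split("\t", -1)
  def actor(i: Int) = ActorInfo(f(i), f(i + 1), f(i + 2), f(i + 3), f(i + 4),
    f(i + 5), f(i + 6), f(i + 7), f(i + 8), f(i + 9))
  def geo(i: Int) = GeoInfo(f(i).toInt, f(i + 1), f(i + 2), f(i + 3),
    f(i + 4).toFloat, f(i + 5).toFloat, f(i + 6).toInt)
  GDeltRow(f(0).toInt, f(1).toInt, f(2).toInt, f(3).toInt, f(4).toFloat,
    actor(5), actor(15),
    f(25).toByte, f(26), f(27), f(28), f(29).toInt, f(30).toFloat,
    f(31).toInt, f(32).toInt, f(33).toInt, f(34).toFloat,
    geo(35), geo(42), geo(49), f(56))
}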
I can query these records fine without caching. However, if I register the RDD with registerAsTable() and then call sqlContext.cacheTable(...), caching is extremely slow (it takes about an hour!), and any queries against the cached table are also extremely slow. I had tested Spark SQL with a flat structure (no nesting) on a different dataset, and there both caching and querying were very fast.

Thinking this might be an issue with the case classes themselves, I saved the data to Parquet files and loaded them with sqlContext.parquetFile(...), but the slowness is the same. That makes sense, since the internal structure of SchemaRDDs is basically the same either way: the schema is identical whether the data comes from Parquet or from the case classes.

Has anybody else experienced this slowness with nested structures? Is it a known problem that is being worked on? The only workarounds I can think of are converting everything to JSON (tedious) or constructing the Parquet files manually (also tedious).

thanks,
Evan
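P.S. For reference, the registration, caching, and Parquet steps are just the standard Spark SQL calls. A rough sketch of what I'm running (rows is the RDD[GDeltRow] from the parse above; table and path names here are placeholders):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[GDeltRow] => SchemaRDD conversion

rows.registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")  // this step alone takes about an hour
sqlContext.sql("SELECT COUNT(*) FROM gdelt WHERE Year = 1979").collect()

// Parquet variant, which is just as slow to query:
rows.saveAsParquetFile("s3n://my-bucket/gdelt.parquet")
val fromParquet = sqlContext.parquetFile("s3n://my-bucket/gdelt.parquet")
fromParquet.registerAsTable("gdelt_parquet")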