Hi, I want to ask about an issue I have faced while using Spark. I load DataFrames from Parquet files. Some of these Parquet datasets have lots of partitions and more than 10 million rows.
Running "where id = x" query on dataframe scans all partitions. When saving to rdd object/parquet there is a partition column. The mentioned "where" query on the partition column should zero in and only open possible partitions. Sometimes I need to create index on other columns too to speed things up. Without index I feel its not production ready. I see there are two parts to do this: Ability of spark SQL to create/use indexes - Mentioned as to be implemented in documentation Parquet index support- arriving in v2.0 currently it is v1.8 When can we hope to get index support that Spark SQL/catalyst can use. Is anyone using Spark SQL in production. How did you handle this ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDF-s-for-production-tp23926.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org