Hi, I want to ask about an issue I have faced while using Spark. I load DataFrames from Parquet files. Some of these Parquet datasets have lots of partitions and more than 10 million rows.
Running "where id = x" query on dataframe scans all partitions. When saving to rdd object/parquet there is a partition column. The mentioned "where" query on the partition column should zero in and only open possible partitions. Sometimes I need to create index on other columns too to speed things up. Without index I feel its not production ready. I see there are two parts to do this: Ability of spark SQL to create/use indexes - Mentioned as to be implemented in documentation Parquet index support- arriving in v2.0 currently it is v1.8 When can we hope to get index support that Spark SQL/catalyst can use. Is anyone using Spark SQL in production. How did you handle this ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DDF-s-for-production-tp23926.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org