Awesome, thank you Michael for the detailed example!
I'll look into whether I can use this approach for my use case. If so, I
could avoid the overhead of repeatedly registering a temp table for one-off
queries, instead registering the table once and relying on the injected
strategy. Don't know how
registerTempTable is backed by an in-memory hash table that maps table name
(a string) to a logical query plan. Fragments of that logical query plan
may or may not be cached (but calling register alone will not result in any
materialization of results). In Spark 2.0 we renamed this function to
createOrReplaceTempView.
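The registry behaviour described above can be sketched in plain Scala. Everything here (`LogicalPlan`, the map, the flag) is an illustrative stand-in, not Spark's real internals; the point is only that registering is a map insert over a lazy plan, with no materialization:

```scala
import scala.collection.mutable

// Tracks whether any plan has actually been evaluated.
var evaluated = false

// Stand-in for a logical query plan: nothing runs until execute() is called.
final case class LogicalPlan(description: String) {
  def execute(): Seq[Int] = { evaluated = true; Seq(1, 2, 3) }
}

// The "in-memory hash table that maps table name to a logical query plan".
val tempTables = mutable.HashMap.empty[String, LogicalPlan]

def registerTempTable(name: String, plan: LogicalPlan): Unit =
  tempTables(name) = plan // just a map insert; nothing is materialized

registerTempTable("tmp", LogicalPlan("scan of df"))

println(tempTables.contains("tmp")) // true
println(evaluated)                  // false: registering ran nothing
```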
it would be great if we establish this.
I know in Hive these temporary tables "CREATE TEMPORARY TABLE ..." are
private to the session and are put in a hidden staging directory as below
/user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-47_319_5605745346163312826-10
and removed when the session ends.
Thanks for the link, I hadn't come across this.
According to
https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html
>
> and I quote
>
> "registerTempTable()
>
> registerTempTable() creates an in-memory table that is scoped to the
> cluster in which it was created."
A bit of a gray area here, I am afraid; I was trying to experiment with it.
According to
https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html
and I quote
"registerTempTable()
registerTempTable() creates an in-memory table that is scoped to the
cluster in which it was created."
Hi again Mich,
"But the thing is that I don't explicitly cache the tempTables ..".
>
> I believe tempTable is created in-memory and is already cached
>
That surprises me since there is a sqlContext.cacheTable method to
explicitly cache a table in memory. Or am I missing something? This could
explain it.
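The distinction being questioned here can be made concrete with a toy model (plain Scala, invented names, not Spark code): registering a table and caching it are two separate steps, mirroring the fact that sqlContext.cacheTable must be called explicitly.

```scala
import scala.collection.mutable

// Hypothetical catalog entry: registration alone never sets cached = true.
final case class TableEntry(plan: String, var cached: Boolean = false)

val catalog = mutable.HashMap.empty[String, TableEntry]

def registerTempTable(name: String, plan: String): Unit =
  catalog(name) = TableEntry(plan) // no caching implied

def cacheTable(name: String): Unit =
  catalog(name).cached = true // explicit opt-in, like sqlContext.cacheTable

registerTempTable("tmp", "scan of df")
println(catalog("tmp").cached) // false: registration alone does not cache

cacheTable("tmp")
println(catalog("tmp").cached) // true: only after an explicit cacheTable
```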
Well, I suppose one can drop the tempTable as below:
scala> df.registerTempTable("tmp")
scala> spark.sql("select count(1) from tmp").show
+--------+
|count(1)|
+--------+
|  904180|
+--------+
scala> spark.sql("drop table if exists tmp")
res22: org.apache.spark.sql.DataFrame = []
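For what it's worth, in Spark 2.x the direct API for this is spark.catalog.dropTempView("tmp"). What dropping does can be sketched in plain Scala (a toy model, not Spark code): it removes only the name-to-plan mapping, leaving the underlying DataFrame/data untouched.

```scala
import scala.collection.mutable

// Stand-in registry: the temp table name maps to a logical plan.
val registry = mutable.HashMap("tmp" -> "logical plan for df")

// Stands in for the data the plan reads; dropping never touches it.
val underlyingData = Seq(904180)

def dropTempTable(name: String): Unit = registry.remove(name)

dropTempTable("tmp")
println(registry.contains("tmp")) // false: the name is gone
println(underlyingData.nonEmpty)  // true: the data itself is unaffected
```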
Also your point
"B
Hi Mich,
Thank you again for your reply.
As I see you are caching the table already sorted
>
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> and the next stage is you are creating multiple tempTables (different
> ranges) that cache a subset of rows already cached in RDD. The data stored
>
Hi Michael,
As I see you are caching the table already sorted
val keyValRDDSorted = keyValRDD.sortByKey().cache
and the next stage is you are creating multiple tempTables (different
ranges) that cache a subset of rows already cached in RDD. The data stored
in the tempTable is in Spark's in-memory columnar format.
Hi Mich,
Thank you for your quick reply!
What type of table is the underlying table? Is it Hbase, Hive ORC or what?
>
It is a custom datasource, but ultimately backed by HBase.
> By Key you mean a UNIQUE ID or something similar and then you do multiple
> scans on the tempTable which stores data using in-memory columnar format.
Hi Michael.
What type of table is the underlying table? Is it Hbase, Hive ORC or what?
By Key you mean a UNIQUE ID or something similar and then you do multiple
scans on the tempTable which stores data using in-memory columnar format.
That is the optimisation of tempTable storage as far as I know.
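A toy illustration of what "in-memory columnar format" means, with nothing Spark-specific assumed (this is not Spark's CachedBatch): rows are transposed into one array per column, so a scan that touches a single column reads one contiguous array instead of walking every row.

```scala
// Row-oriented input, as you would get from a key/value scan.
final case class Row(key: Int, value: String)

val rows = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))

// Row-oriented -> column-oriented: one array per column.
val keyColumn: Array[Int]      = rows.map(_.key).toArray
val valueColumn: Array[String] = rows.map(_.value).toArray

// A count over one column never touches the other column's array.
val count = keyColumn.length
println(count) // 3
```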