Re: LIMIT statement on SparkSQL

2016-10-26 Thread Liz Bai
Sorry for the typo in the last mail. Compared with Query-2, we have questions about Query-1 and Query-3. Also, may I know the difference between CollectLimit and BaseLimit? Thanks so much. Best, Liz > On 26 Oct 2016, at 7:25 PM, Liz Bai wrote: > > Hi all, > > We used Parquet and …
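The CollectLimit/BaseLimit distinction asked about here concerns Spark's physical limit operators: CollectLimit applies when the limit is the final operator and rows are pulled to the driver, while the local-limit/global-limit pair truncates each partition first and then takes n overall. As a rough conceptual model only (plain Python, not Spark APIs, and the operator behavior described is an assumption about Spark 2.0's planner), the two strategies can be sketched as:

```python
from itertools import islice

def collect_limit(partitions, n):
    """Driver-side take: pull rows partition by partition, stop as soon as n are collected."""
    out = []
    for part in partitions:
        for row in part:
            out.append(row)
            if len(out) == n:
                return out
    return out

def local_then_global_limit(partitions, n):
    """Each partition keeps at most n rows (local limit), then one pass keeps n overall (global limit)."""
    locally_limited = [list(islice(iter(part), n)) for part in partitions]
    merged = [row for part in locally_limited for row in part]
    return merged[:n]

parts = [range(0, 10), range(10, 20), range(20, 30)]
print(collect_limit(parts, 5))             # first 5 rows of the first partition
print(local_then_global_limit(parts, 5))   # same result, different work distribution
```

Both return the same rows here; the difference is where the truncation happens (driver versus executors), which is what matters for how much data each strategy reads.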

LIMIT statement on SparkSQL

2016-10-26 Thread Liz Bai
Hi all, We used Parquet and Spark 2.0 to do the testing. The table below is a summary of what we have found about the `Limit` keyword. Query-2 reveals that SparkSQL does stop early upon getting adequate results. But we are curious about Query-1 and Query-2. It seems that, either writing the result RDD a…

Re: LIMIT issue of SparkSQL

2016-10-23 Thread Liz Bai
Hi all, Let me clarify the problem. Suppose we have a simple table `A` with 100,000,000 records. Problem: when we execute the SQL query `select * from A limit 500`, it scans through all 100,000,000 records. The normal behaviour should be that once 500 records are found, the engine stops scanning. Detailed ob…

Dynamic Partitions When Writing Parquet

2016-09-01 Thread Liz Bai
Hi there, I have a question about writing Parquet using SparkSQL. Spark 1.4 already supports writing DataFrames as Parquet files with `partitionBy(colNames: String*)`, as SPARK-6561 fixed. Is there any method or plan to write Parquet with dynamic partitions? For example, instead of partiti…
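Partition-by-column writes produce Hive-style `key=value` subdirectories, one per distinct value of the partition column. A minimal plain-Python sketch of that directory layout (not Spark's actual writer; the `write_partitioned` helper and `part-00000` file name are illustrative assumptions) looks like this:

```python
import os
import tempfile

def write_partitioned(rows, partition_col, base_dir):
    """Groups rows by the partition column and writes each group into a
    Hive-style key=value subdirectory, mimicking partitionBy's layout."""
    by_key = {}
    for row in rows:
        by_key.setdefault(row[partition_col], []).append(row)
    for key, group in by_key.items():
        part_dir = os.path.join(base_dir, f"{partition_col}={key}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000"), "w") as f:
            for row in group:
                f.write(repr(row) + "\n")
    return sorted(os.listdir(base_dir))

rows = [{"date": "2016-09-01", "v": 1}, {"date": "2016-09-02", "v": 2}]
base = tempfile.mkdtemp()
print(write_partitioned(rows, "date", base))
# expect ['date=2016-09-01', 'date=2016-09-02']
```

"Dynamic" partitioning in the Hive sense means the set of subdirectories is determined by the data at write time rather than declared up front, which is exactly what the grouping step above does.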