First of all, a plain select * is not a useful query to benchmark. A user
would very rarely need all 3.6M records for visual analysis.
Second, collect() forces all of the data to move from the executors to the
driver. Instead, write the result out to another table or to HDFS.
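
As a rough sketch of that suggestion (not a tuned benchmark), the snippet below
writes the query result out from the executors instead of collecting it on the
driver. The table names source_table and source_table_copy and the HDFS path
are hypothetical placeholders, and sc is assumed to be the SparkContext that
spark-shell provides:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// `sc` is assumed to be the SparkContext created by spark-shell.
val hiveContext = new HiveContext(sc)

// Hypothetical source table; substitute the table being queried.
val df = hiveContext.sql("SELECT * FROM source_table")

// Option 1: save the result as another table. The rows are written by the
// executors and never pass through the driver.
df.write.mode(SaveMode.Overwrite).saveAsTable("source_table_copy")

// Option 2: write the result to HDFS as Parquet (hypothetical path).
df.write.mode(SaveMode.Overwrite).parquet("hdfs:///tmp/source_table_copy")

// To eyeball the data, fetch only a few rows rather than calling collect().
df.show(20)
```

Timing one of the write calls, rather than collect(), should give a fairer
comparison with Hive, because the driver is no longer the bottleneck.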
Also, Spark is more beneficial when you ha
Hi,
I am using Hive 1.1.0 and Spark 1.5.1, and I am creating a HiveContext in
spark-shell.
I am seeing Spark SQL perform worse than Hive.
By default, Hive returns the result in 27 seconds for a plain select * query
on a 1 GB dataset containing 3623203 records, while Spark SQL returns it in