I've just benchmarked Spark and Impala.  Same data (in S3), same query,
same cluster.

Impala has a long load time, since it cannot load directly from S3.  I have
to create a Hive table on S3, then insert from that into an Impala table,
which is slow.  Spark took about 600s for the query and Impala 250s, but
Impala needed about 6,000s to load the data from S3 first.  If you're going
to go the long-initial-load-then-quick-queries route, go for Redshift.  On
equivalent hardware, that took about 4,000s to load, but then queries take
about 5s each.
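
To make the trade-off concrete, here is a rough back-of-the-envelope sketch
using only the figures above.  It assumes every later query costs about the
same as the one benchmarked, and that Spark has no separate load step since
it reads straight from S3; both are simplifications, and the function names
are just illustrative.

```python
import math

# Approximate timings quoted above, in seconds.
# Spark's load time is assumed to be 0 because it queries S3 directly.
SYSTEMS = {
    "spark": {"load": 0, "query": 600},
    "impala": {"load": 6000, "query": 250},
    "redshift": {"load": 4000, "query": 5},
}


def total_seconds(name, n_queries):
    """Load time plus n_queries runs of the benchmarked query."""
    s = SYSTEMS[name]
    return s["load"] + n_queries * s["query"]


def break_even_queries(name, baseline="spark"):
    """Smallest query count at which `name` beats `baseline` overall."""
    s, b = SYSTEMS[name], SYSTEMS[baseline]
    extra_load = s["load"] - b["load"]
    saved_per_query = b["query"] - s["query"]
    return math.floor(extra_load / saved_per_query) + 1


if __name__ == "__main__":
    for name in ("impala", "redshift"):
        print(name, "beats spark from query", break_even_queries(name))
```

On these numbers Impala only comes out ahead of Spark from the 18th query
onward, while Redshift does so from the 7th.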
