I agree with Koert and Reynold: Spark works well with large datasets now.

Back to the original discussion: comparing SparkSQL vs Hive on Spark vs the Spark API.

For SparkSQL vs the Spark API, you can simply imagine you are in the
RDBMS world: SparkSQL is pure SQL, and the Spark API is the language
for writing stored procedures.
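
To illustrate, here is a rough Scala sketch of the same aggregation
done both ways (Spark 1.6 style; the "orders" table and its columns
are just made-up examples):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.sum

    val sc = new SparkContext(new SparkConf().setAppName("sql-vs-api"))
    val sqlContext = new HiveContext(sc)

    // SparkSQL: pure SQL, as if you were in an RDBMS
    val bySql = sqlContext.sql(
      "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer")

    // Spark API: the "stored procedure" style, same result
    val byApi = sqlContext.table("orders")
      .groupBy("customer")
      .agg(sum("amount").as("total"))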

Hive on Spark is similar to SparkSQL: it is a pure SQL interface that
uses Spark as the execution engine. SparkSQL uses Hive's syntax, so as
a language, I would say they are almost the same.
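
For example, switching Hive over to the Spark engine is just a session
setting; a rough sketch over JDBC (host, user and table names are
placeholders):

    import java.sql.DriverManager

    // Connect to a (placeholder) HiveServer2 and run a query on Spark
    // by switching Hive's execution engine for this session.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hs2-host:10000/default", "etl_user", "")
    val stmt = conn.createStatement()
    stmt.execute("SET hive.execution.engine=spark")
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")
    while (rs.next()) println(rs.getLong(1))
    conn.close()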

But Hive on Spark has much better support for Hive features,
especially HiveServer2 and the security features. The Hive features in
SparkSQL are really buggy; there is a HiveServer2 implementation in
SparkSQL, but in the latest release version (1.6.x) it no longer works
with the hivevar and hiveconf arguments, and the username for login
via JDBC doesn't work either...
see https://issues.apache.org/jira/browse/SPARK-13983
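
To make it concrete, this is the kind of thing that breaks (a sketch
only; in a hive2 JDBC URL, hiveconf settings go after '?' and hivevar
definitions after '#', and the names here are placeholders):

    import java.sql.DriverManager

    // Against Hive's own HiveServer2, ${tbl} below is substituted from
    // the hivevar; against the 1.6.x SparkSQL thrift server it
    // reportedly is not (see SPARK-13983).
    val url = "jdbc:hive2://hs2-host:10000/default" +
      "?hive.exec.parallel=true" + // hiveconf
      "#tbl=orders"                // hivevar
    val conn = DriverManager.getConnection(url, "etl_user", "")
    val rs = conn.createStatement()
      // ${tbl} is Hive variable substitution, not Scala interpolation
      .executeQuery("SELECT COUNT(*) FROM ${tbl}")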

I believe Hive support in the Spark project is really very low priority stuff...

Sadly, Hive on Spark integration is not that easy; there are a lot of
dependency conflicts... such as
https://issues.apache.org/jira/browse/HIVE-13301

Our requirement is to use Spark with HiveServer2 in a secure way (with
authentication and authorization). Currently SparkSQL alone cannot
provide this, so we are using Ranger/Sentry + Hive on Spark.
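
For reference, the client side of that setup is just a Kerberized
HiveServer2 JDBC connection, roughly like this (the principal and host
are placeholders for your environment):

    import java.sql.DriverManager

    // Requires a valid Kerberos ticket (kinit) in the client JVM;
    // authorization is then enforced server-side by Ranger/Sentry.
    val url = "jdbc:hive2://hs2-host:10000/default;" +
      "principal=hive/_HOST@EXAMPLE.COM"
    val conn = DriverManager.getConnection(url)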

Hope this helps you get a better idea of which direction you should go.

Cheers,

Teng


2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
> We do disk-to-disk iterative algorithms in spark all the time, on datasets
> that do not fit in memory, and it works well for us. I usually have to do
> some tuning of number of partitions for a new dataset but that's about it in
> terms of inconveniences.
>
> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>
>
> Spark can handle this, true, but it is optimized for the idea that it works
> on the same full dataset in memory, due to the underlying nature of
> machine learning algorithms (iterative). Of course, you can spill over, but
> you should avoid that.
>
> That being said, you should have read my final sentence about this. Both
> systems develop and change.
>
>
> On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
>
>
> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Spark is more for machine learning, working iteratively over the same whole
>> dataset in memory. Additionally it has streaming and graph processing
>> capabilities that can be used together.
>
>
> Hi Jörn,
>
> The first part is actually not true. Spark can handle data far greater than
> the aggregate memory available on a cluster. The more recent versions (1.3+)
> of Spark have external operations for almost all built-in operators, and
> while things may not be perfect, those external operators are becoming more
> and more robust with each version of Spark.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
