Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

---------- Forwarded message ----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: 27 May 2016 at 17:09
Subject: Re: Pros and Cons
To: Teng Qiu <teng...@gmail.com>
Cc: Ted Yu <yuzhih...@gmail.com>, Koert Kuipers <ko...@tresata.com>,
Jörn Franke <jornfra...@gmail.com>, user <user@spark.apache.org>,
Aakash Basu <raj2coo...@gmail.com>, Reynold Xin <r...@databricks.com>

Not worth spending time on, really. The only version that works is Spark
1.3.1 with Hive 2.

To be perfectly honest, deploying Spark as the Hive engine only requires
certain Spark capabilities, and I think 1.3.1 is OK with those. Remember,
we are talking about the engine, not HQL etc.; that is all provided by
Hive itself.

To me, a performance gain of at least 2x compared to MR is perfectly
acceptable.

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 27 May 2016 at 16:58, Teng Qiu <teng...@gmail.com> wrote:
> Tried the spark 2.0.0 preview, but no assembly jar there... then just
> gave up... :p
>
> 2016-05-27 17:39 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> > Teng:
> > Why not try out the 2.0 SNAPSHOT build?
> >
> > Thanks
> >
> >> On May 27, 2016, at 7:44 AM, Teng Qiu <teng...@gmail.com> wrote:
> >>
> >> Ah yes, the version question is another mess!... No vendor's product.
> >>
> >> I tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1: doesn't work.
> >>
> >> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1 works, but this needs to
> >> be fixed on the hive side:
> >> https://issues.apache.org/jira/browse/HIVE-13301
> >>
> >> The jackson-databind lib from calcite-avatica.jar is too old.
> >>
> >> Will try hadoop 2.7, hive 2.0.1 and spark 2.0.0 once spark 2.0.0 is
> >> released.
> >>
> >> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
> >>> Hi Teng,
> >>>
> >>> What version of Spark are you using as the execution engine? Are you
> >>> using a vendor's product here?
> >>>
> >>> Thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>> On 27 May 2016 at 13:05, Teng Qiu <teng...@gmail.com> wrote:
> >>>>
> >>>> I agree with Koert and Reynold, Spark works well with large datasets
> >>>> now.
> >>>>
> >>>> Back to the original discussion: comparing SparkSQL vs Hive on Spark
> >>>> vs the Spark API.
> >>>>
> >>>> SparkSQL vs Spark API: you can simply imagine you are in the RDBMS
> >>>> world. SparkSQL is pure SQL, and the Spark API is the language for
> >>>> writing stored procedures.
> >>>>
> >>>> Hive on Spark is similar to SparkSQL: it is a pure SQL interface
> >>>> that uses Spark as the execution engine. SparkSQL uses Hive's
> >>>> syntax, so as a language, I would say they are almost the same.
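A minimal sketch of the contrast Teng draws here, written against the
Spark 1.6-era HiveContext this thread revolves around. The sales table,
its columns and the app name are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.sum

    object SqlVsApi {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-vs-api"))
        val hc = new HiveContext(sc) // 1.6-era entry point, backed by the Hive metastore

        // The "pure SQL" side: exactly what you would type into beeline.
        val bySql = hc.sql(
          "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

        // The "stored procedure" side: the same query built programmatically.
        val byApi = hc.table("sales")
          .groupBy("region")
          .agg(sum("amount").as("total"))

        bySql.show()
        byApi.show()
      }
    }

Both forms compile to the same plan and run on the same engine, which is
why, as a language, they are almost the same; the differences Teng lists
next concern the server and security layer around the engine, not the
engine itself.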
> >>>>
> >>>> But Hive on Spark has much better support for Hive features,
> >>>> especially hiveserver2 and the security features. The Hive features
> >>>> in SparkSQL are really buggy. There is a hiveserver2 implementation
> >>>> in SparkSQL, but in the latest release version (1.6.x), hiveserver2
> >>>> in SparkSQL doesn't work with the hivevar and hiveconf arguments
> >>>> anymore, and the username for login via JDBC doesn't work either...
> >>>> See https://issues.apache.org/jira/browse/SPARK-13983
> >>>>
> >>>> I believe Hive support in the Spark project is really very
> >>>> low-priority stuff...
> >>>>
> >>>> Sadly, Hive on Spark integration is not that easy; there are a lot
> >>>> of dependency conflicts... such as
> >>>> https://issues.apache.org/jira/browse/HIVE-13301
> >>>>
> >>>> Our requirement is to use Spark with hiveserver2 in a secure way
> >>>> (with authentication and authorization). Currently SparkSQL alone
> >>>> cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
> >>>>
> >>>> Hope this helps you get a better idea of which direction you should
> >>>> go.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Teng
> >>>>
> >>>>
> >>>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
> >>>>> We do disk-to-disk iterative algorithms in Spark all the time, on
> >>>>> datasets that do not fit in memory, and it works well for us. I
> >>>>> usually have to do some tuning of the number of partitions for a
> >>>>> new dataset, but that's about it in terms of inconveniences.
> >>>>>
> >>>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
> >>>>>
> >>>>> Spark can handle this, true, but it is optimized for the idea that
> >>>>> it works on the same full dataset in memory, due to the underlying
> >>>>> (iterative) nature of machine learning algorithms. Of course, you
> >>>>> can spill over, but you should avoid that.
> >>>>>
> >>>>> That being said, you should have read my final sentence about this.
> >>>>> Both systems develop and change.
> >>>>>
> >>>>>
> >>>>> On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
> >>>>>
> >>>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com>
> >>>>> wrote:
> >>>>>> Spark is more for machine learning, working iteratively over the
> >>>>>> same whole dataset in memory. Additionally it has streaming and
> >>>>>> graph processing capabilities that can be used together.
> >>>>>
> >>>>> Hi Jörn,
> >>>>>
> >>>>> The first part is actually not true. Spark can handle data far
> >>>>> greater than the aggregate memory available on a cluster. The more
> >>>>> recent versions (1.3+) of Spark have external operations for almost
> >>>>> all built-in operators, and while things may not be perfect, those
> >>>>> external operators are becoming more and more robust with each
> >>>>> version of Spark.
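A footnote on Koert's point about disk-to-disk iteration: a minimal
sketch of the shape he describes, against the same 1.6-era API. The
paths, record type, update step and iteration count are all invented;
the partition count is the per-dataset knob he mentions tuning (more
partitions mean smaller per-task state, so tasks spill less when the
data does not fit in memory, which is also where the external operators
Reynold mentions come in):

    import org.apache.spark.{SparkConf, SparkContext}

    object IterateOnDisk {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterate-on-disk"))
        val numPartitions = 2048 // hypothetical; tuned per dataset

        // Seed iteration 0 from the raw input (tab-separated key/value).
        sc.textFile("hdfs:///data/input")
          .map { line => val f = line.split('\t'); (f(0), f(1).toDouble) }
          .saveAsObjectFile("hdfs:///tmp/iter-0")

        // Each pass reads the previous result from disk and writes the
        // next one back to disk, so no single iteration ever needs the
        // whole dataset to be resident in memory.
        for (i <- 1 to 10) {
          sc.objectFile[(String, Double)](s"hdfs:///tmp/iter-${i - 1}")
            .mapValues(_ * 0.85)               // stand-in for the real update step
            .reduceByKey(_ + _, numPartitions) // the wide step where partitioning matters
            .saveAsObjectFile(s"hdfs:///tmp/iter-$i")
        }
      }
    }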