Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

---------- Forwarded message ----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: 27 May 2016 at 17:09
Subject: Re: Pros and Cons
To: Teng Qiu <teng...@gmail.com>
Cc: Ted Yu <yuzhih...@gmail.com>, Koert Kuipers <ko...@tresata.com>,
Jörn Franke <jornfra...@gmail.com>, user <user@spark.apache.org>,
Aakash Basu <raj2coo...@gmail.com>, Reynold Xin <r...@databricks.com>

Not worth spending time on, really. The only version that works is Spark
1.3.1 with Hive 2.

To be perfectly honest, deploying Spark as the Hive engine only requires
certain Spark capabilities, and I think 1.3.1 is OK with those. Remember,
we are talking about the engine, not HQL etc.; that is all provided by
Hive itself.

To me, a performance gain of at least 2x compared to MR is perfectly
acceptable.

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 27 May 2016 at 16:58, Teng Qiu <teng...@gmail.com> wrote:
> Tried the spark 2.0.0 preview, but no assembly jar there... then just
> gave up... :p
>
> 2016-05-27 17:39 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> > Teng:
> > Why not try out the 2.0 SNAPSHOT build?
> >
> > Thanks
> >
> >> On May 27, 2016, at 7:44 AM, Teng Qiu <teng...@gmail.com> wrote:
> >>
> >> Ah yes, the version question is another mess!... No vendor's product.
> >>
> >> I tried hadoop 2.6.2, hive 1.2.1 with spark 1.6.1: doesn't work.
> >>
> >> hadoop 2.6.2, hive 2.0.1 with spark 1.6.1 works, but this needs to
> >> be fixed on the hive side:
> >> https://issues.apache.org/jira/browse/HIVE-13301
> >>
> >> The jackson-databind lib from calcite-avatica.jar is too old.
> >>
> >> Will try hadoop 2.7, hive 2.0.1 and spark 2.0.0 once spark 2.0.0 is
> >> released.
> >>
> >> 2016-05-27 16:16 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
> >>> Hi Teng,
> >>>
> >>> What version of Spark are you using as the execution engine? Are you
> >>> using a vendor's product here?
> >>>
> >>> Thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>> On 27 May 2016 at 13:05, Teng Qiu <teng...@gmail.com> wrote:
> >>>>
> >>>> I agree with Koert and Reynold, Spark works well with large datasets
> >>>> now.
> >>>>
> >>>> Back to the original discussion: comparing SparkSQL vs Hive on Spark
> >>>> vs the Spark API.
> >>>>
> >>>> SparkSQL vs Spark API: you can simply imagine you are in the RDBMS
> >>>> world. SparkSQL is pure SQL, and the Spark API is the language for
> >>>> writing stored procedures.
> >>>>
> >>>> Hive on Spark is similar to SparkSQL: it is a pure SQL interface
> >>>> that uses Spark as the execution engine. SparkSQL uses Hive's
> >>>> syntax, so as a language, I would say they are almost the same.
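A minimal sketch of the contrast Teng draws here, written against the
Spark 1.6-era HiveContext this thread revolves around. The sales table,
its columns and the app name are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.sum

    object SqlVsApi {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-vs-api"))
        val hc = new HiveContext(sc) // 1.6-era entry point, backed by the Hive metastore

        // The "pure SQL" side: exactly what you would type into beeline.
        val bySql = hc.sql(
          "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

        // The "stored procedure" side: the same query built programmatically.
        val byApi = hc.table("sales")
          .groupBy("region")
          .agg(sum("amount").as("total"))

        bySql.show()
        byApi.show()
      }
    }

Both forms compile to the same plan and run on the same engine, which is
why, as a language, they are almost the same; the differences Teng lists
next concern the server and security layer around the engine, not the
engine itself.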
> >>>>
> >>>> But Hive on Spark has much better support for Hive features,
> >>>> especially hiveserver2 and the security features. The Hive features
> >>>> in SparkSQL are really buggy. There is a hiveserver2 implementation
> >>>> in SparkSQL, but in the latest release version (1.6.x), hiveserver2
> >>>> in SparkSQL doesn't work with the hivevar and hiveconf arguments
> >>>> anymore, and the username for login via JDBC doesn't work either...
> >>>> See https://issues.apache.org/jira/browse/SPARK-13983
> >>>>
> >>>> I believe Hive support in the Spark project is really very
> >>>> low-priority stuff...
> >>>>
> >>>> Sadly, Hive on Spark integration is not that easy; there are a lot
> >>>> of dependency conflicts... such as
> >>>> https://issues.apache.org/jira/browse/HIVE-13301
> >>>>
> >>>> Our requirement is to use Spark with hiveserver2 in a secure way
> >>>> (with authentication and authorization). Currently SparkSQL alone
> >>>> cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
> >>>>
> >>>> Hope this helps you get a better idea of which direction you should
> >>>> go.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Teng
> >>>>
> >>>>
> >>>> 2016-05-27 2:36 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
> >>>>> We do disk-to-disk iterative algorithms in Spark all the time, on
> >>>>> datasets that do not fit in memory, and it works well for us. I
> >>>>> usually have to do some tuning of the number of partitions for a
> >>>>> new dataset, but that's about it in terms of inconveniences.
> >>>>>
> >>>>> On May 26, 2016 2:07 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
> >>>>>
> >>>>> Spark can handle this, true, but it is optimized for the idea that
> >>>>> it works on the same full dataset in memory, due to the underlying
> >>>>> (iterative) nature of machine learning algorithms. Of course, you
> >>>>> can spill over, but you should avoid that.
> >>>>>
> >>>>> That being said, you should have read my final sentence about this.
> >>>>> Both systems develop and change.
> >>>>>
> >>>>>
> >>>>> On 25 May 2016, at 22:14, Reynold Xin <r...@databricks.com> wrote:
> >>>>>
> >>>>> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke <jornfra...@gmail.com>
> >>>>> wrote:
> >>>>>> Spark is more for machine learning, working iteratively over the
> >>>>>> same whole dataset in memory. Additionally it has streaming and
> >>>>>> graph processing capabilities that can be used together.
> >>>>>
> >>>>> Hi Jörn,
> >>>>>
> >>>>> The first part is actually not true. Spark can handle data far
> >>>>> greater than the aggregate memory available on a cluster. The more
> >>>>> recent versions (1.3+) of Spark have external operations for almost
> >>>>> all built-in operators, and while things may not be perfect, those
> >>>>> external operators are becoming more and more robust with each
> >>>>> version of Spark.
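A footnote on Koert's point about disk-to-disk iteration: a minimal
sketch of the shape he describes, against the same 1.6-era API. The
paths, record type, update step and iteration count are all invented;
the partition count is the per-dataset knob he mentions tuning (more
partitions mean smaller per-task state, so tasks spill less when the
data does not fit in memory, which is also where the external operators
Reynold mentions come in):

    import org.apache.spark.{SparkConf, SparkContext}

    object IterateOnDisk {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterate-on-disk"))
        val numPartitions = 2048 // hypothetical; tuned per dataset

        // Seed iteration 0 from the raw input (tab-separated key/value).
        sc.textFile("hdfs:///data/input")
          .map { line => val f = line.split('\t'); (f(0), f(1).toDouble) }
          .saveAsObjectFile("hdfs:///tmp/iter-0")

        // Each pass reads the previous result from disk and writes the
        // next one back to disk, so no single iteration ever needs the
        // whole dataset to be resident in memory.
        for (i <- 1 to 10) {
          sc.objectFile[(String, Double)](s"hdfs:///tmp/iter-${i - 1}")
            .mapValues(_ * 0.85)               // stand-in for the real update step
            .reduceByKey(_ + _, numPartitions) // the wide step where partitioning matters
            .saveAsObjectFile(s"hdfs:///tmp/iter-$i")
        }
      }
    }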