Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread vincent gromakowski
Bare metal servers with 2 dedicated clusters (Spark and Cassandra) versus 1 cluster with colocation. In both cases a 10 Gbps dedicated network. On Sat, Apr 14, 2018 at 23:17, Mich Talebzadeh wrote: > Thanks Vincent. You mean 20 times improvement with data being local as > opposed to Spark runnin

Spark-ML : Streaming library for Factorization Machine (FM/FFM)

2018-04-14 Thread Sundeep Kumar Mehta
Hi All, Is there any library or GitHub project for using a factorization machine or field-aware factorization machine via online learning for continuous training? Please share your thoughts on this. Regards, Sundeep

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-14 Thread Gourav Sengupta
Hi, if you start Spark or PySpark from the command line, add the option --jars, and see that things are working fine, then it means that you will have to add the jar either to SPARK_HOME jars file or modify the spark-env file to include the path pointing to the location where the jar file is st

Re: Does partition by and order by works only in stateful case?

2018-04-14 Thread Gourav Sengupta
Hi, My sincere apologies for adding my question to this chain. For some reason, I am unable to see the messages which I write to the group ever appear back in it and I think that this might be related in a way that shows a few differences between traditional operations and Spark Streaming operatio

Re: Does partition by and order by works only in stateful case?

2018-04-14 Thread kant kodali
Got it! Thanks. On Thu, Apr 12, 2018 at 7:53 PM, Tathagata Das wrote: > The traditional SQL windows with `over` are not supported in streaming. > Only time-based windows, that is, `window("timestamp", "10 minutes")`, are > supported in streaming. > > On Thu, Apr 12, 2018 at 7:34 PM, kant kodali wr
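
For readers unfamiliar with time-based windows, the following is a minimal plain-Python sketch of what a 10-minute tumbling window does to a set of events. It does not use Spark; the function names `window_start` and `sum_by_window` and the sample rows are hypothetical, and aligning windows to the Unix epoch is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption of this sketch: windows are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, width):
    """Return the start of the tumbling window of size `width` containing `ts`."""
    return ts - ((ts - EPOCH) % width)


def sum_by_window(rows, width):
    """Sum `amount` per (key, window start) for rows of (key, amount, timestamp)."""
    totals = defaultdict(int)
    for key, amount, ts in rows:
        totals[(key, window_start(ts, width))] += amount
    return dict(totals)


# Hypothetical sample events: (id, amount, event time in UTC).
rows = [
    (1, 5, datetime(2018, 4, 1, 1, 0, tzinfo=timezone.utc)),
    (1, 10, datetime(2018, 4, 1, 1, 7, tzinfo=timezone.utc)),
    (1, 20, datetime(2018, 4, 1, 1, 12, tzinfo=timezone.utc)),
]

for (key, start), total in sorted(sum_by_window(rows, timedelta(minutes=10)).items()):
    print(key, start.isoformat(), total)
```

The first two events fall into the window starting at 01:00 and the third into the window starting at 01:10, so each event is assigned to a bucket by its timestamp alone, independent of neighbouring rows.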

Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread Mich Talebzadeh
Thanks Vincent. You mean 20 times improvement with data being local as opposed to Spark running on compute nodes? Dr Mich Talebzadeh

when can we expect multiple aggregations to be supported in spark structured streaming?

2018-04-14 Thread kant kodali
Hi All, when can we expect multiple aggregations to be supported in spark structured streaming? For example,

id | amount | my_timestamp
--
1 | 5 | 2018-04-01T01:00:00.000Z
1 | 10 | 2018-04-01T01:10:00.000Z
2 | 20

Re: Performance of Spark when the compute and storage are separated

2018-04-14 Thread vincent gromakowski
Not with Hadoop but with Cassandra, I have seen a 20x data locality improvement on partitioned optimized Spark jobs. On Sat, Apr 14, 2018 at 21:17, Mich Talebzadeh wrote: > Hi, > > This is a sort of your mileage varies type question. > > In a classic Hadoop cluster, one has data locality when eac

Performance of Spark when the compute and storage are separated

2018-04-14 Thread Mich Talebzadeh
Hi, This is a "your mileage may vary" type of question. In a classic Hadoop cluster, one has data locality when each node includes the Spark libraries and HDFS data. This helps certain queries like interactive BI. However, running Spark over remote storage, say Isilon scale-out NAS instead of LO

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-14 Thread Jason Boorn
Ok great I’ll give that a shot - Thanks for all the help > On Apr 14, 2018, at 12:08 PM, Gene Pang wrote: > > Yes, I think that is the case. I haven't tried that before, but it should > work. > > Thanks, > Gene > > On Fri, Apr 13, 2018 at 11:32 AM, Jason Boorn > wro

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-14 Thread Gene Pang
Yes, I think that is the case. I haven't tried that before, but it should work. Thanks, Gene On Fri, Apr 13, 2018 at 11:32 AM, Jason Boorn wrote: > Hi Gene - > > Are you saying that I just need to figure out how to get the Alluxio jar > into the classpath of my parent application? If it shows