Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Mayur Rustagi
Another option could be to use a sketch to get an approximate median (extendable to quantiles as well) for a large number of tasks. The sketch would give an accurate value when tasks are few; for larger task counts the benefit will be good. Regards, Mayur Rustagi Ph: +1 (650) 937 9673 http://www.sigmoid.com
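A minimal sketch of the idea in the message above, assuming a reservoir-sampling sketch stands in for whatever sketch the author had in mind (the thread does not name one). The class name `ReservoirMedianSketch` and its `capacity` and `seed` parameters are hypothetical and are not part of Spark. While the number of values is at or below `capacity`, every value is retained, so the median is exact; beyond that it is an estimate from a uniform sample.

```python
import random
from typing import List


class ReservoirMedianSketch:
    """Hypothetical approximate-median sketch using reservoir sampling (Algorithm R)."""

    def __init__(self, capacity: int = 1024, seed: int = 0) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.count = 0                    # total number of values seen
        self.sample: List[float] = []     # retained values, at most `capacity`
        self._rng = random.Random(seed)

    def add(self, value: float) -> None:
        """Record one value, e.g. one task's metric."""
        self.count += 1
        if len(self.sample) < self.capacity:
            # Few values so far: keep everything, so quantiles stay exact.
            self.sample.append(value)
            return
        # Otherwise keep the new value with probability capacity / count,
        # overwriting a uniformly chosen retained value.
        slot = self._rng.randrange(self.count)
        if slot < self.capacity:
            self.sample[slot] = value

    def quantile(self, q: float) -> float:
        """Return a nearest-rank quantile of the retained values, 0 <= q <= 1."""
        if not self.sample:
            raise ValueError("no values added")
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be between 0 and 1")
        ordered = sorted(self.sample)
        index = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[index]

    def median(self) -> float:
        """Return the median (the upper median when the count is even)."""
        return self.quantile(0.5)
```

Memory is bounded by `capacity` regardless of the number of tasks, which is the trade-off the message points at: exact answers for small task counts, bounded cost for large ones.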

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-07 Thread Mayur Rustagi
> > We should take a vector instead giving the user flexibility to decide > data source/ type What do you mean by vector datatype exactly? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi <https://twitter.com/mayur_rustagi> On Wed, Nov 5,

Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
hesh Kalakoti (Sigmoid Analytics) Not to mention Spark & Pig communities. Regards Mayur Rustagi

Re: Akka usage in Spark

2014-08-21 Thread Mayur Rustagi
looking to use them as a local/distributed message bus On Thu, Aug 21, 2014 at 4:04 AM, Debasish Das wrote: > Yeah that's the one we discussed...sorry I pointed

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread Mayur Rustagi
Interesting. Clickstream data would have its own window concept based on the user's session. I can imagine windows would change across streams, but wouldn't they largely be domain-specific in nature?

Re: balancing RDDs

2014-06-26 Thread Mayur Rustagi
input partition across nodes & scheduling preference of task related to unbalanced partition to different nodes. I am not sure if an RDD can influence the location of tasks or partitions.

Google Cloud Engine adds out of the box Spark/Shark support

2014-06-26 Thread Mayur Rustagi
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE I suspect they are using their own builds. Has anybody had a chance to look at it?

Re: balancing RDDs

2014-06-24 Thread Mayur Rustagi
This would be really useful, especially for Shark, where a shift in partitioning affects all subsequent queries unless task scheduling time beats spark.locality.wait. This can cause low overall performance for all subsequent tasks.

Re: Checkpointed RDD still causing StackOverflow

2014-06-24 Thread Mayur Rustagi
Do not call collect, as that will perform materialization as well as transfer the data to the driver (which might actually cause the driver to fail if the data is huge). You have to materialize the RDD in some way (call save, count, or collect).

Fwd: Monitoring / Instrumenting jobs in 1.0

2014-05-31 Thread Mayur Rustagi
We have a JSON feed of the Spark application interface that we use for easier instrumentation & monitoring. Has that been considered/found relevant? It was already sent as a pull request against 0.9.0; would that work, or should we update it to 1.0.0?

Re: Better option to use Querying in Spark

2014-05-05 Thread Mayur Rustagi
usecase directly. On Tue, May 6, 2014 at 11:22 AM, prabeesh k wrote: > Hi, > > I have seen three different ways to query data from Spark > >1. Default S

Re: Spark on wikipedia dataset

2014-04-23 Thread Mayur Rustagi
Huge joins would be interesting. I do all my demos for Shark on the Wikipedia dataset. Joins are a typical pain to showcase & show off :) On Wed, Apr 23, 2014 at 10:33 AM,

Re: Building Spark AMI

2014-04-11 Thread Mayur Rustagi
I am creating a fully configured & synced one, but you still need to send over the configuration. Do you plan to use Chef for that? On Apr 10, 2014 6:58 PM, "Jim Ancona" wrote: > Are there scripts to build the AMI used by the spark-ec2 script? > > Alternatively, is there a place to download the A

Re: Custom RDD

2014-03-10 Thread Mayur Rustagi
copy paste? On Mon, Mar 10, 2014 at 12:30 PM, David Thomas wrote: > Is there any guide available on creating a custom RDD? >

Re: when run the same job, time that spark used is very diffrent from shark.

2014-03-07 Thread Mayur Rustagi
contains & returns data back to driver. On Thu, Mar 6, 2014 at 7:39 PM, qingyang li wrote: > *Hi, community, I have setup 3 nodes spark cluster using standalone m

Re: special case of custom partitioning

2014-03-06 Thread Mayur Rustagi
How about PartitionerAwareUnionRDD? Regards Mayur On Thu, Mar 6, 2014 at 9:42 AM, Evan Chan wrote: > I would love to hear the answer to this as well. > >