Another option could be to use a sketch to get an approximate median
(extendable to quantiles as well) for a large number of tasks. The sketch
would give an accurate value when tasks are few; for a larger number of
tasks the benefit will be good.
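To make the idea concrete, here is a minimal sketch in plain Python. It uses reservoir sampling as one simple kind of sketch, which is my assumption and not necessarily the structure meant above. The class name, the capacity parameter and the nearest-rank quantile rule are all hypothetical choices for illustration.

```python
import random


class ReservoirQuantileSketch:
    """Fixed-size uniform sample of a stream, used to estimate quantiles.

    Hypothetical illustration: not part of Spark or any library.
    """

    def __init__(self, capacity=1000, seed=None):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, value):
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Still below capacity: keep everything, so results stay exact.
            self.sample.append(value)
        else:
            # Reservoir sampling (Algorithm R): the new value replaces a
            # random slot with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def quantile(self, q):
        if not self.sample:
            raise ValueError("no values added")
        ordered = sorted(self.sample)
        # Simple rank-based pick, clamped to the last element for q = 1.0.
        idx = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[idx]

    def median(self):
        return self.quantile(0.5)
```

With only a few values (say task durations 5, 1, 3, 2, 4) nothing is discarded and median() returns the exact value 3. Once the number of values exceeds the capacity, memory stays bounded and the answer becomes an estimate.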
Regards,
Mayur Rustagi
Ph: +1 (650) 937 9673
http://www.sigmoid.com <h
>
> We should take a vector instead, giving the user flexibility to decide
> data source/type
What do you mean by vector datatype exactly?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>
On Wed, Nov 5,
hesh Kalakoti (Sigmoid Analytics)
Not to mention Spark & Pig communities.
Regards
Mayur Rustagi
looking to use them as a local/distributed message bus
Mayur Rustagi
On Thu, Aug 21, 2014 at 4:04 AM, Debasish Das
wrote:
> Yeah that's the one we discussed...sorry I pointed
Interesting. Clickstream data would have its own window concept based on
the user's session. I can imagine windows would change across streams, but
wouldn't they largely be domain-specific in nature?
Mayur Rustagi
input partition across
nodes & scheduling preference of tasks related to unbalanced partitions to
different nodes. I am not sure if an RDD can influence task
location/partition location.
Mayur Rustagi
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE
I suspect they are using their own builds. Has anybody had a chance to look
at it?
Mayur Rustagi
This would be really useful, especially for Shark, where a shift of
partitioning affects all subsequent queries unless task scheduling time
beats spark.locality.wait. This can cause overall low performance for all
subsequent tasks.
Mayur Rustagi
Do not call collect, as that will perform materialization as well as
transfer the data to the driver (it might actually cause the driver to fail
if the data is huge). You have to materialize the RDD in some way (call
save, count, collect).
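The difference between the actions can be shown with a toy model in plain Python. This is not Spark's API: LazyDataset and its methods are hypothetical stand-ins that only mimic the idea that transformations are lazy, that count forces computation while returning a single number, and that collect forces computation and also brings every element back to the caller.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed.

    Hypothetical illustration only; not Spark code.
    """

    def __init__(self, source, transforms=()):
        self._source = source
        self._transforms = tuple(transforms)

    def map(self, fn):
        # Lazy: only remembers the function, nothing runs yet.
        return LazyDataset(self._source, self._transforms + (fn,))

    def _compute(self):
        # Apply every recorded transformation, one element at a time.
        for item in self._source:
            for fn in self._transforms:
                item = fn(item)
            yield item

    def count(self):
        # Computes every element but returns only one number.
        return sum(1 for _ in self._compute())

    def collect(self):
        # Computes every element and holds all of them in memory at once.
        return list(self._compute())


doubled = LazyDataset(range(5)).map(lambda x: x * 2)
n = doubled.count()       # forces the computation, result is a single int
rows = doubled.collect()  # forces it again and keeps the full list
```

In this toy, count plays the role of the cheap way to force evaluation, while collect is the call whose result grows with the data.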
Mayur Rustagi
We have a JSON feed of the Spark application interface that we use for
easier instrumentation & monitoring. Has that been considered/found
relevant? It was already sent as a pull request to 0.9.0; would that work,
or should we update it to 1.0.0?
Mayur Rustagi
usecase directly.
Mayur Rustagi
On Tue, May 6, 2014 at 11:22 AM, prabeesh k wrote:
> Hi,
>
> I have seen three different ways to query data from Spark
>
>1. Default S
Huge joins would be interesting. I do all my demos on the Wikipedia dataset
for Shark. Joins are typically a pain to showcase & show off :)
Mayur Rustagi
On Wed, Apr 23, 2014 at 10:33 AM,
I am creating one fully configured & synced one. But you still need to send
over configuration. Do you plan to use Chef for that?
On Apr 10, 2014 6:58 PM, "Jim Ancona" wrote:
> Are there scripts to build the AMI used by the spark-ec2 script?
>
> Alternatively, is there a place to download the A
copy paste?
Mayur Rustagi
On Mon, Mar 10, 2014 at 12:30 PM, David Thomas wrote:
> Is there any guide available on creating a custom RDD?
>
contains & returns data
back to driver.
Mayur Rustagi
On Thu, Mar 6, 2014 at 7:39 PM, qingyang li wrote:
> *Hi, community, I have setup 3 nodes spark cluster using standalone m
How about PartitionerAwareUnionRDD?
Regards
Mayur
Mayur Rustagi
On Thu, Mar 6, 2014 at 9:42 AM, Evan Chan wrote:
> I would love to hear the answer to this as well.
>
>