Just to add to that, DStream.transform allows you to apply an arbitrary
RDD-to-RDD function. Inside that you can do iterative RDD operations as
well.
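For reference, a minimal sketch of that pattern (the socket source, batch interval, and the particular update inside the loop are placeholders, not from the thread):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("transform-sketch"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

// transform takes an arbitrary RDD-to-RDD function per batch;
// iterative RDD operations are fine inside it.
val refined = lines.transform { rdd: RDD[String] =>
  var current = rdd.map(_.length.toDouble)
  for (_ <- 1 to 5) {                      // a few refinement passes per batch
    current = current.map(v => (v + 1.0) / 2.0)
  }
  current
}

refined.print()
ssc.start()
ssc.awaitTermination()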
On Thu, Apr 2, 2015 at 6:27 AM, Sean Owen wrote:
> You can have diamonds but not cycles in the dependency graph.
>
> But what you are describing really
Yeah, thanks Alex!
On Thu, Apr 2, 2015 at 5:05 PM, Xiangrui Meng wrote:
> This is great! Thanks! -Xiangrui
>
> On Wed, Apr 1, 2015 at 12:11 PM, Ulanov, Alexander
> wrote:
> > FYI, I've added instructions to Netlib-java wiki, Sam added the link to
> them from the project's readme.md
> > https://
This is great! Thanks! -Xiangrui
On Wed, Apr 1, 2015 at 12:11 PM, Ulanov, Alexander
wrote:
> FYI, I've added instructions to Netlib-java wiki, Sam added the link to them
> from the project's readme.md
> https://github.com/fommil/netlib-java/wiki/NVBLAS
>
> Best regards, Alexander
Hi Shivaram,
It sounds really interesting! With this timing we can estimate whether it is
worth running an iterative algorithm on Spark at all. For example, for SGD on
ImageNet (450K samples) we would spend 450K*50ms = 6.25 hours to traverse all the
data one example at a time, not counting the data loading, comp
I haven't looked closely at the sampling issues, but regarding the
aggregation latency, there are fixed overheads (in local and distributed
mode) with the way aggregation is done in Spark. Launching a stage of
tasks, fetching outputs from the previous stage etc. all have overhead, so
I would say it
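A rough way to see that fixed cost (an illustrative sketch, not from the thread; the numbers will vary by mode and cluster) is to time an aggregation whose actual computation is negligible, so what remains is mostly scheduling and task-launch overhead:

val tiny = sc.parallelize(1 to 1000, 8).cache()
tiny.count()                                   // materialize the cache first

val start = System.nanoTime()
tiny.map(_.toDouble).reduce(_ + _)             // trivially cheap work per task
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"one aggregation round trip took ~$elapsedMs ms")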
cool. FYI, i'm at databricks today and talked w/patrick, josh and davies
about this. we have some great ideas to actually make this happen and will
be pushing over the next few weeks to get it done. :)
On Thu, Apr 2, 2015 at 9:21 AM, Nicholas Chammas wrote:
> (Renaming thread so as to un-hija
When you say "It seems that instead of sample it is better to shuffle data
and then access it sequentially by mini-batches," are you sure that holds
true for a big dataset in a cluster? As far as implementing it, I haven't
looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
bu
On Thu, Apr 2, 2015 at 3:01 AM, Steve Loughran wrote:
>>> That would be really helpful to debug build failures. The scalatest
>>> output isn't all that helpful.
>>>
>
> Potentially an issue with the test runner, rather than the tests themselves.
Sorry, that was me over-generalizing. The output is
Hi Joseph,
Thank you for suggestion!
It seems that instead of sample it is better to shuffle data and then access it
sequentially by mini-batches. Could you suggest how to implement it?
With regard to aggregate (reduce), I am wondering why it runs so slowly in
local mode. Could you elaborate on
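One way to get something like that (a hedged sketch using placeholder names, not necessarily what either of you has in mind) is to split the data once into disjoint random subsets and then iterate over those subsets as mini-batches, instead of calling sample() on every step; caching the parent RDD first avoids rescanning the input for every split.

import org.apache.spark.rdd.RDD

// Split the data once into numBatches disjoint random subsets;
// each subset then serves as one mini-batch.
def asMiniBatches[T](data: RDD[T], numBatches: Int, seed: Long = 42L): Array[RDD[T]] =
  data.randomSplit(Array.fill(numBatches)(1.0), seed)

// Usage sketch:
//   val batches = asMiniBatches(trainingData, numBatches = 100)
//   for (epoch <- 1 to numEpochs; batch <- batches) {
//     // one gradient step computed over `batch`
//   }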
Incidentally, we were discussing this yesterday. Here are some thoughts on
null handling in SQL/DataFrames. Would be great to get some feedback.
1. Treat floating point NaN and null as the same "null" value. This would
be consistent with most SQL databases, and Pandas. This would also require
some
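To illustrate the distinction in point 1 with current behavior (a small sketch; the column name and values are made up): aggregates already skip null per SQL semantics, while NaN propagates through the arithmetic.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val rows = sc.parallelize(Seq(Row(1.0), Row(null), Row(Double.NaN)))
val df = sqlContext.createDataFrame(rows,
  StructType(Seq(StructField("x", DoubleType, nullable = true))))

// The null row is ignored by the aggregate, but NaN poisons the sum,
// so the result is NaN. Under the proposal, NaN would behave like null here.
df.agg(avg("x")).show()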
It looks like SPARK-3250 was applied to the sample() which GradientDescent
uses, and that should kick in for your minibatchFraction <= 0.4. Based on
your numbers, aggregation seems like the main issue, though I hesitate to
optimize aggregation based on local tests for data sizes that small.
The f
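For context, the per-iteration pattern in question looks roughly like this (a simplified sketch, not the actual MLlib source; `data`, `miniBatchFraction`, `i`, and `computeLoss` are placeholders): the sample() call is where the SPARK-3250 gap sampling applies, and the aggregate() is where the per-stage overhead shows up.

val miniBatch = data.sample(withReplacement = false, fraction = miniBatchFraction, seed = 42 + i)
val (lossSum, count) = miniBatch.aggregate((0.0, 0L))(
  seqOp  = { case ((loss, n), example) => (loss + computeLoss(example), n + 1) },
  combOp = { case ((l1, n1), (l2, n2)) => (l1 + l2, n1 + n2) }
)
// weights would then be updated from the aggregated gradient / count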
IMO, Spark's config is kind of a mess right now. I completely agree with
Reynold that Spark's handling of config ought to be super-simple; it's not
the kind of thing we want to put much effort into in Spark itself. It sounds so
trivial that everyone wants to redo it, but then all these additional
featu
(Renaming thread so as to un-hijack Marcelo's request.)
Sure, we definitely want tests running faster.
Part of "testing all the things" will be factoring out stuff from the
various builds that can be run just once.
We've also tried in the past (with little success) to parallelize test
execution
i agree with all of this. but can we please break up the tests and make
them shorter? :)
On Thu, Apr 2, 2015 at 8:54 AM, Nicholas Chammas wrote:
> This is secondary to Marcelo’s question, but I wanted to comment on this:
>
> Its main limitation is more cultural than technical: you need to get
This is secondary to Marcelo’s question, but I wanted to comment on this:
Its main limitation is more cultural than technical: you need to get people
to care about intermittent test runs, otherwise you can end up with
failures that nobody keeps on top of
This is a problem that plagues Spark as we
S3n is governed by the same config parameter.
Cheers
> On Apr 2, 2015, at 7:33 AM, Romi Kuntsman wrote:
>
> Hi Ted,
> Not sure what's the config value, I'm using s3n filesystem and not s3.
>
> The error that I get is the following:
> (so does that mean it's 4 retries?)
>
> Caused by: org.
Hi Ted,
Not sure what's the config value, I'm using s3n filesystem and not s3.
The error that I get is the following:
(so does that mean it's 4 retries?)
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost ta
You can have diamonds but not cycles in the dependency graph.
But what you are describing really sounds like simple iteration, since
presumably you mean that the state of each element in the 'cycle'
changes each time, and so isn't really the same element each time, and
eventually you decide to sto
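A small sketch of that iteration pattern (the update step is a placeholder, not from the thread): the lineage stays a DAG because each pass produces a new RDD, and the "cycle" is just a driver-side loop that stops when the state converges.

import org.apache.spark.rdd.RDD

def iterateUntilConverged(initial: RDD[Double], tol: Double, maxIters: Int): RDD[Double] = {
  var state = initial.cache()
  var delta = Double.MaxValue
  var iter = 0
  while (delta > tol && iter < maxIters) {
    val next = state.map(v => 0.5 * (v + 1.0 / math.max(v, 1e-9))).cache()  // placeholder update
    delta = next.zip(state).map { case (a, b) => math.abs(a - b) }.max()
    state = next
    iter += 1
  }
  // for long-running loops, periodic checkpointing keeps the lineage short
  state
}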
I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
Double, Byte, and Boolean look like reference types in source code, but
they are compiled to the corresponding JVM primitive types, which can't be
null. That's why you get the warning about ==.
It might be that your best choice is
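A couple of the usual workarounds, sketched (this is a guess at where that suggestion was heading, not a quote): keep missing values out of the primitive Double entirely, either with Option[Double] or with a NaN sentinel.

// Option[Double]: the type tracks missingness explicitly.
val values: Seq[Option[Double]] = Seq(Some(1.0), Some(2.0), None, Some(3.0), Some(5.0), None)
val present = values.flatten
val mean    = present.sum / present.size        // 2.75
val imputed = values.map(_.getOrElse(mean))     // missing entries replaced by the mean

// Or use Double.NaN as the sentinel and filter it out before aggregating:
val raw   = Seq(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)
val clean = raw.filterNot(_.isNaN)
val mean2 = clean.sum / clean.size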
Take a look at the maven-shade-plugin in pom.xml.
Here is the snippet for org.spark-project.jetty :
<relocation>
  <pattern>org.eclipse.jetty</pattern>
  <shadedPattern>org.spark-project.jetty</shadedPattern>
  <includes>
    <include>org.eclipse.jetty.**</include>
  </includes>
</relocation>
On Thu, Apr 2, 2015 at 3:59 AM, Ni
Hi, I need to implement MeanImputor - impute missing values with the mean. If
I set missing values to null, then DataFrame aggregation works
properly, but in a UDF the null values are treated as 0.0. Here's an example:
val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
df.agg(avg("_1")).first
I didn't find any documentation regarding support for cycles in a Spark
topology, although Storm supports this using manual configuration in the
acker function logic (setting it to a particular count). By cycles I don't
mean infinite loops.
Can anybody please help me with that?
--
Thanks &
Hi,
I am looking for the org.spark-project.jetty and org.spark-project.guava
repo locations but I'm unable to find them in the Maven repository.
Are these publicly available?
rgds
--
Niranda
> On 2 Apr 2015, at 06:31, Patrick Wendell wrote:
>
> Hey Marcelo,
>
> Great question. Right now, some of the more active developers have an
> account that allows them to log into this cluster to inspect logs (we
> copy the logs from each run to a node on that cluster). The
> infrastructure is