Mike,
I believe the reason you're seeing near-identical performance on the
gradient computations is twofold:
1) Gradient computations for GLMs are computationally cheap from a
FLOPs-per-byte-read perspective. They are essentially a BLAS "gemv" call
in the dense case, which is well known to
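The FLOPs-per-byte point can be made concrete with a small sketch (pure Python, illustrative names; this is not MLlib code): the dense GLM gradient is just two gemv-shaped loops, so every matrix entry is read once and used for only a couple of flops, making the kernel memory-bandwidth-bound.

```python
import math
import random

# Illustrative sketch (not MLlib code): the dense logistic-loss gradient is
# essentially two matrix-vector products (BLAS gemv). Each matrix entry is
# read once and used for ~2 flops, so the kernel is memory-bandwidth-bound,
# which is why different gradient variants can show near-identical timings.
def logistic_gradient(X, y, w):
    """grad = (1/n) * X^T (sigmoid(X w) - y), written as two gemv-like loops."""
    n, d = len(X), len(w)
    resid = []
    for i in range(n):                       # gemv #1: margins = X w
        margin = sum(X[i][j] * w[j] for j in range(d))
        resid.append(1.0 / (1.0 + math.exp(-margin)) - y[i])
    grad = [0.0] * d
    for i in range(n):                       # gemv #2: grad = X^T resid
        for j in range(d):
            grad[j] += X[i][j] * resid[i]
    return [g / n for g in grad]

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [1.0 if row[0] > 0 else 0.0 for row in X]
print(len(logistic_gradient(X, y, [0.0] * 5)))  # 5
```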
Hello Devs,
This email concerns some timing results for a treeAggregate in
computing a (stochastic) gradient over an RDD of labelled points, as
is currently done in the MLlib optimization routine for SGD.
In SGD, the underlying RDD is downsampled by a fraction f \in (0,1],
and the subgradients ov
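Conceptually, the sampling step described above looks like the following sketch (function names are illustrative, not Spark's API): each iteration Bernoulli-samples the points with fraction f and averages subgradients over the sample only.

```python
import random

# Sketch of what MLlib's minibatch SGD step does conceptually (names here
# are illustrative, not Spark's API): Bernoulli-sample the data with
# fraction f in (0, 1] and aggregate subgradients over the sample only.
def minibatch_subgradient(points, w, f, grad_fn, seed=0):
    rng = random.Random(seed)
    total = [0.0] * len(w)
    count = 0
    for x, y in points:
        if rng.random() < f:                 # analogous to sample(False, f)
            g = grad_fn(x, y, w)
            total = [t + gi for t, gi in zip(total, g)]
            count += 1
    # average over the sampled points (guard against an empty sample)
    return ([t / count for t in total], count) if count else (total, 0)

# least-squares subgradient for a single point: (w . x - y) * x
def lsq_grad(x, y, w):
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [err * xj for xj in x]

points = [([1.0, float(i)], 2.0 * i) for i in range(100)]
g, n = minibatch_subgradient(points, [0.0, 0.0], 0.1, lsq_grad)
print(n)  # roughly 10 of the 100 points sampled
```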
Juan, I wouldn't go as far as suggesting we switch from programming using
RDDs to using SparkIterable. For example, all methods involving context,
jobs or partitions should only be part of the RDD API and not part of
SparkIterable. That said, the Spark community would benefit from a
consistent set
Hi Sim,
I understand that what you propose is defining a trait SparkIterable (and
also PairSparkIterable for RDDs of pairs) that encapsulates the methods in
RDDs, and then programming against that trait instead of RDD. That is
similar to programming using scala.collection.GenSeq to abstract from using a
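The GenSeq analogy can be sketched in Python terms (all names here are illustrative): an algorithm written only against a minimal map/filter/reduce interface never mentions the concrete collection, so a local or a distributed implementation can be swapped in.

```python
# Sketch of the proposal, translated to Python (illustrative names): program
# against a minimal "SparkIterable"-like interface instead of a concrete
# collection, the way GenSeq abstracts over Scala's sequence types.
# LocalColl is a stand-in; an RDD-backed class would expose the same methods.
class LocalColl:
    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return LocalColl(f(x) for x in self.items)

    def filter(self, p):
        return LocalColl(x for x in self.items if p(x))

    def reduce(self, op):
        it = iter(self.items)
        acc = next(it)
        for x in it:
            acc = op(acc, x)
        return acc

# Algorithm written only against the interface -- it never names the
# concrete collection, so a distributed implementation could be swapped in.
def sum_of_even_squares(coll):
    return (coll.filter(lambda x: x % 2 == 0)
                .map(lambda x: x * x)
                .reduce(lambda a, b: a + b))

print(sum_of_even_squares(LocalColl(range(10))))  # 0+4+16+36+64 = 120
```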
@debasish83, yes, there are many ways to optimize and work around the
limitation of no nested RDDs. The point of this thread is to discuss the API
patterns of Spark in order to make the platform more accessible to lots of
developers solving interesting problems quickly. We can get API consistency
w
Robin, my point exactly. When an API is valuable, let's expose it in a way
that it can be used easily for all data Spark touches. It should not require
much development work to make the sampling logic work for an Iterable as
opposed to an RDD.
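As a concrete illustration of that point (the function name is hypothetical): Bernoulli sampling needs only a single pass, so it can be written once against a plain iterable and reused locally or per partition on an RDD.

```python
import random

# Illustrative sketch: sampling written once against any iterable. Locally
# it consumes a list or range; on an RDD the same generator could be applied
# per partition, e.g. rdd.mapPartitions(lambda p: bernoulli_sample(p, 0.1)).
def bernoulli_sample(items, fraction, seed=0):
    rng = random.Random(seed)
    for x in items:
        if rng.random() < fraction:
            yield x

local = list(bernoulli_sample(range(10000), 0.1))
print(len(local))  # close to 1000
```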
Juan, thanks for sharing this. I am facing what looks like a similar issue
having to do with variable grouped upsampling (sampling some groups at
different rates, sometimes > 100%). I will study the approach you took.
As for the topic of this thread, I think it is important to separate two
issues:
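The grouped upsampling mentioned above (per-group rates, sometimes above 100%) can be sketched like this (group names and rates are illustrative): emit floor(rate) deterministic copies of each element, plus one more with probability equal to the fractional part.

```python
import random

# Sketch of per-group upsampling where rates may exceed 100%: each element
# yields floor(rate) guaranteed copies plus one extra copy with probability
# rate - floor(rate). Group names and rates below are illustrative.
def upsample_by_group(rows, rates, seed=0):
    rng = random.Random(seed)
    out = []
    for group, value in rows:
        rate = rates[group]
        copies = int(rate) + (1 if rng.random() < rate - int(rate) else 0)
        out.extend((group, value) for _ in range(copies))
    return out

rows = [("a", i) for i in range(100)] + [("b", i) for i in range(100)]
sampled = upsample_by_group(rows, {"a": 0.5, "b": 1.5})
print(len(sampled))  # roughly 200 (expected 0.5*100 + 1.5*100)
```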
Aniket, yes, I've done the separate file trick. :) Still, I think we can
solve this problem without nested RDDs.
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14192.html
Sent from the Apache Spark Developers List mailing list
Thanks everyone for the comments! I waited for more replies to come before I
responded as I was interested in the community's opinion.
The thread I'm noticing in this thread (pun intended) is that most responses
focus on the nested RDD issue. I think we all agree that it is problematic
for many r
RDD nesting can lead to recursive nesting... I would like to know the
use case and why join can't support it. You can always expose an API over an
RDD and access it in another RDD's mapPartitions, or use an external data
source like HBase, Cassandra, or Redis to back the API.
For your case, group by and th
I'm not sure the problem is quite as bad as you state. Both sampleByKey and
sampleByKeyExact are implemented using a function from
StratifiedSamplingUtils, which does one of two things depending on whether
the exact implementation is needed. The exact version requires double the
number of lines of c
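The two code paths can be caricatured as follows (this is an illustrative sketch, not the actual StratifiedSamplingUtils code): the approximate path is a single Bernoulli pass per key, while the exact path does extra work to hit the requested per-key sample size precisely.

```python
import random

# Illustrative sketch of the two paths (NOT the real StratifiedSamplingUtils
# code): approximate stratified sampling is one Bernoulli pass; the "exact"
# variant spends extra work to return exactly round(n_k * f_k) items per key.
def sample_by_key(pairs, fractions, seed=0):
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

def sample_by_key_exact(pairs, fractions, seed=0):
    rng = random.Random(seed)
    by_key = {}
    for k, v in pairs:
        by_key.setdefault(k, []).append(v)
    out = []
    for k, vs in by_key.items():
        n = round(len(vs) * fractions[k])    # exact per-key sample size
        out.extend((k, v) for v in rng.sample(vs, n))
    return out

pairs = [("a", i) for i in range(100)] + [("b", i) for i in range(50)]
print(len(sample_by_key_exact(pairs, {"a": 0.1, "b": 0.2})))  # exactly 20
```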
I agree that this is an issue, but I am afraid supporting RDD nesting would
be hard and would perhaps require rearchitecting Spark. For now, you may
have to use workarounds like storing each group in a separate file,
processing each file as a separate RDD, and finally merging the results into
a single RDD.
I know it's painf
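The separate-file workaround looks roughly like this sketch (plain Python, with a local function standing in for "process one file as one RDD"; all names are illustrative):

```python
import os
import tempfile

# Sketch of the workaround: write each group to its own file, process each
# file independently (per_group_fn stands in for "one RDD per file"), then
# merge the per-group results. Names are illustrative, not a Spark API.
def process_groups(rows, per_group_fn):
    tmpdir = tempfile.mkdtemp()
    paths = {}
    for group, value in rows:                      # 1. one file per group
        path = os.path.join(tmpdir, f"group-{group}.txt")
        with open(path, "a") as f:
            f.write(f"{value}\n")
        paths[group] = path
    results = {}
    for group, path in paths.items():              # 2. process each file
        with open(path) as f:
            values = [float(line) for line in f]
        results[group] = per_group_fn(values)
    return results                                 # 3. merged result

rows = [("a", 1), ("a", 2), ("b", 10)]
print(process_groups(rows, sum))  # {'a': 3.0, 'b': 10.0}
```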
I'm not sure what we can do here. Nested RDDs are a pain to implement,
support, and explain. The programming model is not well explored.
Maybe a UDAF interface that allows going through the data twice?
On Mon, Sep 14, 2015 at 4:36 PM, sim wrote:
> I'd like to get some feedback on an API design
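The "going through the data twice" idea above could look something like this sketch (a hypothetical interface, not an existing Spark API): a first pass computes summary state, and a second pass is parameterized by it, e.g. variance via the mean.

```python
# Hypothetical two-pass aggregation interface (not an existing Spark API):
# pass 1 builds summary state, pass 2 consumes the data again using it.
class TwoPassAggregator:
    def first_pass(self, acc, x): raise NotImplementedError
    def between(self, acc): raise NotImplementedError   # finalize pass 1
    def second_pass(self, acc, state, x): raise NotImplementedError

class Variance(TwoPassAggregator):
    def first_pass(self, acc, x):
        return (acc[0] + x, acc[1] + 1)                  # (sum, count)
    def between(self, acc):
        return (acc[0] / acc[1], acc[1])                 # (mean, n)
    def second_pass(self, acc, state, x):
        return acc + (x - state[0]) ** 2                 # sum of sq. devs

def run_two_pass(data, agg, init1, init2):
    acc = init1
    for x in data:                 # pass 1 over the data
        acc = agg.first_pass(acc, x)
    state = agg.between(acc)
    acc2 = init2
    for x in data:                 # pass 2, parameterized by pass-1 state
        acc2 = agg.second_pass(acc2, state, x)
    return acc2 / state[1]

print(run_two_pass([1.0, 2.0, 3.0, 4.0], Variance(), (0.0, 0), 0.0))  # 1.25
```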