Re: RDD API patterns

2015-09-26 Thread Evan R. Sparks
Mike, I believe the reason you're seeing near-identical performance on the gradient computations is twofold: 1) Gradient computations for GLM models are computationally pretty cheap from a FLOPs/byte read perspective. They are essentially a BLAS "gemv" call in the dense case, which is well known to

Re: RDD API patterns

2015-09-26 Thread Mike Hynes
Hello Devs, This email concerns some timing results for a treeAggregate in computing a (stochastic) gradient over an RDD of labelled points, as is currently done in the MLlib optimization routine for SGD. In SGD, the underlying RDD is downsampled by a fraction f \in (0,1], and the subgradients ov

Re: RDD API patterns

2015-09-19 Thread sim
Juan, I wouldn't go as far as suggesting we switch from programming using RDDs to using SparkIterable. For example, all methods involving context, jobs or partitions should only be part of the RDD API and not part of SparkIterable. That said, the Spark community would benefit from a consistent set

Re: RDD API patterns

2015-09-19 Thread Juan Rodríguez Hortalá
Hi Sim, I understand that what you propose is defining a trait SparkIterable (and also PairSparkIterable for RDDs of pairs) that encapsulates the methods in RDDs, and then programming against that trait instead of RDD. That is similar to programming using scala.collection.GenSeq to abstract from using a

Re: RDD API patterns

2015-09-18 Thread sim
@debasish83, yes, there are many ways to optimize and work around the limitation of no nested RDDs. The point of this thread is to discuss the API patterns of Spark in order to make the platform more accessible to lots of developers solving interesting problems quickly. We can get API consistency w

Re: RDD API patterns

2015-09-18 Thread sim
Robin, my point exactly. When an API is valuable, let's expose it in a way that it may be used easily for all data Spark touches. It should not require much development work to implement the sampling logic to work for an Iterable as opposed to an RDD.

Re: RDD API patterns

2015-09-18 Thread sim
Juan, thanks for sharing this. I am facing what looks like a similar issue having to do with variable grouped upsampling (sampling some groups at different rates, sometimes > 100%). I will study the approach you took. As for the topic of this thread, I think it is important to separate two issues:

Re: RDD API patterns

2015-09-18 Thread sim
Aniket, yes, I've done the separate file trick. :) Still, I think we can solve this problem without nested RDDs.

Re: RDD API patterns

2015-09-18 Thread sim
Thanks everyone for the comments! I waited for more replies to come before I responded as I was interested in the community's opinion. The thread I'm noticing in this thread (pun intended) is that most responses focus on the nested RDD issue. I think we all agree that it is problematic for many r

Re: RDD API patterns

2015-09-17 Thread Debasish Das
RDD nesting can lead to recursive nesting... I would like to know the use case and why join can't support it... You can always expose an API over an RDD and access that in another RDD's mapPartitions... use an external data source like HBase, Cassandra, or Redis to support the API... For your case group by and th

Re: RDD API patterns

2015-09-16 Thread robineast
I'm not sure the problem is quite as bad as you state. Both sampleByKey and sampleByKeyExact are implemented using a function from StratifiedSamplingUtils which does one of two things depending on whether the exact implementation is needed. The exact version requires double the number of lines of c

Re: RDD API patterns

2015-09-16 Thread Juan Rodríguez Hortalá

Re: RDD API patterns

2015-09-16 Thread Aniket
I agree that this is an issue, but I am afraid supporting RDD nesting would be hard and perhaps would need rearchitecting Spark. For now, you may have to use workarounds like storing each group in a separate file, processing each file as a separate RDD, and finally merging the results into a single RDD. I know it's painf

Re: RDD API patterns

2015-09-16 Thread Reynold Xin
I'm not sure what we can do here. Nested RDDs are a pain to implement, support, and explain. The programming model is not well explored. Maybe a UDAF interface that allows going through the data twice? On Mon, Sep 14, 2015 at 4:36 PM, sim wrote: > I'd like to get some feedback on an API design