Re: One element per node

2015-09-18 Thread Reynold Xin
It is nondeterministic because tasks are not always scheduled to the same nodes, so I don't think you can make this deterministic. If you assume no failures and tasks that take a while to run (so they run slower than the scheduler can schedule them), then I think you can make it determin

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
Sounds interesting! Is it possible to make it deterministic by using a global long value and taking the element from a partition only if someFunction(partitionId, globalLong) == true? Or by using some specific partitioner that creates partitionIds that can be decomposed into nodeId and number of parti

Re: One element per node

2015-09-18 Thread Feynman Liang
AFAIK the physical distribution is not exposed in the public API; the closest I can think of is `rdd.coalesce(numPhysicalNodes).mapPartitions(...` but this assumes that exactly one partition exists per node. On Fri, Sep 18, 2015 at 4:09 PM, Ulanov, Alexander wrote: > Thank you! How can I guarantee that I

Re: One element per node

2015-09-18 Thread Reynold Xin
Use a global atomic boolean and return nothing from that partition if the boolean is true. Note that your result won't be deterministic. On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander wrote: Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical no
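
A minimal sketch of the atomic-boolean idea above, not code from the thread. `OnePerJvm`, `takeOnce` and `reset` are hypothetical names. It assumes the flag lives in a Scala `object`, which is instantiated once per JVM, so on a cluster each executor JVM would hold its own flag. It gives at most one element per executor JVM, not per physical node, and as noted above, which element wins depends on task scheduling.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper, not part of Spark.
// A Scala object is created once per JVM, so each executor JVM has its own flag.
object OnePerJvm {
  private val claimed = new AtomicBoolean(false)

  // Emits the first element of the partition only if no earlier call in this
  // JVM has already emitted one. Empty partitions never claim the flag.
  def takeOnce[T](partition: Iterator[T]): Iterator[T] =
    if (partition.hasNext && claimed.compareAndSet(false, true))
      Iterator.single(partition.next())
    else
      Iterator.empty

  // Clears the flag; otherwise it stays set for the lifetime of the JVM.
  def reset(): Unit = claimed.set(false)
}

// Intended use with an RDD (assumes an existing RDD named rdd):
//   val picked = rdd.mapPartitions(it => OnePerJvm.takeOnce(it))
```

Because the flag is never cleared automatically, a second job on the same executors would return nothing unless `reset()` is run on each executor first, and a retried task would not re-emit the element it claimed.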

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical node)? From: Feynman Liang [mailto:fli...@databricks.com] Sent: Friday, September 18, 2015 4:06 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: One element per node rdd.mapPartiti

Re: One element per node

2015-09-18 Thread Feynman Liang
rdd.mapPartitions(x => if (x.hasNext) Iterator(x.next()) else Iterator.empty) On Fri, Sep 18, 2015 at 3:57 PM, Ulanov, Alexander wrote: > Dear Spark developers, > > > > Is it possible (and how to do it if possible) to pick one element per > physical node from an RDD? Let’s say the first element of any partition on > that node.

One element per node

2015-09-18 Thread Ulanov, Alexander
Dear Spark developers, Is it possible (and if so, how) to pick one element per physical node from an RDD? Let's say the first element of any partition on that node. The result would be an RDD[element] whose element count equals the number of nodes that have partitions of the ini

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency directly in their Spark code or libraries? If so, how do you use it? Similarly, does anyone use ShuffleHandle

Re: And.eval short circuiting

2015-09-18 Thread Mingyu Kim
I filed SPARK-10703. Thanks! Mingyu From: Reynold Xin Date: Thursday, September 17, 2015 at 11:22 PM To: Mingyu Kim Cc: Zack Sampson, "dev@spark.apache.org", Peter Faiman, Matt Cheah, Michael Armbrust Subject: Re: And.eval short circuiting Please file a ticket and cc me. Thanks. On Thu,

Re: Reply: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Reynold Xin
Sounds good. On Fri, Sep 18, 2015 at 8:50 AM, Shixiong Zhu wrote: > I'm wondering if we should create a tag trait (e.g., LocalMessage) for > messages like this and add the comment in the trait. Looks better than > adding inline comments for all these messages. > > Best Regards, > Shixiong Zhu >

Re: [MLlib] BinaryLogisticRegressionSummary on test set

2015-09-18 Thread Feynman Liang
If you have the time, submitting a PR for it would be awesome! However, our review bandwidth is limited, so you should not expect it to get reviewed immediately. Let's continue discussion of the name on JIRA. On Fri, Sep 18, 2015 at 2:47 AM, Hao Ren wrote: > Thank you for the reply. > > I have cre

Re: Reply: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Shixiong Zhu
I'm wondering if we should create a tag trait (e.g., LocalMessage) for messages like this and add the comment in the trait. Looks better than adding inline comments for all these messages. Best Regards, Shixiong Zhu 2015-09-18 15:10 GMT+08:00 Reynold Xin : > Maybe we should add some inline comme

Re: RDD API patterns

2015-09-18 Thread sim
@debasish83, yes, there are many ways to optimize and work around the limitation of no nested RDDs. The point of this thread is to discuss the API patterns of Spark in order to make the platform more accessible to lots of developers solving interesting problems quickly. We can get API consistency w

Re: RDD API patterns

2015-09-18 Thread sim
Robin, my point exactly. When an API is valuable, let's expose it in a way that it may be used easily for all data Spark touches. It should not require much development work to implement the sampling logic to work for an Iterable as opposed to an RDD. -- View this message in context: http://apa

Re: RDD API patterns

2015-09-18 Thread sim
Juan, thanks for sharing this. I am facing what looks like a similar issue having to do with variable grouped upsampling (sampling some groups at different rates, sometimes > 100%). I will study the approach you took. As for the topic of this thread, I think it is important to separate two issues:

Re: RDD API patterns

2015-09-18 Thread sim
Aniket, yes, I've done the separate file trick. :) Still, I think we can solve this problem without nested RDDs. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14192.html Sent from the Apache Spark Developers List mailing lis

Re: RDD API patterns

2015-09-18 Thread sim
Thanks everyone for the comments! I waited for more replies to come before I responded as I was interested in the community's opinion. The thread I'm noticing in this thread (pun intended) is that most responses focus on the nested RDD issue. I think we all agree that it is problematic for many r

Re: [MLlib] BinaryLogisticRegressionSummary on test set

2015-09-18 Thread Hao Ren
Thank you for the reply. I have created a JIRA issue and pinged mengxr. Here is the link: https://issues.apache.org/jira/browse/SPARK-10691 I did not find jkbradley on JIRA, but I saw he is on GitHub. BTW, should I create a pull request removing the private modifier for further discussion? Thx

Re: Reply: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Reynold Xin
Maybe we should add an inline comment explaining why it is OK for that message not to be serializable. On Thu, Sep 17, 2015 at 4:08 AM, Huangguowei wrote: > Thanks for your reply. I just want to do some monitoring, never mind! > > > > *From:* Shixiong Zhu [mailto:zsxw...@gmail.com] > *Sent:* 201