[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37749785 For things like n-grams, isn't it okay to do them just per-partition and not worry about doing stuff across partitions? I agree that both this approach and the one in https

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37748284 @pwendell I was referring not to the actual implementation, but expectation when using the exposed API.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37742731 @mridulm I think the RDD definition is actually `private[spark]` and it's just intended to be used internally for higher level algorithms.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37736435 To take a step back, given how niche this seems to be and how it violates the "usual" expectations of how our users use Spark (lazy execution, etc., as mentioned above) - d

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37735835 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13197/

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37735834 Merged build finished.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37734634 It is hard to say what threshold to use. I couldn't think of a use case that requires a large window size, but I cannot say there is none. Another possible approach

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37734242 Merged build started.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37734241 Merged build triggered.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37734195 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13195/

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37734193 Merged build finished.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37733908 Even if it's private we can end up with cases where users have, e.g., a 10,000-partition RDD with only a few items in each partition. Do we know a priori when calling this

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37733845 Ah I see - so this isn't going to be externally a user-visible class (I didn't notice it was `private[spark]`)? Would it make sense to throw an assertion error if the sli

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37732906 @pwendell, the limit case is not a practical example. In that case, we need to re-partition for most operations to be efficient. Also, this is really for small window sizes l

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/136#discussion_r10635646 --- Diff: core/src/main/scala/org/apache/spark/rdd/SlidedRDD.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under o

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/136#discussion_r10635644 --- Diff: core/src/main/scala/org/apache/spark/rdd/SlidedRDD.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under o

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37732586 Merged build triggered.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37732587 Merged build started.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37732233 I don't think we typically run jobs inside of getPartitions - so this changes some semantics of calling that function. For instance a lot of the other RDD constructors im

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/136#discussion_r10635557 --- Diff: core/src/main/scala/org/apache/spark/rdd/SlidedRDD.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under o

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/136#discussion_r10635447 --- Diff: core/src/main/scala/org/apache/spark/rdd/SlidedRDD.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/136#discussion_r10635444 --- Diff: core/src/main/scala/org/apache/spark/rdd/SlidedRDD.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-15 Thread pwendell
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/136#discussion_r10635413 --- Diff: core/src/main/scala/org/apache/spark/rdd/SlidedRDD.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37583751 Merged build finished.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37583753 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13166/

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37577529 Merged build started.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-37577527 Merged build triggered.

[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-13 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/136 [SPARK-1241] Add sliding to RDD Sliding is useful for operations like creating n-grams, calculating total variation, numerical integration, etc. This is similar to https://github.com/apache/incubator