[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-02-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/40#discussion_r10188555
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala ---
@@ -149,7 +149,14 @@ object GradientDescent extends Logging {
 
 // Initialize weights as a column vector
 var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*)
-var regVal = 0.0
+
+/**
+ * For the first iteration, the regVal will be initialized as the sum of the
+ * weight squares if it's an L2 update; for an L1 update, the same logic is followed.
+ */
+var regVal = updater.compute(
--- End diff --

just a style nitpick here, since @mengxr asked me to chime in.

it would be better if you just put weights and the rest on the same line, 
e.g.

```scala
var regVal = updater.compute(
  weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
```
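
For context, the Updater contract assumed by that call returns the updated
weights together with the regularization value, so ._2 picks out the initial
regVal. A hedged sketch of that signature (based on the jblas-era MLlib API;
the real definition lives in Updater.scala):

```scala
import org.jblas.DoubleMatrix

// Sketch of the assumed contract: compute returns (new weights, reg value).
// Calling it with a zero gradient and stepSize 0 leaves the weights unchanged
// and yields just the regularization value for the initial weights.
abstract class Updater extends Serializable {
  def compute(
      weightsOld: DoubleMatrix,
      gradient: DoubleMatrix,
      stepSize: Double,
      iter: Int,
      regParam: Double): (DoubleMatrix, Double)
}
```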


---


[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-02-28 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/40#issuecomment-36413254
  
Jenkins, add to whitelist.


---


[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-02-28 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/40#issuecomment-36413377
  
It's not you. There were somehow two Jenkins pull request builders set up ... 
I just removed one of them. 


---


[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-02-28 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/40#issuecomment-36413420
  
Jenkins, retest this please.


---


[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-02-28 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/40#issuecomment-36413838
  
It is running now. Let's wait for Jenkins to come back.


---


[GitHub] spark pull request: Remove remaining references to incubation

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/51#issuecomment-36448947
  
Jenkins, retest this please.


---


[GitHub] spark pull request: fix #SPARK-1149 Bad partitioners can cause Spa...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/44#issuecomment-36449005
  
Hi guys,

I think it is better to make sure Spark doesn't hang when an incorrect 
partition index is given, because there will be other code paths that run a job.  
Given the two places @CodingCat found, I think it shouldn't be too hard to fix 
those. @witgo do you mind doing that instead?

One thing is for sure: we shouldn't add a per-key check -- that can be too 
expensive.
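
For illustration, a hedged sketch of the cheaper alternative (the helper name
is hypothetical, not an existing Spark API): validate the requested partition
indices once at job-submission time rather than checking every key:

```scala
// Hypothetical one-time check at job-submission time. A bad partitioner that
// produced an out-of-range index fails fast here instead of hanging the job.
def checkPartitionIndices(requested: Seq[Int], numPartitions: Int): Unit = {
  requested.foreach { p =>
    require(p >= 0 && p < numPartitions,
      s"Invalid partition index $p; expected a value in [0, $numPartitions)")
  }
}
```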


---


[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/41#issuecomment-36449019
  
Thanks. I've merged this.


---


[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/41#issuecomment-36449064
  
Actually I'm having trouble merging this. I think it's because your git 
commit doesn't actually have any author information. Do you mind fixing that?

cc @pwendell here since he wrote that merge script
```
rxin @ rxin-mbp : /scratch/rxin/prs/spark-dev-tools/dev 
> ./merge_spark_pr.py 
Which pull request would you like to merge? (e.g. 34): 41

=== Pull Request #41 ===
title   Update io.netty from 4.0.13 Final to 4.0.17.Final
source  ngbinh/master
target  master
url https://api.github.com/repos/apache/spark/pulls/41

Proceed with merging pull request #41? (y/n): y
Switched to branch 'PR_TOOL_MERGE_PR_41_MASTER'
fatal: No existing author found with '""'
Traceback (most recent call last):
  File "./merge_spark_pr.py", line 197, in <module>
    merge_hash = merge_pr(pr_num, target_ref)
  File "./merge_spark_pr.py", line 114, in merge_pr
    run_cmd(['git', 'commit', '--author="%s"' % primary_author] + merge_message_flags)
  File "./merge_spark_pr.py", line 61, in run_cmd
    return subprocess.check_output(cmd)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'commit', '--author=""', 
'-m', u'Update io.netty from 4.0.13 Final to 4.0.17.Final', '-m', u'This update 
contains a lot of bug fixes and some new perf improvements.\r\nIt is also 
binary compatible with the current 4.0.13.Final\r\n\r\nFor more information: 
http://netty.io/news/2014/02/25/4-0-17-Final.html', '-m', 'Author: ', '-m', 
u'Closes #41 from ngbinh/master and squashes the following commits:', '-m', 
'']' returned non-zero exit status 128
```


---


[GitHub] spark pull request: Updated more links in documentation

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/23#issuecomment-36449072
  
#2 was actually merged. @jyotiska  do you mind closing this?


---


[GitHub] spark pull request: SPARK-1137: Make ZK PersistenceEngine not cras...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4#issuecomment-36449083
  
Jenkins, retest this please.


---


[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10#issuecomment-36449099
  
@ankurdave 


---


[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/40#issuecomment-36449165
  
Thanks. I've merged this.



---


[GitHub] spark pull request: Merge the old sbt-launch-lib.bash with the new...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/14#issuecomment-36449214
  
I've merged this. Thanks!



---


[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/41#issuecomment-36449234
  
Sorry @ngbinh, you misunderstood me. I think the problem is that the git 
commit metadata doesn't actually contain the author information. It could be 
that the email or the author info in the git commit itself is not set (it's not 
about the commit message). 


---


[GitHub] spark pull request: Initialized the regVal for first iteration in ...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/40#issuecomment-36449243
  
I think the asf git bot will close this once the change is synced on 
github. If it doesn't get closed tomorrow morning, please close this manually. 
Thanks!



---


[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/41#issuecomment-36449397
  
Thanks. I've merged this!



---


[GitHub] spark pull request: SPARK-1137: Make ZK PersistenceEngine not cras...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4#issuecomment-36449554
  
Thanks. I've merged this. 


---


[GitHub] spark pull request: Remove remaining references to incubation

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/51#issuecomment-36449552
  
Thanks @pwendell I merged this.


---


[GitHub] spark pull request: Ignore RateLimitedOutputStreamSuite for now.

2014-03-02 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/54

Ignore RateLimitedOutputStreamSuite for now.

This test has been flaky. We can re-enable it after @tdas has a chance to 
look at it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark ratelimit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/54.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #54


commit 1a121989986cecb8222de428c61b632ff53ebe30
Author: Reynold Xin 
Date:   2014-03-02T21:15:31Z

Ignore RateLimitedOutputStreamSuite for now.




---


[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...

2014-03-02 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/55

SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.

There was actually a problem with the RateLimitedOutputStream 
implementation where nothing is written during the first second because of 
integer rounding.

So RateLimitedOutputStream was overly aggressive in throttling.
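
A minimal sketch of the rounding issue (illustrative numbers and names, not
the actual RateLimitedOutputStream code):

```scala
val bytesPerSec = 10000L

// Buggy form: integer division truncates the elapsed time to whole seconds,
// so the allowed byte budget is 0 for the entire first second.
def allowedBytesBuggy(elapsedMillis: Long): Long =
  elapsedMillis / 1000 * bytesPerSec

// Fixed form: scale first, divide last, so the budget grows from t = 0.
def allowedBytesFixed(elapsedMillis: Long): Long =
  elapsedMillis * bytesPerSec / 1000

// At 500 ms: buggy gives 0, fixed gives 5000.
```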

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark ratelimitest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #55


commit 52ce1b7dbc4a05e189b0b45dfe2ba902241e8d4b
Author: Reynold Xin 
Date:   2014-03-03T00:27:41Z

SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.

There was actually a problem with the RateLimitedOutputStream 
implementation where the first second doesn't output anything because of 
integer rounding.




---


[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/55#issuecomment-36473419
  
@tdas @pwendell


---


[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/55#issuecomment-36473499
  
Also, I think we should reduce the running time of this 4-second test, but 
that's for another PR ...


---


[GitHub] spark pull request: SPARK-1173. Improve scala streaming docs.

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/64#issuecomment-36486791
  
Thanks Aaron. I've merged this. 


---


[GitHub] spark pull request: [Proposal] SPARK-1171: simplify the implementa...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/63#issuecomment-36486838
  
Jenkins, add to whitelist.


---


[GitHub] spark pull request: SPARK-1173. Improve scala streaming docs.

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/64#issuecomment-36486817
  
There's also a typo in the Java version of the doc. If you don't mind 
fixing that as well ... :)


---


[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/55#issuecomment-36487061
  
Also @ryanlecompte, since you changed the implementation to use tail recursion. 


---


[GitHub] spark pull request: SPARK-1173. Improve scala streaming docs.

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/64#issuecomment-36487507
  
Actually you will need to submit another PR. I've already merged this one 
(but github is laggy because it is waiting for the asf git bot to synchronize). 
Sorry about the confusion!



---


[GitHub] spark pull request: SPARK-1173. (#2) Fix typo in Java streaming ex...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/65#issuecomment-36487753
  
Thanks for doing this.

BTW one thing I noticed is that your git commit's email is different from 
the ones you registered on github, so your commits don't actually show up as 
yours. You might want to fix that.



---


[GitHub] spark pull request: SPARK-1173. (#2) Fix typo in Java streaming ex...

2014-03-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/65#issuecomment-36487798
  
I merged this one too.


---


[GitHub] spark pull request: Fixed API docs link in Python programming guid...

2014-03-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/67#issuecomment-36562029
  
Hi @jyotiska 

These docs are not meant to be consumed directly as markdown files. They 
are meant to be built with jekyll (run jekyll build in the docs folder), which 
generates the documentation here: 
http://spark.incubator.apache.org/docs/latest/python-programming-guide.html

Changing the extension to .md breaks all the links. 


---


[GitHub] spark pull request: Removed accidentally checked in comment

2014-03-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/61#issuecomment-36569896
  
I merged this. Thanks!


---


[GitHub] spark pull request: update proportion of memory

2014-03-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/66#issuecomment-36570035
  
Thanks. I've merged this. 


---


[GitHub] spark pull request: SPARK-1164 Deprecated reduceByKeyToDriver as i...

2014-03-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/72#issuecomment-36656843
  
Thanks. Merged.


---


[GitHub] spark pull request: SPARK-1178: missing document of spark.schedule...

2014-03-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/74#issuecomment-36656962
  
Thanks. Merged.


---


[GitHub] spark pull request: SPARK-782. Shade ASM

2014-03-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/90#discussion_r10332129
  
--- Diff: graphx/pom.xml ---
@@ -70,6 +70,10 @@
   scalacheck_${scala.binary.version}
   test
 
+
--- End diff --

Yes. 


---


[GitHub] spark pull request: GRAPH-1: Map side distinct in collect vertex i...

2014-03-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/21#issuecomment-36949572
  
Actually - no ...


---


[GitHub] spark pull request: GRAPH-1: Map side distinct in collect vertex i...

2014-03-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/21#issuecomment-36949585
  
We should use the primitive hashmap - otherwise it is pretty slow 
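
For illustration, a hedged sketch of what that could look like (the function
is hypothetical; OpenHashSet is Spark's primitive-specialized set):

```scala
import org.apache.spark.util.collection.OpenHashSet

// Hypothetical map-side distinct over the vertex ids of one partition.
// A primitive-specialized set avoids boxing every Long into an object,
// which is where a java.util.HashSet[java.lang.Long] loses most of its time.
def distinctVertexIds(ids: Iterator[Long]): Iterator[Long] = {
  val set = new OpenHashSet[Long]
  ids.foreach(set.add)
  set.iterator
}
```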


---


[GitHub] spark pull request: Allow sbt to use more than 1G of heap.

2014-03-07 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/103

Allow sbt to use more than 1G of heap.

There was a mistake in the sbt build file (introduced by 
012bd5fbc97dc40bb61e0e2b9cc97ed0083f37f6) in which we set the default to 2048 
and then immediately reset it to 1024.

Without this fix, building Spark can run out of permgen space on my machine. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark sbt

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #103


commit 8829c34517f4959e6899125747ac6160bd2c46a6
Author: Reynold Xin 
Date:   2014-03-08T06:46:11Z

Allow sbt to use more than 1G of heap.




---


[GitHub] spark pull request: Allow sbt to use more than 1G of heap.

2014-03-07 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/103#issuecomment-37091459
  
Ok I merged this. Thanks.


---


[GitHub] spark pull request: Update junitxml plugin to the latest version t...

2014-03-08 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/104

Update junitxml plugin to the latest version to avoid recompilation in 
every SBT command.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark junitxml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #104


commit 67ef7bffd92a30b8d81c072ad1c504eb3a53d264
Author: Reynold Xin 
Date:   2014-03-08T09:41:06Z

Update junitxml plugin to the latest version to avoid recompilation in 
every SBT command.




---


[GitHub] spark pull request: Update junitxml plugin to the latest version t...

2014-03-08 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/104#issuecomment-37109170
  
Ok I merged this.

Not sure about Maven off the top of my head. All these build plugins are 
pretty arcane to me. 


---


[GitHub] spark pull request: Upgrade Jetty to 9.1.3.v20140225.

2014-03-09 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/113

Upgrade Jetty to 9.1.3.v20140225.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark jetty9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #113


commit 16ab3f3551d5181e185c503b954774e826c31fe0
Author: Reynold Xin 
Date:   2014-03-10T04:32:50Z

Upgrade Jetty to 9.1.3.v20140225.




---


[GitHub] spark pull request: maintain arbitrary state data for each key

2014-03-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/114#issuecomment-37154805
  
Thanks. Merged.


---


[GitHub] spark pull request: Upgrade Jetty to 9.1.3.v20140225.

2014-03-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/113#discussion_r10420573
  
--- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala ---
@@ -120,26 +120,25 @@ private[spark] object JettyUtils extends Logging {
 
   private def addFilters(handlers: Seq[ServletContextHandler], conf: SparkConf) {
     val filters: Array[String] = conf.get("spark.ui.filters", "").split(',').map(_.trim())
-    filters.foreach {
-      case filter : String =>
-        if (!filter.isEmpty) {
-          logInfo("Adding filter: " + filter)
-          val holder : FilterHolder = new FilterHolder()
-          holder.setClassName(filter)
-          // get any parameters for each filter
-          val paramName = "spark." + filter + ".params"
-          val params = conf.get(paramName, "").split(',').map(_.trim()).toSet
-          params.foreach {
-            case param : String =>
-              if (!param.isEmpty) {
-                val parts = param.split("=")
-                if (parts.length == 2) holder.setInitParameter(parts(0), parts(1))
-              }
-          }
-          val enumDispatcher = java.util.EnumSet.of(DispatcherType.ASYNC, DispatcherType.ERROR,
-            DispatcherType.FORWARD, DispatcherType.INCLUDE, DispatcherType.REQUEST)
-          handlers.foreach { case(handler) => handler.addFilter(holder, "/*", enumDispatcher) }
+    filters.foreach { filter =>
--- End diff --

Here it's only a coding style change for the block. 
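
In other words, the two forms behave identically; the pattern match adds
nothing when the element type is already String. A tiny standalone
illustration:

```scala
val filters = Seq("filterA", "filterB")

// Old form: a partial-function literal with a redundant type pattern.
filters.foreach { case filter: String => println(filter) }

// New form: a plain function literal -- same behavior, less ceremony.
filters.foreach { filter => println(filter) }
```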


---


[GitHub] spark pull request: Upgrade Jetty to 9.1.3.v20140225.

2014-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37158460
  
Jenkins, retest this please.


---


[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.

2014-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37212561
  
Actually I was waiting for the pull request builder to come back before 
asking you to verify it. @tgravescs do you mind verifying this works for you? I 
haven't updated the Maven files yet. Let me know if you only build with Maven. 



---


[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.

2014-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37213133
  
The main reason is that we are investigating upgrading some of the major 
dependencies prior to 1.0, after which we won't be able to upgrade for a while. 
Some users have requested newer versions of Jetty that conflict with the very 
old version we ship.


---


[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.

2014-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37230554
  
Yea, unfortunately that's been the case for the past few days. We should 
probably have another external repo to host the artifacts that are only on the 
cloudera repo, to get some redundancy. 


---


[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.

2014-03-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37249125
  
Ok I think the cloudera repo is up. @tgravescs it would be great if you could 
try this.


---


[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.

2014-03-11 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37377281
  
That sounds good. @pwendell should make the call here ...


---


[GitHub] spark pull request: SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.

2014-03-12 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37482277
  
Ok I pushed a new version with Maven build changes as well. This is ready 
to be merged from my perspective. 


---


[GitHub] spark pull request: SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.

2014-03-12 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/113#issuecomment-37483693
  
Jenkins, retest this please.


---


[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/126#discussion_r10552638
  
--- Diff: core/src/main/scala/org/apache/spark/util/TimeStampedWeakValueHashMap.scala ---
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import scala.collection.{JavaConversions, immutable}
+
+import java.util
+import java.lang.ref.WeakReference
+import java.util.concurrent.ConcurrentHashMap
+
+import org.apache.spark.Logging
+
+private[util] case class TimeStampedWeakValue[T](timestamp: Long, weakValue: WeakReference[T]) {
+  def this(timestamp: Long, value: T) = this(timestamp, new WeakReference[T](value))
+}
+
+/**
+ * A map that stores the timestamp of when a key was inserted along with the value,
+ * while ensuring that the values are weakly referenced. If the value is garbage collected and
+ * the weak reference is null, the get() operation reports the key as non-existent. However,
+ * the key is actually not removed in the current implementation. Key-value pairs whose
--- End diff --

I haven't looked at the context, but can't you just store the key and the 
value together? 
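
One way to read that suggestion, as a hypothetical sketch (not the PR's code):
make the weak reference itself carry the key, so a reaped value identifies
exactly which entry to drop:

```scala
import java.lang.ref.{Reference, ReferenceQueue, WeakReference}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical: a WeakReference subclass that remembers its key. Combined with
// a ReferenceQueue, cleanup can remove exactly the keys whose values were
// garbage collected instead of leaving stale keys in the map.
class KeyedRef[K, V <: AnyRef](val key: K, value: V, queue: ReferenceQueue[V])
  extends WeakReference[V](value, queue)

class WeakValueMap[K, V <: AnyRef] {
  private val queue = new ReferenceQueue[V]()
  private val map = new ConcurrentHashMap[K, KeyedRef[K, V]]()

  def put(key: K, value: V): Unit = {
    drainQueue()
    map.put(key, new KeyedRef(key, value, queue))
  }

  def get(key: K): Option[V] =
    Option(map.get(key)).flatMap(ref => Option(ref.get()))

  // Drop entries whose values were collected; the reference carries the key.
  private def drainQueue(): Unit = {
    var ref: Reference[_ <: V] = queue.poll()
    while (ref != null) {
      val keyed = ref.asInstanceOf[KeyedRef[K, V]]
      map.remove(keyed.key, keyed)  // remove only if still mapped to this ref
      ref = queue.poll()
    }
  }
}
```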


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552654
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala ---
@@ -35,9 +35,9 @@ private[storage] object BlockManagerMessages {
   case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave
 
 
-  ////////////////////////////////////////////////////////////////////////////////
+  // /////////////////////////////////////////////////////////////////////////////

   // Messages from slaves to the master.
-  ////////////////////////////////////////////////////////////////////////////////
+  // /////////////////////////////////////////////////////////////////////////////

   sealed trait ToBlockManagerMaster
--- End diff --

It seems like it's better for us to either fix the style checker or 
replace the // with something else. Using a hint to turn the checker off for 
this comment block is pretty weird.


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552655
  
--- Diff: core/src/test/scala/org/apache/spark/CheckpointSuite.scala ---
@@ -432,7 +432,7 @@ object CheckpointSuite {
   // This is a custom cogroup function that does not use mapValues like
   // the PairRDDFunctions.cogroup()
   def cogroup[K, V](first: RDD[(K, V)], second: RDD[(K, V)], part: 
Partitioner) = {
-//println("First = " + first + ", second = " + second)
+// println("First = " + first + ", second = " + second)
--- End diff --

Just remove this line


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552666
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala ---
@@ -54,7 +54,7 @@ object SparkALS {
 for (i <- 0 until M; j <- 0 until U) {
   r.set(i, j, blas.ddot(ms(i), us(j)))
 }
-//println("R: " + r)
+// println("R: " + r)
--- End diff --

remove this line also


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552662
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala ---
@@ -53,7 +53,7 @@ object LocalALS {
 for (i <- 0 until M; j <- 0 until U) {
   r.set(i, j, blas.ddot(ms(i), us(j)))
 }
-//println("R: " + r)
+// println("R: " + r)
--- End diff --

remove this line


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552671
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/SparkHdfsLR.scala ---
@@ -34,8 +34,8 @@ object SparkHdfsLR {
   case class DataPoint(x: Vector, y: Double)
 
   def parsePoint(line: String): DataPoint = {
-//val nums = line.split(' ').map(_.toDouble)
-//return DataPoint(new Vector(nums.slice(1, D+1)), nums(0))
+// val nums = line.split(' ').map(_.toDouble)
--- End diff --

remove these 2 lines


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552685
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/NetworkInputDStream.scala ---
@@ -128,7 +128,7 @@ abstract class NetworkReceiver[T: ClassTag]() extends Serializable with Logging
 } catch {
   case ie: InterruptedException =>
 logInfo("Receiving thread interrupted")
-//println("Receiving thread interrupted")
+// println("Receiving thread interrupted")
--- End diff --

remove this line


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552681
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/Serializers.scala ---
@@ -298,7 +298,7 @@ abstract class ShuffleSerializationStream(s: OutputStream) extends Serialization
 s.write(v.toInt)
   }
 
-  //def writeDouble(v: Double): Unit = writeUnsignedVarLong(java.lang.Double.doubleToLongBits(v))
+  // def writeDouble(v: Double): Unit = writeUnsignedVarLong(java.lang.Double.doubleToLongBits(v))
--- End diff --

remove this line


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552680
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/Serializers.scala ---
@@ -391,7 +391,7 @@ abstract class ShuffleDeserializationStream(s: InputStream) extends Deserializat
   (s.read() & 0xFF)
   }
 
-  //def readDouble(): Double = java.lang.Double.longBitsToDouble(readUnsignedVarLong())
+  // def readDouble(): Double = java.lang.Double.longBitsToDouble(readUnsignedVarLong())
--- End diff --

remove this line


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552712
  
--- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala ---
@@ -137,7 +137,7 @@ trait ClientBase extends Logging {
 } else if (srcHost != null && dstHost == null) {
   return false
 }
-//check for ports
+// check for ports
--- End diff --

this comment is pretty much useless. let's remove it.


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552696
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/StateDStream.scala ---
@@ -64,7 +64,7 @@ class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
 }
 val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
 val stateRDD = cogroupedRDD.mapPartitions(finalFunc, 
preservePartitioning)
-//logDebug("Generating state RDD for time " + validTime)
+// logDebug("Generating state RDD for time " + validTime)
--- End diff --

remove this line


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552705
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala ---
@@ -152,7 +152,7 @@ class InputStreamsSuite extends TestSuiteBase with 
BeforeAndAfter {
 // Set up the streaming context and input streams
 val ssc = new StreamingContext(conf, batchDuration)
 val networkStream = ssc.actorStream[String](Props(new 
TestActor(port)), "TestActor",
-  StorageLevel.MEMORY_AND_DISK) //Had to pass the local value of port to prevent from closing over entire scope
+  StorageLevel.MEMORY_AND_DISK) // Had to pass the local value of port to prevent from closing over entire scope
--- End diff --

this is going to be longer than 100 chars


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552723
  
--- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala ---
@@ -206,12 +206,12 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector,
 
   private class Outbox {
 val messages = new Queue[Message]()
-val defaultChunkSize = 65536  //32768 //16384
+val defaultChunkSize = 65536  // 32768 // 16384
--- End diff --

remove the comment


---


[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552716
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -278,7 +278,7 @@ private[spark] class Executor(
   // have left some weird state around depending on when the 
exception was thrown, but on
   // the other hand, maybe we could detect that when future tasks 
fail and exit then.
   logError("Exception in task ID " + taskId, t)
-  //System.exit(1)
+  // System.exit(1)
--- End diff --

remove this line


---


[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/126#discussion_r10552834
  
--- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer}
+
+import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
+
+/** Listener class used for testing when any item has been cleaned by the Cleaner class */
+private[spark] trait CleanerListener {
+  def rddCleaned(rddId: Int)
+  def shuffleCleaned(shuffleId: Int)
+}
+
+/**
+ * Cleans RDDs and shuffle data.
+ */
+private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
+
+  /** Classes to represent cleaning tasks */
+  private sealed trait CleaningTask
+  private case class CleanRDD(rddId: Int) extends CleaningTask
+  private case class CleanShuffle(shuffleId: Int) extends CleaningTask
+  // TODO: add CleanBroadcast
+
+  private val queue = new LinkedBlockingQueue[CleaningTask]
+
+  protected val listeners = new ArrayBuffer[CleanerListener]
+    with SynchronizedBuffer[CleanerListener]
+
+  private val cleaningThread = new Thread() { override def run() { keepCleaning() }}
+
+  @volatile private var stopped = false
+
+  /** Start the cleaner */
+  def start() {
+    cleaningThread.setDaemon(true)
+    cleaningThread.start()
+  }
+
+  /** Stop the cleaner */
+  def stop() {
+    stopped = true
+    cleaningThread.interrupt()
+  }
+
+  /**
+   * Clean (unpersist) RDD data. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  def cleanRDD(rddId: Int) {
+    enqueue(CleanRDD(rddId))
+    logDebug("Enqueued RDD " + rddId + " for cleaning up")
+  }
+
+  /**
+   * Clean shuffle data. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  def cleanShuffle(shuffleId: Int) {
+    enqueue(CleanShuffle(shuffleId))
+    logDebug("Enqueued shuffle " + shuffleId + " for cleaning up")
+  }
+
+  /** Attach a listener object to get information of when objects are cleaned. */
+  def attachListener(listener: CleanerListener) {
+    listeners += listener
+  }
+
+  /**
+   * Enqueue a cleaning task. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  private def enqueue(task: CleaningTask) {
+    queue.put(task)
+  }
+
+  /** Keep cleaning RDDs and shuffle data */
+  private def keepCleaning() {
+    try {
+      while (!isStopped) {
+        val taskOpt = Option(queue.poll(100, TimeUnit.MILLISECONDS))
+        taskOpt.foreach(task => {
--- End diff --

a nit style thing - Spark style usually does
```scala
taskOpt.foreach { task =>
  ...
}
```
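
Applied to the loop body quoted above, that would look roughly like this (a sketch using the names from this PR, not the final patch):

```scala
// Sketch: the quoted foreach rewritten in the curly-brace style
taskOpt.foreach { task =>
  logDebug("Got cleaning task " + task)
  task match {
    case CleanRDD(rddId) => doCleanRDD(sc, rddId)
    case CleanShuffle(shuffleId) => doCleanShuffle(shuffleId)
  }
}
```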




[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/126#discussion_r10552839
  
--- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer}
+
+import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
+
+/** Listener class used for testing when any item has been cleaned by the Cleaner class */
+private[spark] trait CleanerListener {
+  def rddCleaned(rddId: Int)
+  def shuffleCleaned(shuffleId: Int)
+}
+
+/**
+ * Cleans RDDs and shuffle data.
+ */
+private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
+
+  /** Classes to represent cleaning tasks */
+  private sealed trait CleaningTask
+  private case class CleanRDD(rddId: Int) extends CleaningTask
+  private case class CleanShuffle(shuffleId: Int) extends CleaningTask
+  // TODO: add CleanBroadcast
+
+  private val queue = new LinkedBlockingQueue[CleaningTask]
+
+  protected val listeners = new ArrayBuffer[CleanerListener]
+    with SynchronizedBuffer[CleanerListener]
+
+  private val cleaningThread = new Thread() { override def run() { keepCleaning() }}
+
+  @volatile private var stopped = false
+
+  /** Start the cleaner */
+  def start() {
+    cleaningThread.setDaemon(true)
+    cleaningThread.start()
+  }
+
+  /** Stop the cleaner */
+  def stop() {
+    stopped = true
+    cleaningThread.interrupt()
+  }
+
+  /**
+   * Clean (unpersist) RDD data. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  def cleanRDD(rddId: Int) {
+    enqueue(CleanRDD(rddId))
+    logDebug("Enqueued RDD " + rddId + " for cleaning up")
+  }
+
+  /**
+   * Clean shuffle data. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  def cleanShuffle(shuffleId: Int) {
+    enqueue(CleanShuffle(shuffleId))
+    logDebug("Enqueued shuffle " + shuffleId + " for cleaning up")
+  }
+
+  /** Attach a listener object to get information of when objects are cleaned. */
+  def attachListener(listener: CleanerListener) {
+    listeners += listener
+  }
+
+  /**
+   * Enqueue a cleaning task. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  private def enqueue(task: CleaningTask) {
+    queue.put(task)
+  }
+
+  /** Keep cleaning RDDs and shuffle data */
+  private def keepCleaning() {
+    try {
+      while (!isStopped) {
+        val taskOpt = Option(queue.poll(100, TimeUnit.MILLISECONDS))
+        taskOpt.foreach(task => {
+          logDebug("Got cleaning task " + taskOpt.get)
+          task match {
+            case CleanRDD(rddId) => doCleanRDD(sc, rddId)
+            case CleanShuffle(shuffleId) => doCleanShuffle(shuffleId)
+          }
+        })
+      }
+    } catch {
+      case ie: InterruptedException =>
+        if (!isStopped) logWarning("Cleaning thread interrupted")
--- End diff --

wrap the body in curly braces




[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/126#discussion_r10552846
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -20,15 +20,15 @@ package org.apache.spark
 import java.io._
 import java.util.zip.{GZIPInputStream, GZIPOutputStream}
 
-import scala.collection.mutable.HashSet
+import scala.Some
--- End diff --

remove `scala.Some` (it's redundant; Some is already in scope via the scala package, and IDEs sometimes insert this import automatically)




[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10552859
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala ---
@@ -35,9 +35,9 @@ private[storage] object BlockManagerMessages {
   case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave
 
 
-  //////////////////////////////////////////////////////////////////////
+  // ///////////////////////////////////////////////////////////////////

   // Messages from slaves to the master.
-  //////////////////////////////////////////////////////////////////////
+  // ///////////////////////////////////////////////////////////////////

   sealed trait ToBlockManagerMaster
--- End diff --

It looks really strange to add a space there.

If you do

```scala
/* ---
```

that's probably slightly better ...

or can we modify the rule so it does not report a violation when the comment is used like this? (i.e. a `/` immediately after `//`)
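
For what it's worth, a minimal sketch of such a relaxed check, assuming a simple regex-based rule (hypothetical; not the actual Scalastyle implementation):

```scala
// Flag "//" only when it is followed by a non-space, non-slash
// character, so separators like "// ////..." pass the check.
val badCommentStart = """//(?![ /]|$)""".r

def violates(line: String): Boolean = badCommentStart.findFirstIn(line).isDefined

// violates("//no space")  => true
// violates("// fine")     => false
// violates("// ///////")  => false
```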




[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/126#discussion_r10552941
  
--- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer}
+
+import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
+
+/** Listener class used for testing when any item has been cleaned by the Cleaner class */
+private[spark] trait CleanerListener {
+  def rddCleaned(rddId: Int)
+  def shuffleCleaned(shuffleId: Int)
+}
+
+/**
+ * Cleans RDDs and shuffle data.
+ */
+private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
+
+  /** Classes to represent cleaning tasks */
+  private sealed trait CleaningTask
+  private case class CleanRDD(rddId: Int) extends CleaningTask
+  private case class CleanShuffle(shuffleId: Int) extends CleaningTask
+  // TODO: add CleanBroadcast
+
+  private val queue = new LinkedBlockingQueue[CleaningTask]
+
+  protected val listeners = new ArrayBuffer[CleanerListener]
+    with SynchronizedBuffer[CleanerListener]
+
+  private val cleaningThread = new Thread() { override def run() { keepCleaning() }}
+
+  @volatile private var stopped = false
+
+  /** Start the cleaner */
+  def start() {
+    cleaningThread.setDaemon(true)
+    cleaningThread.start()
+  }
+
+  /** Stop the cleaner */
+  def stop() {
+    stopped = true
+    cleaningThread.interrupt()
+  }
+
+  /**
+   * Clean (unpersist) RDD data. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  def cleanRDD(rddId: Int) {
+    enqueue(CleanRDD(rddId))
+    logDebug("Enqueued RDD " + rddId + " for cleaning up")
+  }
+
+  /**
+   * Clean shuffle data. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  def cleanShuffle(shuffleId: Int) {
+    enqueue(CleanShuffle(shuffleId))
+    logDebug("Enqueued shuffle " + shuffleId + " for cleaning up")
+  }
+
+  /** Attach a listener object to get information of when objects are cleaned. */
+  def attachListener(listener: CleanerListener) {
+    listeners += listener
+  }
+
+  /**
+   * Enqueue a cleaning task. Do not perform any time or resource intensive
+   * computation in this function as this is called from a finalize() function.
+   */
+  private def enqueue(task: CleaningTask) {
+    queue.put(task)
+  }
+
+  /** Keep cleaning RDDs and shuffle data */
+  private def keepCleaning() {
+    try {
+      while (!isStopped) {
+        val taskOpt = Option(queue.poll(100, TimeUnit.MILLISECONDS))
+        taskOpt.foreach(task => {
+          logDebug("Got cleaning task " + taskOpt.get)
+          task match {
+            case CleanRDD(rddId) => doCleanRDD(sc, rddId)
+            case CleanShuffle(shuffleId) => doCleanShuffle(shuffleId)
+          }
+        })
+      }
+    } catch {
+      case ie: InterruptedException =>
+        if (!isStopped) logWarning("Cleaning thread interrupted")
--- End diff --

latter
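
(That is, wrap the body of the `if`, not the `case`.) A minimal sketch of the suggested shape, with `keepCleaningLoop` standing in for the polling loop from the diff:

```scala
try {
  keepCleaningLoop()  // placeholder for the while/poll loop above
} catch {
  case ie: InterruptedException =>
    if (!isStopped) {
      logWarning("Cleaning thread interrupted")
    }
}
```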




[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/126#discussion_r10552981
  
--- Diff: core/src/main/scala/org/apache/spark/util/BoundedHashMap.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import scala.collection.mutable.{ArrayBuffer, SynchronizedMap}
+
+import java.util.{Collections, LinkedHashMap}
+import java.util.Map.{Entry => JMapEntry}
+import scala.reflect.ClassTag
+
+/**
+ * A map that upper bounds the number of key-value pairs present in it. It can be configured to
+ * drop the least recently used pair or the earliest inserted pair. It exposes a
+ * scala.collection.mutable.Map interface to allow it to be a drop-in replacement for Scala
+ * HashMaps.
+ *
+ * Internally, a Java LinkedHashMap is used to get insert-order or access-order behavior.
+ * Note that the LinkedHashMap is not thread-safe and hence, it is wrapped in a
+ * Collections.synchronizedMap. However, getting the Java HashMap's iterator and
+ * using it can still lead to ConcurrentModificationExceptions. Hence, the iterator()
+ * function is overridden to copy all pairs into an ArrayBuffer and then return the
+ * iterator to the ArrayBuffer. Also, the class applies the trait SynchronizedMap which
+ * ensures that all calls to the Scala Map API are synchronized. This together ensures
+ * that ConcurrentModificationException is never thrown.
+ *
+ * @param bound   max number of key-value pairs
+ * @param useLRU  true = least recently used/accessed will be dropped when bound is reached,
+ *                false = earliest inserted will be dropped
+ */
+private[spark] class BoundedHashMap[A, B](bound: Int, useLRU: Boolean)
+  extends WrappedJavaHashMap[A, B, A, B] with SynchronizedMap[A, B] {
+
+  protected[util] val internalJavaMap = Collections.synchronizedMap(new LinkedHashMap[A, B](
+    bound / 8, (0.75).toFloat, useLRU) {
+    override protected def removeEldestEntry(eldest: JMapEntry[A, B]): Boolean = {
+      size() > bound
+    }
+  })
+
+  protected[util] def newInstance[K1, V1](): WrappedJavaHashMap[K1, V1, _, _] = {
--- End diff --

what does protected[util] mean in Scala? That you can access it from all classes in util and from all subclasses? Pretty confusing here
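
For readers following along: `protected[util]` makes the member visible throughout the enclosing package `util`, in addition to subclasses. A minimal sketch with illustrative names:

```scala
package org.apache.spark.util

class Wrapped {
  // Visible to any code in package org.apache.spark.util,
  // and to subclasses of Wrapped even outside that package.
  protected[util] val internalMap = new java.util.HashMap[String, Int]()
}

class SamePackageUser {
  def size(w: Wrapped): Int = w.internalMap.size()  // OK: same package
}
```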




[GitHub] spark pull request: SPARK-1096, a space after comment start style ...

2014-03-12 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/124#discussion_r10553024
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala ---
@@ -35,9 +35,9 @@ private[storage] object BlockManagerMessages {
   case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave
 
 
-  //////////////////////////////////////////////////////////////////////
+  // ///////////////////////////////////////////////////////////////////

   // Messages from slaves to the master.
-  //////////////////////////////////////////////////////////////////////
+  // ///////////////////////////////////////////////////////////////////

   sealed trait ToBlockManagerMaster
--- End diff --

I will let @pwendell decide that ...




[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/126#issuecomment-37501259
  
@tdas I haven't finished looking at this (will probably spend more time 
after Fri) - but WrappedJavaHashMap is fairly complicated, and it seems like a 
recipe for complexity and potential performance disaster (due to clone & lots 
of locks).

How are its implementations used? Do you actually traverse an entire 
hashmap to find values? I am not sure if you want to do that in a cleanup. 





[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...

2014-03-12 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/126#issuecomment-37501863
  
If you don't need high performance, why not just put a normal immutable 
hashmap so you don't have to worry about concurrency?
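
A sketch of that alternative, assuming the bookkeeping fits in an immutable Map published through a volatile field (illustrative names, not code from this PR):

```scala
// Readers get a consistent snapshot without locking; writers
// synchronize only to publish a new immutable map.
class CleanableMetadata {
  @volatile private var entries = Map.empty[Int, String]

  def add(id: Int, name: String): Unit = synchronized {
    entries += (id -> name)
  }

  def remove(id: Int): Unit = synchronized {
    entries -= id
  }

  def lookup(id: Int): Option[String] = entries.get(id)  // lock-free read
}
```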




[GitHub] spark pull request: [SPARK-1237, 1238] Improve the computation of ...

2014-03-13 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/131#issuecomment-37507055
  
Thanks. I've merged this.




[GitHub] spark pull request: SPARK-1240: handle the case of empty RDD when ...

2014-03-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/135#discussion_r10579726
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -310,6 +310,9 @@ abstract class RDD[T: ClassTag](
    * Return a sampled subset of this RDD.
    */
   def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = {
+    if (fraction < Double.MinValue || fraction > Double.MaxValue) {
--- End diff --

Use `require`, e.g.
```scala
require(fraction > Double.MinValue && fraction < Double.MaxValue, "...")
```

Shouldn't you just check for fraction > 0 but < 1?
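
Combining the two points, a sketch of the guard (the [0, 1] bound assumes sampling without replacement; with replacement, a fraction above 1 can be meaningful as an expected count):

```scala
def validateFraction(fraction: Double): Unit = {
  // Rejects NaN as well as out-of-range values in one check.
  require(fraction >= 0.0 && fraction <= 1.0, "Invalid fraction value: " + fraction)
}
```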




[GitHub] spark pull request: SPARK-1240: handle the case of empty RDD when ...

2014-03-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/135#discussion_r10581086
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -457,6 +457,10 @@ class RDDSuite extends FunSuite with SharedSparkContext {
 
   test("takeSample") {
     val data = sc.parallelize(1 to 100, 2)
+    val emptySet = data.filter(_ => false)
--- End diff --

yup, do

```scala
data.mapPartitions { iter => Iterator.empty }
```
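
In the test, that might look like the following sketch (`takeSample` took an explicit seed argument at the time):

```scala
val data = sc.parallelize(1 to 100, 2)
// An empty RDD that keeps the same number of partitions
val emptySet = data.mapPartitions { iter => Iterator.empty }
assert(emptySet.takeSample(withReplacement = false, num = 20, seed = 1).length == 0)
```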




[GitHub] spark pull request: MLI-1 Decision Trees

2014-03-13 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-37607046
  
Jenkins, retest this please.




[GitHub] spark pull request: Fix serialization of MutablePair. Also provide...

2014-03-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/141#issuecomment-37681568
  
Merged.




[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-03-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-37691023
  
I was thinking maybe we want a config option for this - which is on by 
default, but can be turned off. What do you guys think?
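
A sketch of what such an option could look like; `spark.closure.checkSerializable` is an invented key for illustration, and `conf` and `func` are assumed to be in scope:

```scala
// Hypothetical flag: eagerly serialize the closure on the driver so a
// non-serializable closure fails fast, unless the user opts out.
val checkSerializable = conf.getBoolean("spark.closure.checkSerializable", defaultValue = true)
if (checkSerializable) {
  SparkEnv.get.closureSerializer.newInstance().serialize(func)
}
```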




[GitHub] spark pull request: Fix serialization of MutablePair. Also provide...

2014-03-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/141#issuecomment-37717648
  
I did: 
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=e19044cb1048c3755d1ea2cb43879d2225d49b54





[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...

2014-03-15 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/149

SPARK-1255: Allow user to pass Serializer object instead of class name for 
shuffle.

This is more general than simply passing a string name and leaves more room 
for performance optimizations.

Note that this is technically an API breaking change - but I suspect nobody 
else in this world has used this API other than me in GraphX and Shark.
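
For illustration, call sites can now construct the serializer directly instead of naming its class; a sketch, with `pairs` standing in for an RDD of key-value pairs and KryoSerializer shown as one concrete Serializer:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.serializer.KryoSerializer

// Before: new ShuffleDependency(pairs, partitioner, classOf[KryoSerializer].getName)
// After:  pass a configured Serializer instance
val dep = new ShuffleDependency[Int, String](
  pairs, new HashPartitioner(8), new KryoSerializer(sc.getConf))
```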



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark serializer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/149.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #149


commit 7420185373ca90950af0f24831ad3ef10c097acc
Author: Reynold Xin 
Date:   2014-03-15T08:09:15Z

Allow user to pass Serializer object instead of class name for shuffle.

This is more general than simply passing a string name and leaves more room 
for performance optimizations.






[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...

2014-03-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/149#issuecomment-37720234
  
@marmbrus this is for you!
 




[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...

2014-03-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/149#discussion_r10636356
  
--- Diff: core/src/main/scala/org/apache/spark/Dependency.scala ---
@@ -43,12 +44,13 @@ abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) {
  * Represents a dependency on the output of a shuffle stage.
  * @param rdd the parent RDD
  * @param partitioner partitioner used to partition the shuffle output
- * @param serializerClass class name of the serializer to use
+ * @param serializer [[Serializer]] to use. If set to null, the default serializer, as specified
+ *                   by `spark.serializer` config option, will be used.
  */
 class ShuffleDependency[K, V](
     @transient rdd: RDD[_ <: Product2[K, V]],
     val partitioner: Partitioner,
-    val serializerClass: String = null)
+    val serializer: Serializer = null)
--- End diff --

Compared with passing a string, this adds about 58 bytes to the task closure.






[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...

2014-03-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/149#discussion_r10636602
  
--- Diff: core/src/main/scala/org/apache/spark/Dependency.scala ---
@@ -43,12 +44,13 @@ abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) {
  * Represents a dependency on the output of a shuffle stage.
  * @param rdd the parent RDD
  * @param partitioner partitioner used to partition the shuffle output
- * @param serializerClass class name of the serializer to use
+ * @param serializer [[Serializer]] to use. If set to null, the default serializer, as specified
+ *                   by `spark.serializer` config option, will be used.
  */
 class ShuffleDependency[K, V](
     @transient rdd: RDD[_ <: Product2[K, V]],
     val partitioner: Partitioner,
-    val serializerClass: String = null)
+    val serializer: Serializer = null)
--- End diff --

No it wasn't. SparkConf was not serializable.

On Saturday, March 15, 2014, Mridul Muralidharan wrote:

> In core/src/main/scala/org/apache/spark/Dependency.scala:
>
> >   */
> >  class ShuffleDependency[K, V](
> >  @transient rdd: RDD[_ <: Product2[K, V]],
> >  val partitioner: Partitioner,
> > -val serializerClass: String = null)
> > +val serializer: Serializer = null)
>
> Is it 'cos the SparkConf is getting shipped around ?
> If yes, using the broadcast variable should alleviate it ?
>




[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...

2014-03-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/149#discussion_r10637576
  
--- Diff: core/src/main/scala/org/apache/spark/Dependency.scala ---
@@ -43,12 +44,13 @@ abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) {
  * Represents a dependency on the output of a shuffle stage.
  * @param rdd the parent RDD
  * @param partitioner partitioner used to partition the shuffle output
- * @param serializerClass class name of the serializer to use
+ * @param serializer [[Serializer]] to use. If set to null, the default serializer, as specified
+ *                   by `spark.serializer` config option, will be used.
  */
 class ShuffleDependency[K, V](
     @transient rdd: RDD[_ <: Product2[K, V]],
     val partitioner: Partitioner,
-    val serializerClass: String = null)
+    val serializer: Serializer = null)
--- End diff --

It's only 58 bytes, not 58KB, which is less than a 5% increase for serializing ShuffleDependency, which is part of any RDD task.
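
For reference, a back-of-the-envelope way to measure such overhead (a sketch using plain Java serialization; not necessarily how the 58-byte figure was obtained):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Size of an object's Java-serialized form, in bytes.
def serializedSize(obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.size()
}
```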

