[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/40#discussion_r10188555 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala --- @@ -149,7 +149,14 @@ object GradientDescent extends Logging { // Initialize weights as a column vector var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*) -var regVal = 0.0 + +/** + * For the first iteration, the regVal will be initialized as sum of sqrt of + * weights if it's L2 update; for L1 update; the same logic is followed. + */ +var regVal = updater.compute( --- End diff -- Just a nit style pick here since @mengxr asked me to chime in. It would be better if you just put `weights` and the rest on the same line, e.g.
```scala
var regVal = updater.compute(
  weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2
```
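The `._2` in the suggested one-liner picks out the regularization value: MLlib's `Updater.compute` returns a pair of (updated weights, regularization term), so calling it with a zero gradient and a zero step size leaves the weights untouched and simply evaluates the penalty on the initial weights. A minimal, simplified sketch of that idea (illustration only, not the actual MLlib Updater code):

```scala
import org.jblas.DoubleMatrix

// Simplified stand-in for an L2 Updater.compute (illustration only).
// Returns (new weights, regularization value).
def compute(weights: DoubleMatrix, gradient: DoubleMatrix,
            stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = {
  val newWeights = weights.sub(gradient.mul(stepSize))
  (newWeights, 0.5 * regParam * newWeights.dot(newWeights)) // L2 penalty on the new weights
}

// With a zero gradient and stepSize 0, the weights are unchanged and ._2 is
// just the penalty of the initial weights -- which is what seeds regVal
// before the first iteration.
val initialWeights = new DoubleMatrix(3, 1, 1.0, 2.0, 3.0)
val regVal = compute(initialWeights, new DoubleMatrix(3, 1), 0, 1, 0.1)._2
```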
[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/40#issuecomment-36413254 Jenkins, add to whitelist.
[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/40#issuecomment-36413377 It's not you. There were somehow two Jenkins pull request builders set up ... I just removed one of them.
[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/40#issuecomment-36413420 Jenkins, retest this please.
[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/40#issuecomment-36413838 It is running now. Let's wait for Jenkins to come back.
[GitHub] spark pull request: Remove remaining references to incubation
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/51#issuecomment-36448947 Jenkins, retest this please.
[GitHub] spark pull request: fix #SPARK-1149 Bad partitioners can cause Spa...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/44#issuecomment-36449005 Hi guys, I think it is better to make sure Spark doesn't hang when an incorrect partition index is given, because there are other code paths that can run a job. Given the two places @CodingCat found, I think it shouldn't be too hard to fix those. @witgo do you mind doing that instead? One thing is for sure: we shouldn't add a check per key -- that would be too expensive.
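For illustration, a sketch of the cheaper kind of check being suggested -- validate the partitioner once, up front, against a small sample of keys, rather than per key on the hot path. The helper name and the sampling approach are hypothetical, not code from the PR:

```scala
import org.apache.spark.Partitioner

// Hypothetical helper: validate a custom partitioner once instead of checking
// every key while writing shuffle output.
def validatePartitioner(p: Partitioner, sampleKeys: Seq[Any]): Unit = {
  require(p.numPartitions > 0, "numPartitions must be positive")
  for (key <- sampleKeys) {
    val pid = p.getPartition(key)
    require(pid >= 0 && pid < p.numPartitions,
      s"Partitioner returned index $pid for key $key, outside [0, ${p.numPartitions})")
  }
}
```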
[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/41#issuecomment-36449019 Thanks. I've merged this.
[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/41#issuecomment-36449064 Actually I'm having trouble merging this. I think it's because your git commit doesn't actually have any author information. Do you mind fixing that? cc @pwendell here since he wrote that merge script.
```
rxin @ rxin-mbp : /scratch/rxin/prs/spark-dev-tools/dev > ./merge_spark_pr.py
Which pull request would you like to merge? (e.g. 34): 41
=== Pull Request #41 ===
title   Update io.netty from 4.0.13 Final to 4.0.17.Final
source  ngbinh/master
target  master
url     https://api.github.com/repos/apache/spark/pulls/41
Proceed with merging pull request #41? (y/n): y
Switched to branch 'PR_TOOL_MERGE_PR_41_MASTER'
fatal: No existing author found with '""'
Traceback (most recent call last):
  File "./merge_spark_pr.py", line 197, in <module>
    merge_hash = merge_pr(pr_num, target_ref)
  File "./merge_spark_pr.py", line 114, in merge_pr
    run_cmd(['git', 'commit', '--author="%s"' % primary_author] + merge_message_flags)
  File "./merge_spark_pr.py", line 61, in run_cmd
    return subprocess.check_output(cmd)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['git', 'commit', '--author=""', '-m', u'Update io.netty from 4.0.13 Final to 4.0.17.Final', '-m', u'This update contains a lot of bug fixes and some new perf improvements.\r\nIt is also binary compatible with the current 4.0.13.Final\r\n\r\nFor more information: http://netty.io/news/2014/02/25/4-0-17-Final.html', '-m', 'Author: ', '-m', u'Closes #41 from ngbinh/master and squashes the following commits:', '-m', '']' returned non-zero exit status 128
```
[GitHub] spark pull request: Updated more links in documentation
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/23#issuecomment-36449072 #2 was actually merged. @jyotiska do you mind closing this?
[GitHub] spark pull request: SPARK-1137: Make ZK PersistenceEngine not cras...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4#issuecomment-36449083 Jenkins, retest this please.
[GitHub] spark pull request: Add Shortest-path computations to graphx.lib w...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10#issuecomment-36449099 @ankurdave
[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/40#issuecomment-36449165 Thanks. I've merged this.
[GitHub] spark pull request: Merge the old sbt-launch-lib.bash with the new...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/14#issuecomment-36449214 I've merged this. Thanks!
[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/41#issuecomment-36449234 Sorry @ngbinh, you misunderstood me. I think the problem is that the git commit metadata doesn't actually contain the author information. It could be that the email or the author info in the git commit itself is not set (it's not about the commit message).
[GitHub] spark pull request: Initialized the regVal for first iteration in ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/40#issuecomment-36449243 I think the ASF git bot will close this once the change is synced to GitHub. If it doesn't get closed tomorrow morning, please close this manually. Thanks!
[GitHub] spark pull request: Update io.netty from 4.0.13 Final to 4.0.17.Fi...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/41#issuecomment-36449397 Thanks. I've merged this!
[GitHub] spark pull request: SPARK-1137: Make ZK PersistenceEngine not cras...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4#issuecomment-36449554 Thanks. I've merged this.
[GitHub] spark pull request: Remove remaining references to incubation
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/51#issuecomment-36449552 Thanks @pwendell, I merged this.
[GitHub] spark pull request: Ignore RateLimitedOutputStreamSuite for now.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/54 Ignore RateLimitedOutputStreamSuite for now. This test has been flaky. We can re-enable it after @tdas has a chance to look at it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark ratelimit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/54.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #54 commit 1a121989986cecb8222de428c61b632ff53ebe30 Author: Reynold Xin Date: 2014-03-02T21:15:31Z Ignore RateLimitedOutputStreamSuite for now.
[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/55 SPARK-1158: Fix flaky RateLimitedOutputStreamSuite. There was actually a problem with the RateLimitedOutputStream implementation where the first second doesn't write anything because of integer rounding. So RateLimitedOutputStream was overly aggressive in throttling. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark ratelimitest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/55.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #55 commit 52ce1b7dbc4a05e189b0b45dfe2ba902241e8d4b Author: Reynold Xin Date: 2014-03-03T00:27:41Z SPARK-1158: Fix flaky RateLimitedOutputStreamSuite. There was actually a problem with the RateLimitedOutputStream implementation where the first second doesn't output anything because of integer rounding.
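A small sketch of the kind of integer-rounding pitfall being described (illustrative only, not the actual RateLimitedOutputStream code): if the elapsed milliseconds are integer-divided by 1000 before being multiplied by the rate, the allowed byte budget stays at zero for the whole first second, so the stream throttles far more aggressively than the configured rate.

```scala
// Illustration only -- not the actual Spark implementation.
val bytesPerSec = 100000L

// Integer division first: (999 / 1000) == 0, so nothing may be written
// during the entire first second.
def allowedBytesBuggy(elapsedMillis: Long): Long = bytesPerSec * (elapsedMillis / 1000)

// Multiply first, then divide: the budget grows smoothly from time zero.
def allowedBytesFixed(elapsedMillis: Long): Long = bytesPerSec * elapsedMillis / 1000

assert(allowedBytesBuggy(999) == 0L)
assert(allowedBytesFixed(999) == 99900L)
```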
[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/55#issuecomment-36473419 @tdas @pwendell
[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/55#issuecomment-36473499 Also I think we should reduce the time to run this 4 sec test, but that's for another PR ...
[GitHub] spark pull request: SPARK-1173. Improve scala streaming docs.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/64#issuecomment-36486791 Thanks Aaron. I've merged this.
[GitHub] spark pull request: [Proposal] SPARK-1171: simplify the implementa...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/63#issuecomment-36486838 Jenkins, add to whitelist.
[GitHub] spark pull request: SPARK-1173. Improve scala streaming docs.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/64#issuecomment-36486817 There's also a typo in the Java version of the doc. If you don't mind fixing that as well ... :)
[GitHub] spark pull request: SPARK-1158: Fix flaky RateLimitedOutputStreamS...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/55#issuecomment-36487061 Also @ryanlecompte since you changed the implementation to tail recursion.
[GitHub] spark pull request: SPARK-1173. Improve scala streaming docs.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/64#issuecomment-36487507 Actually you will need to submit another PR. I've already merged this one (but github is laggy because it is waiting for the asf git bot to synchronize). Sorry about the confusion!
[GitHub] spark pull request: SPARK-1173. (#2) Fix typo in Java streaming ex...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/65#issuecomment-36487753 Thanks for doing this. BTW one thing I noticed is that your git commit's email is different from the ones you registered on github, so your commits don't actually show up as yours. You might want to fix that.
[GitHub] spark pull request: SPARK-1173. (#2) Fix typo in Java streaming ex...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/65#issuecomment-36487798 I merged this one too.
[GitHub] spark pull request: Fixed API docs link in Python programming guid...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/67#issuecomment-36562029 Hi @jyotiska, these docs are not meant to be consumed directly as markdown files. They are meant to be built with jekyll (run `jekyll build` in the docs folder), which generates the documentation here: http://spark.incubator.apache.org/docs/latest/python-programming-guide.html Changing the extension to .md breaks all the links.
[GitHub] spark pull request: Removed accidentally checked in comment
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/61#issuecomment-36569896 I merged this. Thanks!
[GitHub] spark pull request: update proportion of memory
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/66#issuecomment-36570035 Thanks. I've merged this.
[GitHub] spark pull request: SPARK-1164 Deprecated reduceByKeyToDriver as i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/72#issuecomment-36656843 Thanks. Merged.
[GitHub] spark pull request: SPARK-1178: missing document of spark.schedule...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/74#issuecomment-36656962 Thanks. Merged.
[GitHub] spark pull request: SPARK-782. Shade ASM
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/90#discussion_r10332129 --- Diff: graphx/pom.xml --- @@ -70,6 +70,10 @@ scalacheck_${scala.binary.version} test + --- End diff -- Yes.
[GitHub] spark pull request: GRAPH-1: Map side distinct in collect vertex i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/21#issuecomment-36949572 Actually - no ...
[GitHub] spark pull request: GRAPH-1: Map side distinct in collect vertex i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/21#issuecomment-36949585 We should use the primitive hashmap - otherwise it is pretty slow.
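As a hedged sketch of what "use the primitive hashmap" might look like in this code path -- assuming the `private[spark]` specialized collections such as `OpenHashSet` are visible from GraphX -- the idea is to de-duplicate `Long` vertex ids per partition without boxing each id, which is what makes a plain `scala.collection.mutable.HashSet[Long]` slow:

```scala
import org.apache.spark.util.collection.OpenHashSet

// Assumption: this runs in a package that can see the private[spark] collection.
// De-duplicate Long vertex ids within a partition without boxing each id.
def distinctVertexIds(ids: Iterator[Long]): Iterator[Long] = {
  val set = new OpenHashSet[Long]
  ids.foreach(set.add)
  set.iterator
}
```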
[GitHub] spark pull request: Allow sbt to use more than 1G of heap.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/103 Allow sbt to use more than 1G of heap. There was a mistake in the sbt build file (introduced by 012bd5fbc97dc40bb61e0e2b9cc97ed0083f37f6) in which we set the default to 2048 and then immediately reset it to 1024. Without this, building Spark can run out of permgen space on my machine. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark sbt Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/103.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #103 commit 8829c34517f4959e6899125747ac6160bd2c46a6 Author: Reynold Xin Date: 2014-03-08T06:46:11Z Allow sbt to use more than 1G of heap.
[GitHub] spark pull request: Allow sbt to use more than 1G of heap.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/103#issuecomment-37091459 Ok I merged this. Thanks.
[GitHub] spark pull request: Update junitxml plugin to the latest version t...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/104 Update junitxml plugin to the latest version to avoid recompilation in every SBT command. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark junitxml Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/104.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #104 commit 67ef7bffd92a30b8d81c072ad1c504eb3a53d264 Author: Reynold Xin Date: 2014-03-08T09:41:06Z Update junitxml plugin to the latest version to avoid recompilation in every SBT command.
[GitHub] spark pull request: Update junitxml plugin to the latest version t...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/104#issuecomment-37109170 Ok I merged this. Not sure about Maven off the top of my head. All these build plugins are pretty arcane to me.
[GitHub] spark pull request: Upgrade Jetty to 9.1.3.v20140225.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/113 Upgrade Jetty to 9.1.3.v20140225. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark jetty9 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #113 commit 16ab3f3551d5181e185c503b954774e826c31fe0 Author: Reynold Xin Date: 2014-03-10T04:32:50Z Upgrade Jetty to 9.1.3.v20140225.
[GitHub] spark pull request: maintain arbitrary state data for each key
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/114#issuecomment-37154805 Thanks. Merged.
[GitHub] spark pull request: Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/113#discussion_r10420573 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -120,26 +120,25 @@ private[spark] object JettyUtils extends Logging { private def addFilters(handlers: Seq[ServletContextHandler], conf: SparkConf) { val filters: Array[String] = conf.get("spark.ui.filters", "").split(',').map(_.trim()) -filters.foreach { - case filter : String => -if (!filter.isEmpty) { - logInfo("Adding filter: " + filter) - val holder : FilterHolder = new FilterHolder() - holder.setClassName(filter) - // get any parameters for each filter - val paramName = "spark." + filter + ".params" - val params = conf.get(paramName, "").split(',').map(_.trim()).toSet - params.foreach { -case param : String => - if (!param.isEmpty) { -val parts = param.split("=") -if (parts.length == 2) holder.setInitParameter(parts(0), parts(1)) - } - } - val enumDispatcher = java.util.EnumSet.of(DispatcherType.ASYNC, DispatcherType.ERROR, -DispatcherType.FORWARD, DispatcherType.INCLUDE, DispatcherType.REQUEST) - handlers.foreach { case(handler) => handler.addFilter(holder, "/*", enumDispatcher) } +filters.foreach { filter => --- End diff -- Here it's only a coding-style change for the block.
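For reference, a tiny self-contained example of the style change in question (the filter names are made up): matching `case filter: String =>` is redundant when the collection's element type is already `String`, so a plain closure does the same thing with less noise.

```scala
val filters = Seq("org.example.AuthFilter", "", "org.example.LogFilter")

// Before: pattern matching on an element that is already a String.
filters.foreach { case filter: String =>
  if (!filter.isEmpty) println("Adding filter: " + filter)
}

// After: a plain closure, same behavior.
filters.foreach { filter =>
  if (!filter.isEmpty) println("Adding filter: " + filter)
}
```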
[GitHub] spark pull request: Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37158460 Jenkins, retest this please.
[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37212561 Actually I was waiting for the pull request builder to come back before asking you to verify it. @tgravescs do you mind verifying this works for you? I haven't updated the Maven files yet. Let me know if you only build with Maven.
[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37213133 The main reason is that we are investigating upgrading some of the major dependencies prior to 1.0, after which we won't be able to upgrade for a while. Some users have requested newer versions of Jetty that conflict with the very old version we ship.
[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37230554 Yea, unfortunately that's been the case for the past few days. We should probably have another external repo to host the artifacts that are only on the Cloudera repo, to have some redundancy.
[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37249125 Ok I think the Cloudera repo is up. @tgravescs it would be great if you can try this.
[GitHub] spark pull request: WIP - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37377281 That sounds good. @pwendell should make the call here ...
[GitHub] spark pull request: SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37482277 Ok I pushed a new version with Maven build changes as well. This is ready to be merged from my perspective.
[GitHub] spark pull request: SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-37483693 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/126#discussion_r10552638 --- Diff: core/src/main/scala/org/apache/spark/util/TimeStampedWeakValueHashMap.scala --- @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import scala.collection.{JavaConversions, immutable} + +import java.util +import java.lang.ref.WeakReference +import java.util.concurrent.ConcurrentHashMap + +import org.apache.spark.Logging + +private[util] case class TimeStampedWeakValue[T](timestamp: Long, weakValue: WeakReference[T]) { + def this(timestamp: Long, value: T) = this(timestamp, new WeakReference[T](value)) +} + +/** + * A map that stores the timestamp of when a key was inserted along with the value, + * while ensuring that the values are weakly referenced. If the value is garbage collected and + * the weak reference is null, get() operation returns the key be non-existent. However, + * the key is actually not removed in the current implementation. Key-value pairs whose --- End diff -- I haven't looked at the context, but can't you just store the key and the value together?
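A hedged sketch of the "store the key and the value together" suggestion (class names are hypothetical, and this is not the implementation Spark ended up with): keeping the key inside the `WeakReference` lets a `ReferenceQueue` report exactly which entries to drop once their values are garbage collected, so stale keys don't accumulate.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch -- illustrates the suggestion, not Spark's implementation.
class KeyedWeakRef[K, V](val key: K, value: V, queue: ReferenceQueue[V])
  extends WeakReference[V](value, queue)

class WeakValueMap[K, V] {
  private val map = new ConcurrentHashMap[K, KeyedWeakRef[K, V]]()
  private val refQueue = new ReferenceQueue[V]

  def put(key: K, value: V): Unit = {
    drainCollectedEntries()
    map.put(key, new KeyedWeakRef(key, value, refQueue))
  }

  def get(key: K): Option[V] = {
    drainCollectedEntries()
    Option(map.get(key)).flatMap(ref => Option(ref.get()))
  }

  // Because each reference carries its key, collected values tell us exactly
  // which map entries to remove.
  private def drainCollectedEntries(): Unit = {
    var ref = refQueue.poll()
    while (ref != null) {
      map.remove(ref.asInstanceOf[KeyedWeakRef[K, V]].key)
      ref = refQueue.poll()
    }
  }
}
```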
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552654 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala --- @@ -35,9 +35,9 @@ private[storage] object BlockManagerMessages { case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave - // + // // Messages from slaves to the master. - // + // sealed trait ToBlockManagerMaster --- End diff -- It seems like it's better for us to either fix the style checker, or replace the // with something else. Using a hint to turn the checker off for this comment block is pretty weird.
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552655 --- Diff: core/src/test/scala/org/apache/spark/CheckpointSuite.scala --- @@ -432,7 +432,7 @@ object CheckpointSuite { // This is a custom cogroup function that does not use mapValues like // the PairRDDFunctions.cogroup() def cogroup[K, V](first: RDD[(K, V)], second: RDD[(K, V)], part: Partitioner) = { -//println("First = " + first + ", second = " + second) +// println("First = " + first + ", second = " + second) --- End diff -- Just remove this line
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552666 --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkALS.scala --- @@ -54,7 +54,7 @@ object SparkALS { for (i <- 0 until M; j <- 0 until U) { r.set(i, j, blas.ddot(ms(i), us(j))) } -//println("R: " + r) +// println("R: " + r) --- End diff -- remove this line also
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552662 --- Diff: examples/src/main/scala/org/apache/spark/examples/LocalALS.scala --- @@ -53,7 +53,7 @@ object LocalALS { for (i <- 0 until M; j <- 0 until U) { r.set(i, j, blas.ddot(ms(i), us(j))) } -//println("R: " + r) +// println("R: " + r) --- End diff -- remove this line
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552671 --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkHdfsLR.scala --- @@ -34,8 +34,8 @@ object SparkHdfsLR { case class DataPoint(x: Vector, y: Double) def parsePoint(line: String): DataPoint = { -//val nums = line.split(' ').map(_.toDouble) -//return DataPoint(new Vector(nums.slice(1, D+1)), nums(0)) +// val nums = line.split(' ').map(_.toDouble) --- End diff -- remove these 2 lines
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552685 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/NetworkInputDStream.scala --- @@ -128,7 +128,7 @@ abstract class NetworkReceiver[T: ClassTag]() extends Serializable with Logging } catch { case ie: InterruptedException => logInfo("Receiving thread interrupted") -//println("Receiving thread interrupted") +// println("Receiving thread interrupted") --- End diff -- remove this line
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552681 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/Serializers.scala --- @@ -298,7 +298,7 @@ abstract class ShuffleSerializationStream(s: OutputStream) extends Serialization s.write(v.toInt) } - //def writeDouble(v: Double): Unit = writeUnsignedVarLong(java.lang.Double.doubleToLongBits(v)) + // def writeDouble(v: Double): Unit = writeUnsignedVarLong(java.lang.Double.doubleToLongBits(v)) --- End diff -- remove this line
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552680 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/impl/Serializers.scala --- @@ -391,7 +391,7 @@ abstract class ShuffleDeserializationStream(s: InputStream) extends Deserializat (s.read() & 0xFF) } - //def readDouble(): Double = java.lang.Double.longBitsToDouble(readUnsignedVarLong()) + // def readDouble(): Double = java.lang.Double.longBitsToDouble(readUnsignedVarLong()) --- End diff -- remove this line
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552712 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala --- @@ -137,7 +137,7 @@ trait ClientBase extends Logging { } else if (srcHost != null && dstHost == null) { return false } -//check for ports +// check for ports --- End diff -- This comment is pretty much useless; let's remove it.
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552696 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/StateDStream.scala --- @@ -64,7 +64,7 @@ class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag]( } val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner) val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning) -//logDebug("Generating state RDD for time " + validTime) +// logDebug("Generating state RDD for time " + validTime) --- End diff -- remove this line
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552705 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala --- @@ -152,7 +152,7 @@ class InputStreamsSuite extends TestSuiteBase with BeforeAndAfter { // Set up the streaming context and input streams val ssc = new StreamingContext(conf, batchDuration) val networkStream = ssc.actorStream[String](Props(new TestActor(port)), "TestActor", - StorageLevel.MEMORY_AND_DISK) //Had to pass the local value of port to prevent from closing over entire scope + StorageLevel.MEMORY_AND_DISK) // Had to pass the local value of port to prevent from closing over entire scope --- End diff -- This is going to be longer than 100 chars.
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552723 --- Diff: core/src/main/scala/org/apache/spark/network/Connection.scala --- @@ -206,12 +206,12 @@ class SendingConnection(val address: InetSocketAddress, selector_ : Selector, private class Outbox { val messages = new Queue[Message]() -val defaultChunkSize = 65536 //32768 //16384 +val defaultChunkSize = 65536 // 32768 // 16384 --- End diff -- remove the comment
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552716 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -278,7 +278,7 @@ private[spark] class Executor( // have left some weird state around depending on when the exception was thrown, but on // the other hand, maybe we could detect that when future tasks fail and exit then. logError("Exception in task ID " + taskId, t) - //System.exit(1) + // System.exit(1) --- End diff -- remove this line
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/126#discussion_r10552834 --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer} + +import java.util.concurrent.{LinkedBlockingQueue, TimeUnit} + +/** Listener class used for testing when any item has been cleaned by the Cleaner class */ +private[spark] trait CleanerListener { + def rddCleaned(rddId: Int) + def shuffleCleaned(shuffleId: Int) +} + +/** + * Cleans RDDs and shuffle data. + */ +private[spark] class ContextCleaner(sc: SparkContext) extends Logging { + + /** Classes to represent cleaning tasks */ + private sealed trait CleaningTask + private case class CleanRDD(rddId: Int) extends CleaningTask + private case class CleanShuffle(shuffleId: Int) extends CleaningTask + // TODO: add CleanBroadcast + + private val queue = new LinkedBlockingQueue[CleaningTask] + + protected val listeners = new ArrayBuffer[CleanerListener] +with SynchronizedBuffer[CleanerListener] + + private val cleaningThread = new Thread() { override def run() { keepCleaning() }} + + @volatile private var stopped = false + + /** Start the cleaner */ + def start() { +cleaningThread.setDaemon(true) +cleaningThread.start() + } + + /** Stop the cleaner */ + def stop() { +stopped = true +cleaningThread.interrupt() + } + + /** + * Clean (unpersist) RDD data. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + def cleanRDD(rddId: Int) { +enqueue(CleanRDD(rddId)) +logDebug("Enqueued RDD " + rddId + " for cleaning up") + } + + /** + * Clean shuffle data. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + def cleanShuffle(shuffleId: Int) { +enqueue(CleanShuffle(shuffleId)) +logDebug("Enqueued shuffle " + shuffleId + " for cleaning up") + } + + /** Attach a listener object to get information of when objects are cleaned. */ + def attachListener(listener: CleanerListener) { +listeners += listener + } + + /** + * Enqueue a cleaning task. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + private def enqueue(task: CleaningTask) { +queue.put(task) + } + + /** Keep cleaning RDDs and shuffle data */ + private def keepCleaning() { +try { + while (!isStopped) { +val taskOpt = Option(queue.poll(100, TimeUnit.MILLISECONDS)) +taskOpt.foreach(task => { --- End diff -- a nit style thing - the spark style usually do ```scala taskOpt.foreach { task => ... 
}
```
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/126#discussion_r10552839 --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer} + +import java.util.concurrent.{LinkedBlockingQueue, TimeUnit} + +/** Listener class used for testing when any item has been cleaned by the Cleaner class */ +private[spark] trait CleanerListener { + def rddCleaned(rddId: Int) + def shuffleCleaned(shuffleId: Int) +} + +/** + * Cleans RDDs and shuffle data. + */ +private[spark] class ContextCleaner(sc: SparkContext) extends Logging { + + /** Classes to represent cleaning tasks */ + private sealed trait CleaningTask + private case class CleanRDD(rddId: Int) extends CleaningTask + private case class CleanShuffle(shuffleId: Int) extends CleaningTask + // TODO: add CleanBroadcast + + private val queue = new LinkedBlockingQueue[CleaningTask] + + protected val listeners = new ArrayBuffer[CleanerListener] +with SynchronizedBuffer[CleanerListener] + + private val cleaningThread = new Thread() { override def run() { keepCleaning() }} + + @volatile private var stopped = false + + /** Start the cleaner */ + def start() { +cleaningThread.setDaemon(true) +cleaningThread.start() + } + + /** Stop the cleaner */ + def stop() { +stopped = true +cleaningThread.interrupt() + } + + /** + * Clean (unpersist) RDD data. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + def cleanRDD(rddId: Int) { +enqueue(CleanRDD(rddId)) +logDebug("Enqueued RDD " + rddId + " for cleaning up") + } + + /** + * Clean shuffle data. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + def cleanShuffle(shuffleId: Int) { +enqueue(CleanShuffle(shuffleId)) +logDebug("Enqueued shuffle " + shuffleId + " for cleaning up") + } + + /** Attach a listener object to get information of when objects are cleaned. */ + def attachListener(listener: CleanerListener) { +listeners += listener + } + + /** + * Enqueue a cleaning task. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. 
+ */ + private def enqueue(task: CleaningTask) { +queue.put(task) + } + + /** Keep cleaning RDDs and shuffle data */ + private def keepCleaning() { +try { + while (!isStopped) { +val taskOpt = Option(queue.poll(100, TimeUnit.MILLISECONDS)) +taskOpt.foreach(task => { + logDebug("Got cleaning task " + taskOpt.get) + task match { +case CleanRDD(rddId) => doCleanRDD(sc, rddId) +case CleanShuffle(shuffleId) => doCleanShuffle(shuffleId) + } +}) + } +} catch { + case ie: InterruptedException => +if (!isStopped) logWarning("Cleaning thread interrupted") --- End diff -- wrap the body in curly braces --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
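For readers following along, here is a minimal, self-contained sketch of the brace style being requested; the identifiers mirror the quoted ContextCleaner diff, but this is not the actual patch:

```scala
// Sketch only: `stopped` and `logWarning` stand in for the corresponding
// members of ContextCleaner in the quoted diff.
object BraceStyleSketch {
  @volatile private var stopped = false
  private def isStopped: Boolean = stopped
  private def logWarning(msg: String): Unit = Console.err.println(s"WARN: $msg")

  def keepCleaning(): Unit = {
    try {
      while (!isStopped) {
        Thread.sleep(100) // placeholder for polling the cleaning queue
      }
    } catch {
      case ie: InterruptedException =>
        // body wrapped in curly braces, as the review asks
        if (!isStopped) {
          logWarning("Cleaning thread interrupted")
        }
    }
  }
}
```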
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/126#discussion_r10552846 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -20,15 +20,15 @@ package org.apache.spark import java.io._ import java.util.zip.{GZIPInputStream, GZIPOutputStream} -import scala.collection.mutable.HashSet +import scala.Some --- End diff -- remove the scala.Some import --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10552859 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala --- @@ -35,9 +35,9 @@ private[storage] object BlockManagerMessages { case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave - // + // // Messages from slaves to the master. - // + // sealed trait ToBlockManagerMaster --- End diff -- It looks really strange to add a space there. If you do ```scala /* --- ``` that's probably slightly better ... or can we modify the rule to not report an exception if it is used like this? (i.e. / after //) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
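Spelled out, the block-comment alternative sketched above might look roughly like this (purely illustrative, not part of the patch; the dashes are decorative):

```scala
/* ----------------------------------------- *
 * Messages from slaves to the master.       *
 * ----------------------------------------- */
sealed trait ToBlockManagerMaster
```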
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/126#discussion_r10552941 --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import scala.collection.mutable.{ArrayBuffer, SynchronizedBuffer} + +import java.util.concurrent.{LinkedBlockingQueue, TimeUnit} + +/** Listener class used for testing when any item has been cleaned by the Cleaner class */ +private[spark] trait CleanerListener { + def rddCleaned(rddId: Int) + def shuffleCleaned(shuffleId: Int) +} + +/** + * Cleans RDDs and shuffle data. + */ +private[spark] class ContextCleaner(sc: SparkContext) extends Logging { + + /** Classes to represent cleaning tasks */ + private sealed trait CleaningTask + private case class CleanRDD(rddId: Int) extends CleaningTask + private case class CleanShuffle(shuffleId: Int) extends CleaningTask + // TODO: add CleanBroadcast + + private val queue = new LinkedBlockingQueue[CleaningTask] + + protected val listeners = new ArrayBuffer[CleanerListener] +with SynchronizedBuffer[CleanerListener] + + private val cleaningThread = new Thread() { override def run() { keepCleaning() }} + + @volatile private var stopped = false + + /** Start the cleaner */ + def start() { +cleaningThread.setDaemon(true) +cleaningThread.start() + } + + /** Stop the cleaner */ + def stop() { +stopped = true +cleaningThread.interrupt() + } + + /** + * Clean (unpersist) RDD data. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + def cleanRDD(rddId: Int) { +enqueue(CleanRDD(rddId)) +logDebug("Enqueued RDD " + rddId + " for cleaning up") + } + + /** + * Clean shuffle data. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. + */ + def cleanShuffle(shuffleId: Int) { +enqueue(CleanShuffle(shuffleId)) +logDebug("Enqueued shuffle " + shuffleId + " for cleaning up") + } + + /** Attach a listener object to get information of when objects are cleaned. */ + def attachListener(listener: CleanerListener) { +listeners += listener + } + + /** + * Enqueue a cleaning task. Do not perform any time or resource intensive + * computation in this function as this is called from a finalize() function. 
+ */ + private def enqueue(task: CleaningTask) { +queue.put(task) + } + + /** Keep cleaning RDDs and shuffle data */ + private def keepCleaning() { +try { + while (!isStopped) { +val taskOpt = Option(queue.poll(100, TimeUnit.MILLISECONDS)) +taskOpt.foreach(task => { + logDebug("Got cleaning task " + taskOpt.get) + task match { +case CleanRDD(rddId) => doCleanRDD(sc, rddId) +case CleanShuffle(shuffleId) => doCleanShuffle(shuffleId) + } +}) + } +} catch { + case ie: InterruptedException => +if (!isStopped) logWarning("Cleaning thread interrupted") --- End diff -- latter --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/126#discussion_r10552981 --- Diff: core/src/main/scala/org/apache/spark/util/BoundedHashMap.scala --- @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import scala.collection.mutable.{ArrayBuffer, SynchronizedMap} + +import java.util.{Collections, LinkedHashMap} +import java.util.Map.{Entry => JMapEntry} +import scala.reflect.ClassTag + +/** + * A map that upper bounds the number of key-value pairs present in it. It can be configured to + * drop the least recently user pair or the earliest inserted pair. It exposes a + * scala.collection.mutable.Map interface to allow it to be a drop-in replacement for Scala + * HashMaps. + * + * Internally, a Java LinkedHashMap is used to get insert-order or access-order behavior. + * Note that the LinkedHashMap is not thread-safe and hence, it is wrapped in a + * Collections.synchronizedMap. However, getting the Java HashMap's iterator and + * using it can still lead to ConcurrentModificationExceptions. Hence, the iterator() + * function is overridden to copy the all pairs into an ArrayBuffer and then return the + * iterator to the ArrayBuffer. Also, the class apply the trait SynchronizedMap which + * ensures that all calls to the Scala Map API are synchronized. This together ensures + * that ConcurrentModificationException is never thrown. + * + * @param bound max number of key-value pairs + * @param useLRU true = least recently used/accessed will be dropped when bound is reached, + *false = earliest inserted will be dropped + */ +private[spark] class BoundedHashMap[A, B](bound: Int, useLRU: Boolean) + extends WrappedJavaHashMap[A, B, A, B] with SynchronizedMap[A, B] { + + protected[util] val internalJavaMap = Collections.synchronizedMap(new LinkedHashMap[A, B]( +bound / 8, (0.75).toFloat, useLRU) { +override protected def removeEldestEntry(eldest: JMapEntry[A, B]): Boolean = { + size() > bound +} + }) + + protected[util] def newInstance[K1, V1](): WrappedJavaHashMap[K1, V1, _, _] = { --- End diff -- what does protected[util] mean in Scala? You can access it in all child classes of all classes in util? Pretty confusing here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
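For context on the question: `protected[qualifier]` is Scala's package-qualified protection, i.e. ordinary protected access plus visibility anywhere inside the named enclosing package. A small sketch with made-up package names:

```scala
package demo.util {
  class Base {
    // protected[util]: callable anywhere inside package `util`, and from
    // subclasses of Base even outside the package (normal protected access).
    protected[util] def helper(): Int = 42
  }

  class SamePackageUser {
    def use(b: Base): Int = b.helper()   // OK: caller lives in package `util`
  }
}

package demo.other {
  class Sub extends demo.util.Base {
    def use(): Int = helper()            // OK: subclass access still applies
  }
  // A non-subclass in this package could not call helper() on a Base instance.
}
```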
[GitHub] spark pull request: SPARK-1096, a space after comment start style ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/124#discussion_r10553024 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala --- @@ -35,9 +35,9 @@ private[storage] object BlockManagerMessages { case class RemoveRdd(rddId: Int) extends ToBlockManagerSlave - // + // // Messages from slaves to the master. - // + // sealed trait ToBlockManagerMaster --- End diff -- I will let @pwendell decide that ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/126#issuecomment-37501259 @tdas I haven't finished looking at this (will probably spend more time after Fri) - but WrappedJavaHashMap is fairly complicated, and it seems like a recipe for complexity and potential performance disaster (due to clone & lots of locks). How are its implementations used? Do you actually traverse an entire hashmap to find values? I am not sure if you want to do that in a cleanup. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1103] [WIP] Automatic garbage collectio...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/126#issuecomment-37501863 If you don't need high performance, why not just put a normal immutable hashmap so you don't have to worry about concurrency? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
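One way to read the "just use an immutable map" suggestion, sketched with hypothetical names (not code from the PR): readers always see a consistent snapshot, and writers swap in a new map atomically, so no fine-grained locking is needed.

```scala
import java.util.concurrent.atomic.AtomicReference

// Illustrative only: an immutable Map behind an AtomicReference.
class ImmutableMapHolder[K, V] {
  private val ref = new AtomicReference(Map.empty[K, V])

  def get(k: K): Option[V] = ref.get.get(k)

  def put(k: K, v: V): Unit = {
    var done = false
    while (!done) {
      val old = ref.get
      done = ref.compareAndSet(old, old + (k -> v))  // retry on contention
    }
  }
}
```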
[GitHub] spark pull request: [SPARK-1237, 1238] Improve the computation of ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/131#issuecomment-37507055 Thanks. I've merged this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1240: handle the case of empty RDD when ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/135#discussion_r10579726 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -310,6 +310,9 @@ abstract class RDD[T: ClassTag]( * Return a sampled subset of this RDD. */ def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T] = { +if (fraction < Double.MinValue || fraction > Double.MaxValue) { --- End diff -- Use require. i.e. ```scala require(fraction > Double.MinValue && fraction < Double.MaxValue, "...") ``` Shouldn't you just check for fraction > 0 but < 1? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
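If the tighter bound hinted at in the last sentence is the intended check, it might look roughly like this (a sketch, not the final patch):

```scala
object SampleCheckSketch {
  // Whether the bounds should be inclusive, and how withReplacement affects
  // them, is left to the patch author; this only illustrates require().
  def checkFraction(fraction: Double): Unit = {
    require(fraction >= 0.0 && fraction <= 1.0,
      s"Sampling fraction must be in [0, 1], got $fraction")
  }
}
```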
[GitHub] spark pull request: SPARK-1240: handle the case of empty RDD when ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/135#discussion_r10581086 --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala --- @@ -457,6 +457,10 @@ class RDDSuite extends FunSuite with SharedSparkContext { test("takeSample") { val data = sc.parallelize(1 to 100, 2) +val emptySet = data.filter(_ => false) --- End diff -- yup do data.mapPartitions { iter => Iterator.empty } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
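As a sketch of what that suggestion looks like in the test, assuming the SparkContext `sc` from the quoted RDDSuite:

```scala
// Build an empty RDD without relying on filter(), per the review suggestion.
val data = sc.parallelize(1 to 100, 2)
val emptySet = data.mapPartitions { iter => Iterator.empty }
assert(emptySet.count() == 0)
```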
[GitHub] spark pull request: MLI-1 Decision Trees
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/79#issuecomment-37607046 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Fix serialization of MutablePair. Also provide...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/141#issuecomment-37681568 Merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-37691023 I was thinking maybe we want a config option for this - which is on by default, but can be turned off. What do you guys think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
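For illustration only, the kind of switch being proposed might read from SparkConf like this; the config key name and the `sc` context are assumptions, not anything agreed in the thread:

```scala
// Hypothetical config key; on by default, can be turned off.
val serializeClosuresEagerly =
  sc.getConf.getBoolean("spark.closures.serializeEagerly", true)

if (serializeClosuresEagerly) {
  // proactively serialize the closure so non-serializable captures fail fast
}
```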
[GitHub] spark pull request: Fix serialization of MutablePair. Also provide...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/141#issuecomment-37717648 I did: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=e19044cb1048c3755d1ea2cb43879d2225d49b54 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/149 SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle. This is more general than simply passing a string name and leaves more room for performance optimizations. Note that this is technically an API breaking change - but I suspect nobody else in this world has used this API other than me in GraphX and Shark. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark serializer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/149.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #149 commit 7420185373ca90950af0f24831ad3ef10c097acc Author: Reynold Xin Date: 2014-03-15T08:09:15Z Allow user to pass Serializer object instead of class name for shuffle. This is more general than simply passing a string name and leaves more room for performance optimizations. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
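Based on the patch description (and the Dependency.scala diff quoted further down), the call shape changes roughly as sketched below; `sc`, `pairRdd`, and `part` are placeholders, not code from the PR.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency}

// Assumes a live SparkContext `sc`; everything else here is a placeholder.
val pairRdd = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
val part = new HashPartitioner(2)

// Before this change the third argument was a class-name string, e.g.
// "org.apache.spark.serializer.KryoSerializer". After it, a Serializer
// instance is passed directly; null falls back to the spark.serializer default.
val dep = new ShuffleDependency[Int, String](pairRdd, part, serializer = null)
```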
[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/149#issuecomment-37720234 @marmbrus this is for you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/149#discussion_r10636356 --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala --- @@ -43,12 +44,13 @@ abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) { * Represents a dependency on the output of a shuffle stage. * @param rdd the parent RDD * @param partitioner partitioner used to partition the shuffle output - * @param serializerClass class name of the serializer to use + * @param serializer [[Serializer]] to use. If set to null, the default serializer, as specified + * by `spark.serializer` config option, will be used. */ class ShuffleDependency[K, V]( @transient rdd: RDD[_ <: Product2[K, V]], val partitioner: Partitioner, -val serializerClass: String = null) +val serializer: Serializer = null) --- End diff -- Compared with passing string, this adds about 58B to the task closure. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/149#discussion_r10636602 --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala --- @@ -43,12 +44,13 @@ abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) { * Represents a dependency on the output of a shuffle stage. * @param rdd the parent RDD * @param partitioner partitioner used to partition the shuffle output - * @param serializerClass class name of the serializer to use + * @param serializer [[Serializer]] to use. If set to null, the default serializer, as specified + * by `spark.serializer` config option, will be used. */ class ShuffleDependency[K, V]( @transient rdd: RDD[_ <: Product2[K, V]], val partitioner: Partitioner, -val serializerClass: String = null) +val serializer: Serializer = null) --- End diff -- No it wasn't. SparkConf was not serializable. On Saturday, March 15, 2014, Mridul Muralidharan wrote: > In core/src/main/scala/org/apache/spark/Dependency.scala: > > > */ > > class ShuffleDependency[K, V]( > > @transient rdd: RDD[_ <: Product2[K, V]], > > val partitioner: Partitioner, > > -val serializerClass: String = null) > > +val serializer: Serializer = null) > > Is it 'cos the SparkConf is getting shipped around ? > If yes, using the broadcast variable should alleviate it ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1255: Allow user to pass Serializer obje...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/149#discussion_r10637576 --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala --- @@ -43,12 +44,13 @@ abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd) { * Represents a dependency on the output of a shuffle stage. * @param rdd the parent RDD * @param partitioner partitioner used to partition the shuffle output - * @param serializerClass class name of the serializer to use + * @param serializer [[Serializer]] to use. If set to null, the default serializer, as specified + * by `spark.serializer` config option, will be used. */ class ShuffleDependency[K, V]( @transient rdd: RDD[_ <: Product2[K, V]], val partitioner: Partitioner, -val serializerClass: String = null) +val serializer: Serializer = null) --- End diff -- It's only 58 bytes, not 58KB, which is less than 5% increase for serializing ShuffleDependency, which is part of any RDD task. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
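To make the "null means use the spark.serializer default" contract concrete, a fallback along these lines is presumably what the consuming side does; the helper name is made up for illustration and is not from the PR:

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.Serializer

object SerializerFallbackSketch {
  // If no Serializer was passed to the ShuffleDependency, fall back to the
  // default one configured via spark.serializer (held by SparkEnv).
  def resolve(explicit: Serializer): Serializer =
    if (explicit != null) explicit else SparkEnv.get.serializer
}
```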