Re: NullWritable not serializable

2014-09-12 Thread Matei Zaharia
Hi Du, I don't think NullWritable has ever been serializable, so you must be doing something differently from your previous program. In this case though, just use a map() to turn your Writables into serializable types (e.g. null and String). Matei On September 12, 2014 at 8:48:36 PM, Du Li (l...
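
A minimal sketch of the map() suggestion above, assuming a local SparkContext and a SequenceFile of (NullWritable, Text) records. The object name, the helper toStrings, and the temporary path are hypothetical and chosen only for illustration; they do not come from the thread.

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits needed on Spark 1.x
import org.apache.spark.rdd.RDD

object WritableMapSketch {
  // Hypothetical helper: turn each (NullWritable, Text) record into a plain
  // String so that collect() never has to serialize a Hadoop Writable.
  def toStrings(records: RDD[(NullWritable, Text)]): RDD[String] =
    records.map { case (_, value) => value.toString }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("writable-map-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed scratch location, used only to make the sketch self-contained.
      val path = java.nio.file.Files
        .createTempDirectory("writable-sketch").resolve("seq").toString

      // Write a small SequenceFile with NullWritable keys and Text values.
      sc.parallelize(Seq("a", "b", "c"))
        .map(s => (NullWritable.get(), new Text(s)))
        .saveAsSequenceFile(path)

      // Read it back as Writables, then convert before collecting.
      val raw = sc.sequenceFile(path, classOf[NullWritable], classOf[Text])
      toStrings(raw).collect().sorted.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Calling collect() directly on the raw RDD is what would hit the NotSerializableException reported in the thread; converting inside map() first keeps the Writables on the executors.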

NullWritable not serializable

2014-09-12 Thread Du Li
Hi, I was trying the following on spark-shell (built with apache master and hadoop 2.4.0). Both calling rdd2.collect and calling rdd3.collect threw java.io.NotSerializableException: org.apache.hadoop.io.NullWritable. I got the same problem in similar code of my app which uses the newly released

Response to archived question 'Spark and Scala Worksheet'

2014-09-12 Thread Rajiv Abraham
Hi, This is a response to an archived email about how to run Spark in a Scala worksheet in the Scala IDE. http://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3ccaauywg8a+mjqwhtgytz0lumntlgfwa-noxtopeyadeq+gws...@mail.gmail.com%3E I know it's a bit late :) but here is how I do it. ht

Re: don't trigger tests when only .md files are changed

2014-09-12 Thread Reynold Xin
I like that idea, but the load on Jenkins isn't very high. The more complexity we add to the test script, the easier it is to screw it up (at some point we would need to add unit tests for the build scripts). Maybe we can just add the message part, so it becomes clear that a pull request does not

Re: Adding abstraction in MLlib

2014-09-12 Thread Patrick Wendell
We typically post design docs on JIRA before major work starts. For instance, pretty sure SPARK-1856 will have a design doc posted shortly. On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson wrote: > > Are interface designs being captured anywhere as documents that the community > can follow alo

Re: don't trigger tests when only .md files are changed

2014-09-12 Thread Nicholas Chammas
We could still have Jenkins post a message to the effect of “this patch only modifies .md files; no tests will be run”. On Fri, Sep 12, 2014 at 3:48 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Would it make sense to have Jenkins *not* trigger tests when the only > files that hav

don't trigger tests when only .md files are changed

2014-09-12 Thread Nicholas Chammas
Would it make sense to have Jenkins *not* trigger tests when the only files that have changed are .md files (example)? Those don’t even need RAT checks, right? I can make this change if it makes sense. Nick

Re: Adding abstraction in MLlib

2014-09-12 Thread Erik Erlandson
Are interface designs being captured anywhere as documents that the community can follow along with as the proposals evolve? I've worked on other open source projects where design docs were published as "living documents" (e.g. on google docs, or etherpad, but the particular mechanism isn't cr

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Michael Armbrust
Yeah, thanks for implementing it! Since Spark SQL is an alpha component and moving quickly, the plan is to backport all of master into the next point release in the 1.1 series. On Fri, Sep 12, 2014 at 9:27 AM, Cody Koeninger wrote: > Cool, thanks for your help on this. Any chance of adding it t

Re: Adding abstraction in MLlib

2014-09-12 Thread Xiangrui Meng
Hi Egor, Thanks for the feedback! We are aware of some of the issues you mentioned and there are JIRAs created for them. Specifically, I'm pushing out the design on pipeline features and algorithm/model parameters this week. We can move our discussion to https://issues.apache.org/jira/browse/SPARK

Re: Spark authenticate enablement

2014-09-12 Thread Sandy Ryza
Hi Jun, I believe it's correct that Spark authentication only works against YARN. -Sandy On Thu, Sep 11, 2014 at 2:14 AM, Jun Feng Liu wrote: > Hi, there > > I am trying to enable the authentication on spark on standalone model. > Seems like only SparkSubmit load the properties from spark-d

A Spark Compilation Question

2014-09-12 Thread Hansu GU
I downloaded the source and imported it into IntelliJ 13.1 as a Maven project. When I used IntelliJ Build -> make Project, I encountered: Error:(44, 66) not found: type SparkFlumeProtocol val transactionTimeout: Int, val backOffInterval: Int) extends SparkFlumeProtocol with Logging { I think the

Re: Adding abstraction in MLlib

2014-09-12 Thread Reynold Xin
Xiangrui can comment more, but I believe he and Joseph are actually working on a standardized interface and pipeline feature for the 1.2 release. On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov wrote: > Some architectural suggestions on this matter - > https://github.com/apache/spark/pull/2371 > > 2014-09-1

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Cody Koeninger
Cool, thanks for your help on this. Any chance of adding it to the 1.1.1 point release, assuming there ends up being one? On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust wrote: > Hey Cody, > > Thanks for doing this! Will look at your PR later today. > > Michael > > On Wed, Sep 10, 2014 at 9

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in-place, but you can create a new RDD with a union() which is an inexpensive operator. On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur wrote: > Hi, > > We have a use case where we are plannin
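
A minimal sketch of the union() approach described above, assuming a local SparkContext and integer records standing in for the hourly batches. The object name and the helper appendBatch are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  // Hypothetical helper: RDDs are immutable, so "appending" a batch means
  // building a new RDD that spans the partitions of both inputs.
  def appendBatch[T](current: RDD[T], batch: RDD[T]): RDD[T] =
    current.union(batch)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Data accumulated so far, and a newly arrived batch.
      var table: RDD[Int] = sc.parallelize(Seq(1, 2, 3))
      val nextBatch: RDD[Int] = sc.parallelize(Seq(4, 5))

      // Rebind the reference to the combined RDD; the old RDDs are untouched.
      table = appendBatch(table, nextBatch)
      println(table.count())
    } finally {
      sc.stop()
    }
  }
}
```

union() only records that the new RDD covers the partitions of both parents, so no data is moved when it is called. Only the reference held by the long-running server changes.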

Re: Junit spark tests

2014-09-12 Thread Rajiv Abraham
Hi Sudershan, That's interesting. I don't have an answer to your question, but considering the functional nature of Spark, I have hardly had to use mock objects (maybe you could inform us of your use case). Mock object 'expectations' are in 'most' cases implementation of 'Tell, Don't ask' principle

Re: Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Some architectural suggestions on this matter - https://github.com/apache/spark/pull/2371 2014-09-12 16:38 GMT+04:00 Egor Pahomov : > Sorry, I miswrote - I meant the learners part of the framework - models already > exist. > > 2014-09-12 15:53 GMT+04:00 Christoph Sawade < > christoph.saw...@googlemail.com

Re: Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Sorry, I miswrote - I meant the learners part of the framework - models already exist. 2014-09-12 15:53 GMT+04:00 Christoph Sawade : > I totally agree, and we also discovered some drawbacks with the > classification models implementation that are based on GLMs: > > - There is no distinction between pr

Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Archit Thakur
Hi, We have a use case where we are planning to keep a SparkContext alive in a server and run queries on it. But the issue is that we have continuously flowing data that comes in batches of constant duration (say, 1 hour). Now we want to exploit the SchemaRDD and its benefits of columnar caching and compre

Re: Reporting serialized task size after task broadcast change?

2014-09-12 Thread Guru Medasani
I thought we could see this on the Spark Web UI storage tab. Maybe I was looking at something else too. On Sep 11, 2014, at 8:47 PM, Sandy Ryza wrote: > Hmm, well I can't find it now, must have been hallucinating. Do you know > off the top of your head where I'd be able to find the size to lo

Re: Adding abstraction in MLlib

2014-09-12 Thread Christoph Sawade
I totally agree, and we also discovered some drawbacks with the classification models implementation that are based on GLMs: - There is no distinction between predicting scores, classes, and calibrated scores (probabilities). For these models it is common to have access to all of them and the pred

Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Here at Yandex, while implementing gradient boosting in Spark and creating our ML tool for internal use, we found the following serious problems in MLlib: - There is no Regression/Classification model abstraction. We were building abstract data processing pipelines, which should work just with

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
Dear all, I am sorry. This was a false alarm. There was some issue in the RDD processing logic which led to a large backlog. Once I fixed the issues in my processing logic, I could see all messages being pulled nicely without any Block Removed error. I need to tune certain configurations in my Kafka

Re: PSA: SI-8835 (Iterator 'drop' method has a complexity bug causing quadratic behavior)

2014-09-12 Thread Reynold Xin
Thanks for the email, Erik. The Scala collection library implementation is a complicated beast ... On Sat, Sep 6, 2014 at 8:27 AM, Erik Erlandson wrote: > I tripped over this recently while preparing a solution for SPARK-3250 > (efficient sampling): > > Iterator 'drop' method has a complexity

Re: Questions regarding memory usage

2014-09-12 Thread Sean Owen
On Thu, Sep 11, 2014 at 10:17 PM, Tom wrote: > If I set SPARK_DRIVER_MEMORY to x GB, Spark reports > /14/09/11 15:36:41 INFO MemoryStore: MemoryStore started with capacity > ~0.55*x GB/ > *Question:* > Does this relate to spark.storage.memoryFraction (default 0.6), and is the > other 0.4 used by s