Re: enum-like types in Spark

2015-03-04 Thread Aaron Davidson
That's kinda annoying, but it's just a little extra boilerplate. Can you call it as StorageLevel.DiskOnly() from Java? Would it also work if they were case classes with empty constructors, without the field? On Wed, Mar 4, 2015 at 11:35 PM, Xiangrui Meng wrote: > `case object` inside an `object`

Re: enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
`case object` inside an `object` doesn't show up in Java. This is the minimal code I found to make everything show up correctly in both Scala and Java: sealed abstract class StorageLevel // cannot be a trait object StorageLevel { private[this] case object _MemoryOnly extends StorageLevel fina
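
Based on the (truncated) snippet above, here is a sketch of the pattern being described; the second value and the call-site comments are illustrative additions, not the exact code from the message.

    sealed abstract class StorageLevel // cannot be a trait

    object StorageLevel {
      // The case objects stay private; what Java sees are the public vals on
      // the companion, which typically surface as static-forwarder methods.
      private[this] case object _MemoryOnly extends StorageLevel
      final val MemoryOnly: StorageLevel = _MemoryOnly

      private[this] case object _DiskOnly extends StorageLevel
      final val DiskOnly: StorageLevel = _DiskOnly
    }

    // Scala call site: StorageLevel.MemoryOnly
    // Java call site:  StorageLevel.MemoryOnly()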

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
Yep, that makes sense. Thanks for the clarification! Mingyu On 3/4/15, 8:05 PM, "Patrick Wendell" wrote: >Yeah, it will result in a second serialized copy of the array (costing >some memory). But the computational overhead should be very small. The >absolute worst case here will be when doi

Fwd: Unable to Read/Write Avro RDD on cluster.

2015-03-04 Thread ๏̯͡๏
I am trying to read an Avro RDD, transform it, and write it back out. It runs fine locally, but when I run it on the cluster I see issues with Avro. export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1 export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf" export HADOOP_CONF_DIR=/apache/hadoop/
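
For context, a minimal sketch of reading Avro as an RDD with the new Hadoop API; the input path and field name are placeholders, and avro-mapred must be on the executor classpath, which is exactly the kind of thing that works locally but breaks on YARN.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object AvroReadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("avro-read-sketch"))

        // Hypothetical input path; avro-mapred must be visible to the executors.
        val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
          AvroKeyInputFormat[GenericRecord]]("hdfs:///path/to/input.avro")

        // GenericRecord is not Java-serializable, so extract plain values
        // immediately before shuffling or collecting.
        val names = records.map { case (key, _) => key.datum().get("name").toString }
        names.take(5).foreach(println)

        sc.stop()
      }
    }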

Re: ideas for MLlib development

2015-03-04 Thread Robert Dodier
Thanks for your reply, Evan. > It may make sense to have a more general Gibbs sampling > framework, but it might be good to have a few desired applications > in mind (e.g. higher level models that rely on Gibbs) to help API > design, parallelization strategy, etc. I think I'm more interested in a

Re: enum-like types in Spark

2015-03-04 Thread Patrick Wendell
I like #4 as well and agree with Aaron's suggestion. - Patrick On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson wrote: > I'm cool with #4 as well, but make sure we dictate that the values should > be defined within an object with the same name as the enumeration (like we > do for StorageLevel). Ot

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Yeah, it will result in a second serialized copy of the array (costing some memory). But the computational overhead should be very small. The absolute worst case here will be when doing a collect() or something similar that just bundles the entire partition. - Patrick On Wed, Mar 4, 2015 at 5:47

Re: enum-like types in Spark

2015-03-04 Thread Aaron Davidson
I'm cool with #4 as well, but make sure we dictate that the values should be defined within an object with the same name as the enumeration (like we do for StorageLevel). Otherwise we may pollute a higher namespace. e.g. we SHOULD do: trait StorageLevel object StorageLevel { case object MemoryO
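
To make the namespacing point concrete, a sketch of the two shapes being contrasted; the specific members are illustrative.

    // Values defined inside an object with the same name as the enumeration:
    trait StorageLevel
    object StorageLevel {
      case object MemoryOnly extends StorageLevel
      case object DiskOnly extends StorageLevel
    }

    // versus the shape to avoid, where the values leak into the enclosing
    // package's namespace:
    // trait StorageLevel
    // case object MemoryOnly extends StorageLevel
    // case object DiskOnly extends StorageLevel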

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
The concern is really just the runtime overhead and memory footprint of Java-serializing an already-serialized byte array again. We originally noticed this when we were using RDD.toLocalIterator() which serializes the entire 64MB partition. We worked around this issue by kryo-serializing and snappy
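
Not Spark's actual code path, but a toy sketch of the cost profile being discussed: Java-serializing a byte array that is already the output of another serializer makes a second in-memory copy plus a small header, with no expensive object-graph traversal. The payload size is scaled down from the 64MB partition mentioned above.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    object DoubleSerializationSketch {
      def javaSerialize(obj: AnyRef): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(obj)
        out.close()
        bytes.toByteArray
      }

      def main(args: Array[String]): Unit = {
        // Pretend this is a task result already serialized by Kryo or Java.
        val alreadySerialized: Array[Byte] = Array.fill[Byte](16 * 1024 * 1024)(1.toByte)

        val wrapped = javaSerialize(alreadySerialized)
        println(s"payload: ${alreadySerialized.length} bytes, " +
          s"re-serialized: ${wrapped.length} bytes " +
          s"(overhead: ${wrapped.length - alreadySerialized.length} bytes)")
      }
    }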

Re: enum-like types in Spark

2015-03-04 Thread Michael Armbrust
#4 with a preference for CamelCaseEnums On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley wrote: > another vote for #4 > People are already used to adding "()" in Java. > > > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch wrote: > > > #4 but with MemoryOnly (more scala-like) > > > > http://docs.sc

Re: enum-like types in Spark

2015-03-04 Thread Joseph Bradley
another vote for #4 People are already used to adding "()" in Java. On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch wrote: > #4 but with MemoryOnly (more scala-like) > > http://docs.scala-lang.org/style/naming-conventions.html > > Constants, Values, Variable and Methods > > Constant names should

Re: enum-like types in Spark

2015-03-04 Thread Stephen Boesch
#4 but with MemoryOnly (more scala-like) http://docs.scala-lang.org/style/naming-conventions.html Constants, Values, Variable and Methods Constant names should be in upper camel case. That is, if the member is final, immutable and it belongs to a package object or an object, it may be considered

enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
Hi all, There are many places where we use enum-like types in Spark, but in different ways. Every approach has both pros and cons. I wonder whether there should be an “official” approach for enum-like types in Spark. 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc) * All types sho
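
For reference, option 1 roughly corresponds to how SchedulingMode is defined today; a minimal sketch:

    // Scala's Enumeration (option 1). Java callers only see the generic
    // Enumeration.Value type, which is one of the trade-offs under discussion.
    object SchedulingMode extends Enumeration {
      type SchedulingMode = Value
      val FAIR, FIFO, NONE = Value
    }

    // Scala usage:
    //   import SchedulingMode._
    //   val mode: SchedulingMode = FIFO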

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Hey Mingyu, I think it's broken out separately so we can record the time taken to serialize the result. Once we've serialized it once, the second serialization should be really simple since it's just wrapping something that has already been turned into a byte buffer. Do you see a specific issue with

Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
Hi all, It looks like the result of a task is serialized twice, once by the serializer (i.e. Java/Kryo depending on configuration) and once again by the closure serializer (i.e. Java). To link the actual code, the first one: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spar

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Sean Owen
I think we will have to fix https://issues.apache.org/jira/browse/SPARK-5143 as well before the final 1.3.x. But yes everything else checks out for me, including sigs and hashes and building the source release. I have been following JIRA closely and am not aware of other blockers besides the ones

short jenkins 7am downtime tomorrow morning (3-5-15)

2015-03-04 Thread shane knapp
the master and workers need some system and package updates, and i'll also be rebooting the machines. this shouldn't take very long to perform, and i expect jenkins to be back up and building by 9am at the *latest*. important note: i will NOT be updating jenkins or any of the plugins dur

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Marcelo Vanzin
-1 (non-binding) because of SPARK-6144. But aside from that I ran a set of tests on top of standalone and yarn and things look good. On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.3.0! > > The tag to be voted

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Patrick Wendell
Hey Marcelo, Yes - I agree. That one trickled in just as I was packaging this RC. However, I still put this out here to allow people to test the existing fixes, etc. - Patrick On Wed, Mar 4, 2015 at 9:26 AM, Marcelo Vanzin wrote: > I haven't tested the rc2 bits yet, but I'd consider > https://i

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Marcelo Vanzin
I haven't tested the rc2 bits yet, but I'd consider https://issues.apache.org/jira/browse/SPARK-6144 a serious regression from 1.2 (since it affects existing "addFile()" functionality if the URL is "hdfs:..."). Will test other parts separately. On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell wro
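
For readers unfamiliar with the regression, this is the kind of call it affects; the HDFS path and file name are placeholders.

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    object AddFileSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("addfile-sketch"))

        // Distribute a file that lives on HDFS to every executor.
        sc.addFile("hdfs:///some/dir/lookup-table.txt")

        // On executors, the file is fetched locally and resolved by name.
        val paths = sc.parallelize(1 to 4).map { _ =>
          SparkFiles.get("lookup-table.txt")
        }
        paths.collect().foreach(println)

        sc.stop()
      }
    }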

Re: Google Summer of Code - Quick Query

2015-03-04 Thread Ulrich Stärk
Hi Manoj, this question is best asked on the Spark mailing lists (copied). From a formal point of view, all that counts is your proposal in Melange once applications start, but your mentor or the project you wish to contribute to may have additional requirements. Cheers, Uli On 2015-03-03 08:54

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Robin East
+1 (subject to comments on ec2 issues below) machine 1: Macbook Air, OSX 10.10.2 (Yosemite), Java 8 machine 2: iMac, OSX 10.8.4, Java 7 1. mvn clean package -DskipTests (33min/13min) 2. ran SVM benchmark https://github.com/insidedctm/spark-mllib-benchmark EC2 issues: 1) Unable to successfully

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application? Thank
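
One common pattern at the time this was asked (Spark 1.2.x APIs) is to turn each micro-batch RDD into a SchemaRDD inside foreachRDD and query it with SQL; a sketch, with the socket source and the Record schema as placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Record(word: String)

    object StreamingSchemaRDDSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("streaming-schemardd-sketch"), Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)

        lines.flatMap(_.split(" ")).map(Record(_)).foreachRDD { rdd =>
          val sqlContext = new SQLContext(rdd.sparkContext)
          import sqlContext.createSchemaRDD // implicit RDD[Record] -> SchemaRDD

          // Register each micro-batch as a temp table and query it with SQL.
          rdd.registerTempTable("words")
          sqlContext.sql("SELECT word, COUNT(*) AS c FROM words GROUP BY word")
            .collect().foreach(println)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }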

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Krishna Sankar
It is the LR over car-data at https://github.com/xsankar/cloaked-ironman. 1.2.0 gives Mean Squared Error = 40.8130551358; 1.3.0 gives Mean Squared Error = 105.857603953. I will verify it one more time tomorrow. Cheers On Tue, Mar 3, 2015 at 11:28 PM, Xiangrui Meng wrote: > On Tue, Mar 3, 2015 a