about spark interactive shell

2014-05-12 Thread fengshen
I emailed the user list, but nobody replied, so I am emailing this list in the hope of a reply. I am now using Spark in production, and I notice the Spark driver holds the RDDs and the DAG... and the executors will try to register with the driver. But in my company the executors do not register with the client because of

Any ideas on SPARK-1021?

2014-05-12 Thread Mark Hamstra
I'm trying to decide whether attacking the underlying issue of RangePartitioner running eager jobs in rangeBounds (i.e. SPARK-1021) is a better option than a messy workaround for some async job-handling stuff that I am working on. It looks like there have been a couple of aborted attempts to solve

[EC2] r3 instance type

2014-05-12 Thread Han JU
Hi, I'm modifying the ec2 script for the new r3 instance support, but there's a problem with the instance storage. For example, `r3.large` has a single 32GB SSD; the problem is that it's an SSD with TRIM support and is not automatically formatted and mounted. `lsblk` gives me this after ec

Re: LabeledPoint dump LibSVM if SparseVector

2014-05-12 Thread Xiangrui Meng
Hi Deb, There is a saveAsLibSVMFile in MLUtils now. Also, I submitted a PR for standardizing text format of vectors and labeled point: https://github.com/apache/spark/pull/685 Best, Xiangrui On Sun, May 11, 2014 at 9:40 AM, Debasish Das wrote: > Hi, > > I need to change the toString on LabeledP
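For reference, the LibSVM text format being standardized here is one record per line, `label index:value ...`, with 1-based feature indices and zero entries omitted. A plain-Scala sketch of that format (the helper name and the choice to drop zeros are illustrative assumptions, not the MLUtils implementation):

```scala
// Sketch of the LibSVM text format: one record per line,
// "label index:value ..." with 1-based indices, zeros omitted.
// Plain Scala, no Spark dependency.
def toLibSVMLine(label: Double, indices: Array[Int], values: Array[Double]): String = {
  val features = indices.zip(values).collect {
    case (i, v) if v != 0.0 => s"${i + 1}:$v"   // LibSVM indices are 1-based
  }
  (label.toString +: features).mkString(" ")
}

println(toLibSVMLine(1.0, Array(0, 2), Array(0.5, 3.0)))  // 1.0 1:0.5 3:3.0
```

A sparse LabeledPoint with entries at indices 0 and 2 thus serializes to the 1-based columns 1 and 3.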

Re: mllib vector templates

2014-05-12 Thread Debasish Das
Hi, I see ALS is still using Array[Int], but for other MLlib algorithms we moved to Vector[Double] so that they can support either dense or sparse formats... I know ALS can stay with Array[Int] due to the well-defined Netflix input format, but it would help if we moved ALS to Vector[Dou
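For readers following along, the dense-vs-sparse distinction under discussion can be sketched in plain Scala (a toy model, not MLlib's actual Vector API): a sparse vector stores only the non-zero (index, value) pairs but can expose the same dense view.

```scala
// Toy model of the dense/sparse vector split (not MLlib's Vector API).
sealed trait Vec { def size: Int; def toArray: Array[Double] }

case class DenseVec(values: Array[Double]) extends Vec {
  def size = values.length
  def toArray = values
}

case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec {
  def toArray = {
    val arr = new Array[Double](size)       // zeros by default
    indices.zip(values).foreach { case (i, v) => arr(i) = v }
    arr
  }
}

val sparse = SparseVec(4, Array(1, 3), Array(2.0, 5.0))
println(sparse.toArray.toList)  // List(0.0, 2.0, 0.0, 5.0)
```

An algorithm written against the common trait works unchanged on either representation, which is the portability argument being made for ALS.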

Re: Bug in KryoSerializer under Mesos [work-around included]

2014-05-12 Thread Matei Zaharia
Hey Soren, are you sure that the JAR you used on the executors is for the right version of Spark? Maybe they’re running an older version. The Kryo serializer should be initialized the same way on both. Matei On May 12, 2014, at 10:39 AM, Soren Macbeth wrote: > I finally managed to track down

Re: Spark on Scala 2.11

2014-05-12 Thread Anand Avati
Matei, Thanks for confirming. I was looking specifically at the REPL part and how it can be significantly simplified with 2.11 Scala, without having to inherit a full copy of a refactored repl inside Spark. I am happy to investigate/contribute a simpler 2.11-based REPL if this were seen as a pri

Kryo not default?

2014-05-12 Thread Anand Avati
Hi, Can someone share the reason why Kryo serializer is not the default? Is there anything to be careful about (because of which it is not enabled by default)? Thanks!

Re: [EC2] r3 instance type

2014-05-12 Thread Shivaram Venkataraman
I ran into this a couple of days back as well. Yes, we need to check if /dev/xvdb is formatted and if not create xfs or some such filesystem on it. We will need to change the deployment script and you can do that (similar to EBS volumes) at https://github.com/mesos/spark-ec2/blob/v2/setup-slave.sh

Bug in KryoSerializer under Mesos [work-around included]

2014-05-12 Thread Soren Macbeth
I finally managed to track down the source of the Kryo issues that I was having under Mesos. What happens is that, for a reason I haven't tracked down yet, a handful of the Scala collection classes from chill-scala don't get registered by the Mesos executors, but they do all get registered in th

Re: Spark on Scala 2.11

2014-05-12 Thread Jacek Laskowski
On Sun, May 11, 2014 at 11:08 PM, Matei Zaharia wrote: > We do want to support it eventually, possibly as early as Spark 1.1 (which > we’d cross-build on Scala 2.10 and 2.11). If someone wants to look at it > before, feel free to do so! Scala 2.11 is very close to 2.10 so I think > things will

Re: Spark on Scala 2.11

2014-05-12 Thread Matei Zaharia
Anyone can actually open a JIRA on https://issues.apache.org/jira/browse/SPARK. I’ve created one for this now: https://issues.apache.org/jira/browse/SPARK-1812. Matei On May 12, 2014, at 3:54 PM, Jacek Laskowski wrote: > On Sun, May 11, 2014 at 11:08 PM, Matei Zaharia > wrote: >> We do want

Re: Kryo not default?

2014-05-12 Thread Matei Zaharia
It was just because it might not work with some user data types that are Serializable. But we should investigate it, as it’s the easiest thing one can enable to improve performance. Matei On May 12, 2014, at 2:47 PM, Anand Avati wrote: > Hi, > Can someone share the reason why Kryo serializer

Re: Kryo not default?

2014-05-12 Thread Andrew Ash
As an example of where it sometimes doesn't work, in older versions of Kryo / Chill the Joda LocalDate class didn't serialize properly -- https://groups.google.com/forum/#!topic/cascalog-user/35cdnNIamKU On Mon, May 12, 2014 at 4:39 PM, Reynold Xin wrote: > The main reason is that it doesn't al

Re: Any ideas on SPARK-1021?

2014-05-12 Thread Andrew Ash
This is the issue where .sortByKey() launches a cluster job when it shouldn't because it's a transformation not an action. https://issues.apache.org/jira/browse/SPARK-1021 I'd appreciate a fix too but don't currently have any thoughts on how to proceed forward. Andrew On Thu, May 8, 2014 at 2:
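For context on why the eager job happens: sortByKey uses a RangePartitioner, which must sample the keys up front to choose partition boundaries, and that sampling is itself a cluster job. A rough plain-Scala sketch of the boundary computation over a local sample (illustrative only; the real RangePartitioner samples an RDD across partitions rather than sorting a local Seq):

```scala
// Pick numPartitions - 1 boundary keys from a sorted sample, so that keys
// can later be routed to range partitions by comparing against the bounds.
def rangeBounds(sample: Seq[Int], numPartitions: Int): Seq[Int] = {
  val sorted = sample.sorted
  (1 until numPartitions).map { i =>
    // evenly spaced positions in the sorted sample
    sorted((i * sorted.length / numPartitions).min(sorted.length - 1))
  }
}

println(rangeBounds(Seq(9, 1, 5, 3, 7, 2, 8, 4), 4))  // Vector(3, 5, 8)
```

Because the bounds depend on the actual key distribution, they cannot be computed lazily, which is exactly why calling the "transformation" today triggers a job.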

Re: Spark on Scala 2.11

2014-05-12 Thread Jacek Laskowski
Thanks a lot! Jacek On Tue, May 13, 2014 at 1:54 AM, Matei Zaharia wrote: > Anyone can actually open a JIRA on > https://issues.apache.org/jira/browse/SPARK. I’ve created one for this now: > https://issues.apache.org/jira/browse/SPARK-1812. > > Matei > > On May 12, 2014, at 3:54 PM, Jacek Lask

Re: Spark on Scala 2.11

2014-05-12 Thread Matei Zaharia
We can build the REPL separately for each version of Scala, or even give that package a different name in Scala 2.11. Scala 2.11’s REPL actually added two flags, -Yrepl-class-based and -Yrepl-outdir, that encompass the two modifications we made to the REPL (using classes instead of objects to w

Re: Spark on Scala 2.11

2014-05-12 Thread Anand Avati
On Mon, May 12, 2014 at 6:27 PM, Matei Zaharia wrote: > We can build the REPL separately for each version of Scala, or even give > that package a different name in Scala 2.11. > OK. > Scala 2.11’s REPL actually added two flags, -Yrepl-class-based and > -Yrepl-outdir, that encompass the two modi

Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Andrew Ash
Hi Spark devs, First of all, huge congrats on the parquet integration with SparkSQL! This is an incredible direction forward and something I can see being very broadly useful. I was doing some preliminary tests to see how it works with one of my workflows, and wanted to share some numbers that p

Re: Updating docs for running on Mesos

2014-05-12 Thread Andrew Ash
For trimming the Running Alongside Hadoop section, I mostly think there should be a separate Spark+HDFS section and that the CDH+HDP page should be merged into it, but I suppose that's a separate docs change. On Sun, May 11, 2014 at 4:28 PM, Andy Konwinski wrote: > Thanks for suggesting this and

Re: Requirements of objects stored in RDDs

2014-05-12 Thread Andrew Ash
An RDD can hold objects of any type. If you generally think of it as a distributed Collection, then you won't ever be that far off. As far as serialization, the contents of an RDD must be serializable. There are two serialization libraries you can use with Spark: normal Java serialization or Kry
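The serialization requirement can be seen with a plain-JDK round trip, which is essentially what Spark's default Java serialization does to ship RDD elements between JVMs (a self-contained sketch; `roundTrip` is a hypothetical helper, not a Spark API):

```scala
import java.io._

// Whatever you put in an RDD must survive a round trip like this one.
// Scala tuples (and case classes) are Serializable out of the box.
def roundTrip[T](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)                       // fails here if T isn't Serializable
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

println(roundTrip((1, "a")))  // (1,a)
```

Anything that throws NotSerializableException in a round trip like this will throw the same exception the first time Spark has to move it across the network or spill it to disk.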

Re: Updating docs for running on Mesos

2014-05-12 Thread Andrew Ash
As far as I know, the upstream doesn't release binaries, only source code. The downloads page for 0.18.0 only has a source tarball. Is there a binary release somewhere from Mesos that I'm missing? On Sun, May 11, 2014 at 2:16 PM, Patrick Wendell wrote: >

Re: Kryo not default?

2014-05-12 Thread Reynold Xin
The main reason is that it doesn't always work (e.g. sometimes the application already has special serialization / externalization written for Java which doesn't work in Kryo). On Mon, May 12, 2014 at 5:47 PM, Anand Avati wrote: > Hi, > Can someone share the reason why Kryo serializer is not t
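As a sketch of the kind of class this refers to: one that relies on Java's private writeObject/readObject hooks to rebuild transient state. Java serialization invokes those hooks, while Kryo's default field-based serializer does not, so the cache below would stay null under Kryo. This is plain JDK code with illustrative names; Kryo itself is not exercised here:

```scala
import java.io._

// A class whose correctness depends on Java's readObject hook.
class CachedValue(val id: Int) extends Serializable {
  @transient var cache: String = "live-" + id   // never written to the stream

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    cache = "rebuilt-" + id   // custom hook restores the transient field
  }
}

def javaRoundTrip[T](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

println(javaRoundTrip(new CachedValue(7)).cache)  // rebuilt-7
```

A serializer that copies fields without running the hook would leave `cache` null after deserialization, which is the class of silent breakage that makes Kryo unsafe as a blanket default.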

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Reynold Xin
Thanks for the experiments and analysis! I think Michael already submitted a patch that avoids scanning all columns for count(*) or count(1). On Mon, May 12, 2014 at 9:46 PM, Andrew Ash wrote: > Hi Spark devs, > > First of all, huge congrats on the parquet integration with SparkSQL! This > is
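The optimization Reynold mentions works because a columnar format like Parquet keeps row counts in per-row-group metadata, so count(*) never needs to touch column data. A toy plain-Scala model of the idea (the names here are illustrative, not Parquet's actual layout or API):

```scala
// Toy model: each row group carries a row count in its metadata,
// alongside the (potentially large) column chunks.
case class RowGroup(rowCount: Long, columns: Map[String, Array[Any]])

// count(*) reads only the metadata; the column arrays are never touched.
def countStar(groups: Seq[RowGroup]): Long =
  groups.map(_.rowCount).sum

println(countStar(Seq(RowGroup(3L, Map.empty), RowGroup(2L, Map.empty))))  // 5
```

This is why a metadata-only count can be orders of magnitude faster than a full scan on wide tables, matching the gap Andrew measured.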

Re: Updating docs for running on Mesos

2014-05-12 Thread Andrew Ash
I have a draft of my proposed changes here: https://github.com/apache/spark/pull/756 https://issues.apache.org/jira/browse/SPARK-1818 Thanks! Andrew On Mon, May 12, 2014 at 9:57 PM, Andrew Ash wrote: > As far as I know, the upstream doesn't release binaries, only source code. > The downloads