Re: DBSCAN for MLlib

2015-01-16 Thread Muhammad Ali A'råby
Please find my answers on the JIRA page. Muhammad-Ali On Thursday, January 15, 2015 3:25 AM, Xiangrui Meng wrote: Please find my comments on the JIRA page. -Xiangrui On Tue, Jan 13, 2015 at 1:49 PM, Muhammad Ali A'råby wrote: > I have to say, I have created a Jira task for it: > [SPARK

Re: RDD order guarantees

2015-01-16 Thread Ewan Higgs
Yes, I am running on a local file system. Is there a bug open for this? Mingyu Kim reported the problem last April: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html -Ewan On 01/16/2015 07:41 PM, Reynold Xin wrote: You are running on a local

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Code updated. Sorry, wrong branch uploaded before. On Fri, Jan 16, 2015 at 2:13 PM, Kushal Datta wrote: > The source code is under a new module named 'graphx'. Let me double check. > > On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott > wrote: > >> Looking at https://github.com/kdatta/tinkerpop3/co

Re: Join implementation in SparkSQL

2015-01-16 Thread Yin Huai
Hi Alex, Can you attach the output of sql("explain extended ").collect.foreach(println)? Thanks, Yin On Fri, Jan 16, 2015 at 1:54 PM, Alessandro Baretta wrote: > Reynold, > > The source file you are directing me to is a little too terse for me to > understand what exactly is going on. Let me

Re: Spark SQL API changes and stabilization

2015-01-16 Thread Reynold Xin
That's a good idea. We didn't intentionally break the doc generation. The doc generation for Catalyst is broken because we use Scala macros and we haven't had time to investigate how to fix it yet. If you have a minute and want to investigate, I can merge it in as soon as possible. On Fri, Ja

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
The source code is under a new module named 'graphx'. Let me double check. On Fri, Jan 16, 2015 at 2:11 PM, Kyle Ellrott wrote: > Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I > only see a Maven build file. Do you have some source code some place else? > > I've worked

Re: Spark SQL API changes and stabilization

2015-01-16 Thread Alessandro Baretta
Reynold, Your clarification is much appreciated. One issue though, that I would strongly encourage you to work on, is to make sure that the Scaladoc CAN be generated manually if needed (a "Use at your own risk" clause would be perfectly legitimate here). The reason I say this is that currently eve

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kyle Ellrott
Looking at https://github.com/kdatta/tinkerpop3/compare/graphx-gremlin I only see a Maven build file. Do you have some source code some place else? I've worked on a Spark-based implementation ( https://github.com/kellrott/spark-gremlin ), but it's not done and I've been tied up on other projects. I

Re: Join implementation in SparkSQL

2015-01-16 Thread Alessandro Baretta
Reynold, The source file you are directing me to is a little too terse for me to understand what exactly is going on. Let me tell you what I'm trying to do and what problems I'm encountering, so that you might be able to better direct my investigation of the SparkSQL codebase. I am computing the

Re: Implementing TinkerPop on top of GraphX

2015-01-16 Thread Kushal Datta
Hi David, Yes, we are still headed in that direction. Please take a look at the repo I sent earlier. I think that's a good starting point. Thanks, -Kushal. On Thu, Jan 15, 2015 at 8:31 AM, David Robinson wrote: > I am new to Spark and GraphX, however, I use TinkerPop-backed graphs and > think

Spectral clustering

2015-01-16 Thread Andrew Musselman
Hi, thinking of picking up this Jira ticket: https://issues.apache.org/jira/browse/SPARK-4259 Anyone done any work on this to date? Any thoughts on it before we go too far in? Thanks! Best Andrew

Spark

2015-01-16 Thread Andrew Musselman

Re: Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Michael Armbrust
+1 to adding such an optimization to parquet. The bytes are tagged specially as UTF8 in the parquet schema so it seems like it would be possible to add this. On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies wrote: > Hi, > > It seems that a reasonably large proportion of query time using Spark SQL >

Re: RDD order guarantees

2015-01-16 Thread Reynold Xin
You are running on a local file system, right? HDFS orders the files by name, but local file systems often don't. I think that explains the difference. We might be able to sort and order the partitions when we create an RDD to make this universal, though. On Fri, Jan 16, 2015 at 8:26 AM, Ewan
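To illustrate the ordering point above: directory listings on a local file system carry no ordering promise, so a reader that wants name order has to sort explicitly. The following is a minimal sketch in plain Python, not Spark's implementation; the helper names `list_part_files` and `demo` and the `part-NNNNN` naming pattern are assumptions made for illustration.

```python
import os
import tempfile


def list_part_files(directory):
    """Return part-file paths in name order, regardless of listing order."""
    names = [n for n in os.listdir(directory) if n.startswith("part-")]
    # os.listdir makes no ordering promise, so sort explicitly by name.
    return [os.path.join(directory, n) for n in sorted(names)]


def demo():
    """Create part files out of order and return their names after sorting."""
    with tempfile.TemporaryDirectory() as d:
        for i in (2, 0, 1):
            with open(os.path.join(d, "part-%05d" % i), "w") as f:
                f.write("data %d\n" % i)
        return [os.path.basename(p) for p in list_part_files(d)]
```

Sorting by name works here only because the assumed part numbers are zero-padded; unpadded numbers would need a numeric sort key instead.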

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Marcelo Vanzin
On Fri, Jan 16, 2015 at 10:07 AM, Michel Dufresne wrote: > Thanks for your reply. I should have mentioned that spark-env.sh is the > only option I found because: > >- I'm creating the SparkConf/SparkContext from a Play Application >(therefore I'm not using spark-submit script) Then you

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Michel Dufresne
Thanks for your reply. I should have mentioned that spark-env.sh is the only option I found because: - I'm passing the public IP address of the slave (which is determined in the shell script) - I'm creating the SparkConf/SparkContext from a Play Application (therefore I'm not using s

Re: Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Zhan Zhang
You can try adding it in conf/spark-defaults.conf # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three" Thanks. Zhan Zhang On Jan 16, 2015, at 9:56 AM, Michel Dufresne wrote: > Hi All, > > I'm trying to set some JVM options to the executor process
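For an application that builds its configuration in code rather than through conf/spark-defaults.conf, the same property could be set programmatically. The sketch below only assembles the key/value pair in plain Python; the helper name `build_executor_conf` is hypothetical, and the commented PySpark lines assume PySpark is installed and are not executed here.

```python
def build_executor_conf(java_options):
    """Return (key, value) pairs carrying JVM flags for executors."""
    # The property value is a single space-separated string of JVM flags.
    return [("spark.executor.extraJavaOptions", " ".join(java_options))]


pairs = build_executor_conf(["-XX:+PrintGCDetails", "-Dkey=value"])

# With PySpark available, the pairs could be applied like this (assumption,
# shown as a comment so the sketch runs without Spark):
#   from pyspark import SparkConf
#   conf = SparkConf().setAll(pairs)
```

Whether this suits the Play application described in the thread is an open question; the thread itself uses Scala, and this sketch only shows the shape of the setting.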

Setting JVM options to Spark executors in Standalone mode

2015-01-16 Thread Michel Dufresne
Hi All, I'm trying to set some JVM options to the executor processes in a standalone cluster. Here's what I have in *spark-env.sh*: jmx_opt="-Dcom.sun.management.jmxremote" > jmx_opt="${jmx_opt} -Djava.net.preferIPv4Stack=true" > jmx_opt="${jmx_opt} -Dcom.sun.management.jmxremote.port=" > jmx

RDD order guarantees

2015-01-16 Thread Ewan Higgs
Hi all, Quick one: when reading files, is the order of partitions guaranteed to be preserved? I am finding some weird behaviour where I run sortByKey() on an RDD (which has 16-byte keys) and write it to disk. If I open a python shell and run the following: for part in range(29): print
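One way to check the behaviour described above is to test whether the keys stay globally sorted when the partitions are read back in index order. This is a minimal sketch under the assumption that each partition has already been loaded as a list of byte-string keys; the function name `first_out_of_order` is hypothetical and the original loop from the message is not reproduced.

```python
def first_out_of_order(partitions):
    """Return the index of the first partition that breaks global order, or None."""
    previous_last = None
    for index, keys in enumerate(partitions):
        if not keys:
            continue  # An empty partition cannot break the order.
        if list(keys) != sorted(keys):
            return index  # Keys are unsorted within this partition.
        if previous_last is not None and keys[0] < previous_last:
            return index  # This partition starts before the previous one ended.
        previous_last = keys[-1]
    return None
```

A return value of `None` means the partitions, read in the given order, form one sorted sequence; any integer points at the first partition that appears out of place.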

Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Mick Davies
Hi, It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions, as many are duplicated? Details are outlined in the conversation in the user mailing
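The idea of avoiding repeated decoding of duplicated values can be illustrated with a small cache. This is a plain-Python sketch of the general technique, not how Spark SQL or Parquet handle it; the function name `decode_utf8`, the cache size, and the sample values are assumptions for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=4096)
def decode_utf8(raw):
    """Decode a UTF-8 byte string, reusing the result for repeated inputs."""
    return raw.decode("utf-8")


# Duplicated values are decoded once and then served from the cache.
values = [b"spark", b"parquet", b"spark", b"spark"]
decoded = [decode_utf8(v) for v in values]
info = decode_utf8.cache_info()  # Reports hit and miss counts.
```

A cache like this pays off only when duplicates are frequent enough to outweigh the lookup cost.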

Fwd: LinearRegressionWithSGD accuracy

2015-01-16 Thread Robin East
Begin forwarded message: > From: Robin East > Date: 16 January 2015 11:35:23 GMT > To: Joseph Bradley > Cc: Yana Kadiyska , Devl Devel > > Subject: Re: LinearRegressionWithSGD accuracy > > Yes with scaled data intercept would be 5000 but the code as it stands is > runn