Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
> Do you extract only the stuff needed? What are the algorithm parameters? > > > On 07 Jun 2016, at 13:09, Franc Carter wrote: > > > > > > Hi, > > > > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and > am interested in how it might

Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances, etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300 c
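The snippets above do not include the training code or the algorithm parameters the reply asks about. For context, a minimal PySpark sketch of training a RandomForest regressor on Spark 1.6 follows; the DataFrame (df), column names and parameter values are assumptions, and numTrees / maxDepth / maxBins are the knobs that most directly affect how much work there is to spread across executors.

    # Hedged sketch only -- the thread does not show the actual code.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    # df and the "label" column name are assumed for illustration
    feature_cols = [c for c in df.columns if c != "label"]
    assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

    rf = RandomForestRegressor(
        labelCol="label",
        featuresCol="features",
        numTrees=100,   # more trees -> more tasks to parallelise across instances
        maxDepth=10,    # deeper trees raise memory pressure per executor
        maxBins=32,
    )
    model = rf.fit(assembled)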

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
graphframes Python code when it is loaded as > a Spark package. > > To workaround this, I extract the graphframes Python directory locally > where I run pyspark into a directory called graphframes. > > > > > > > On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carte

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using: pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try 'from graphframes import *' I get 'No module named graphframes'. Without '--master yarn' it
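The reply above is truncated, but it describes extracting the graphframes Python directory locally where pyspark is launched. A hedged sketch of that idea, with a hypothetical extraction path, is to make the extracted package importable on the driver:

    # Launched as in the message:
    #   pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
    # Workaround sketch (assumption): put the extracted graphframes Python
    # package on the driver's module search path before importing it.
    import sys
    sys.path.insert(0, "/path/where/graphframes/python/was/extracted")  # hypothetical

    from graphframes import GraphFrame   # should now resolve in the REPL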

Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this, the approach was to use a udf() cheers On 21 February 2016 at 22:41, Franc Carter wrote: > > I have a DataFrame that has a Python dict() as one of the columns. I'd > like to filter the DataFrame for those Rows where the dict() contains a
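The udf itself is not shown in the thread; a minimal sketch of that approach, assuming params is the dict/map column from the question and 'name' is the key being tested for:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # True when the dict in 'params' contains the key 'name'
    has_name = udf(lambda params: params is not None and "name" in params, BooleanType())

    DF2 = DF1.filter(has_name(DF1.params))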

filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value, e.g. something like this: DF2 = DF1.filter('name' in DF1.params) but that gives me this error: ValueError: Cannot convert column i

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
end the last added column (in the loop) will be the added column, like in > my code above. > > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter > wrote: > >> >> I had problems doing this as well - I ended up using 'withColumn', it's >> not particularly g

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn', it's not particularly graceful but it worked (1.5.2 on AWS EMR) cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > > I am trying to create dummy variables in sparkR by creating new columns > for categorical vari

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks cheers On 10 January 2016 at 22:35, Blaž Šnuderl wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:07 AM, Franc Car
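A hedged sketch of the window-function approach referenced above, using the ID/Year/Value columns from the original question (on Spark 1.x window functions generally need a HiveContext):

    from pyspark.sql import Window
    from pyspark.sql.functions import lag, col

    # previous year's Value within each ID, ordered by Year
    w = Window.partitionBy("ID").orderBy("Year")
    DF2 = DF1.withColumn("Delta", col("Value") - lag("Value", 1).over(w))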

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
13 101 > 32014 102 > > What's your desired output ? > > Femi > > > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter > wrote: > >> >> Hi, >> >> I have a DataFrame with the columns >> >> ID,Year,Value >> >> I'd

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) x = when(n==7,day).otherwise(sun) return x On 10 January 2016 at 08:41, Franc Carter w
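For readability, the function quoted above with its imports added; it builds a Column expression, which is why when/otherwise is needed rather than a plain Python if/else:

    from pyspark.sql.functions import next_day, datediff, when

    def getSunday(day):
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        n = datediff(sun, day)
        return when(n == 7, day).otherwise(sun)

    # usage sketch, with a hypothetical column name:
    # df = df.withColumn("week_ending", getSunday(df["event_date"]))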

pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi, I have a DataFrame with the columns ID,Year,Value I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1 At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .) This looks cumbersome to me, is there abt

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what that means cheers On 9 January 2016 at 14:45, Franc Carter wrote: > > Hi, > > I'm trying to write a short function that returns the last Sunday of the > week of a given date, co

pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) if (n == 7): return day else: return sun this g

Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
t with sparkR.init()? > > > _____ > From: Franc Carter > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > > > > Hi, > > I'm having trouble working out how to get the number of execut

number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors. However, if I start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client", sparkEnvir

Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
t; integer), …) > > read.df ( …, schema = schema) > > > > *From:* Franc Carter [mailto:franc.car...@rozettatech.com] > *Sent:* Wednesday, August 19, 2015 1:48 PM > *To:* user@spark.apache.org > *Subject:* SparkR csv without headers > > > > > > H

SparkR csv without headers

2015-08-18 Thread Franc Carter
-- *Franc Carter* | Systems Architect | RoZetta Technology L4. 55 Harrington Street, THE ROCKS, NSW, 2000 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA *T* +61 2 8355 2515

subscribe

2015-08-05 Thread Franc Carter
subscribe

Column operation on Spark RDDs.

2015-06-04 Thread Carter
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations are on columns, e.g., I need to create many intermediate variables from different columns. What is the most efficient way to do this? For example, if my dataRDD[Array[String]] is like below: 123, 523, 534, ..., 893
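The thread is Scala (RDD[Array[String]]), but the idea can be sketched in PySpark: compute all the intermediate values you need from a row in one pass over the RDD rather than building a separate RDD per derived column. The input path and column positions below are illustrative only.

    # sc is the SparkContext provided by the pyspark shell
    rdd = sc.textFile("data.csv").map(lambda line: line.split(","))  # hypothetical input

    def derive(cols):
        a, b = float(cols[0]), float(cols[2])   # illustrative column positions
        return (a + b, a * b)                   # several intermediates computed together

    derived = rdd.map(derive)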

Re: How to add a column to a spark RDD with many columns?

2015-05-02 Thread Carter
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

How to add a column to a spark RDD with many columns?

2015-04-30 Thread Carter
Hi all, I have an RDD with *MANY* columns (e.g., *hundreds*), how do I add one more column at the end of this RDD? For example, if my RDD is like below: 123, 523, 534, ..., 893 536, 98, 1623, ..., 98472 537, 89, 83640, ..., 9265 7297, 98364, 9, ..., 735 .. 29, 94, 956,
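A hedged sketch of one common approach (shown in PySpark rather than the Scala of the question): map over the rows and append the new value. The derived value here is purely illustrative.

    # sc is the SparkContext provided by the pyspark shell
    rdd = sc.parallelize([["123", "523", "534"], ["536", "98", "1623"]])  # toy rows

    def extra_col(row):
        return str(sum(int(x) for x in row))   # hypothetical derived value

    rdd_plus = rdd.map(lambda row: row + [extra_col(row)])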

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
> blocks: 48. Algorithm and capacity permitting, you've just massively > boosted your load time. Downstream, if data can be thinned down, then you > can start looking more at things you can do on a single host : a machine > that can be in your Hadoop cluster. Ask YARN nicely

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
> >>> it didn't help... > >>> > >>> **`--deploy-mode=cluster`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:707

How to sum up the values in the columns of a dataset in Scala?

2015-02-12 Thread Carter
I am new to Scala. I have a dataset with many columns, each column has a column name. Given several column names (these column names are not fixed, they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way by using f
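A hedged PySpark sketch of one reading of this (a total per named column, with the column list built at runtime); the toy DataFrame and column names are assumptions.

    from pyspark.sql import functions as F

    # toy DataFrame; sqlContext is the one provided by the pyspark shell
    df = sqlContext.createDataFrame([(1, 2, 3), (4, 5, 6)], ["colA", "colB", "colC"])

    cols_to_sum = ["colA", "colB"]                           # generated dynamically
    totals = df.agg(*[F.sum(c).alias(c) for c in cols_to_sum])

    # or, for a per-row sum across those columns:
    # df = df.withColumn("row_total", sum(F.col(c) for c in cols_to_sum))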

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
ffset in >>> response to >>> RequestTimeTooSkewed error. Local machine and S3 server disagree on the >>> time by approximately 0 seconds. Retrying connection. >>> >>> After that there are tons of 403/forbidden errors and then job fails. >>> It's s

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
> > Happy hacking > > Chris > > From: Franc Carter > Date: Wednesday, 11 February 2015 10:03 > To: Paolo Platter > Cc: Mike Trienis , "user@spark.apache.org" < > user@spark.apache.org> > Subject: Re: Datastore HDFS vs Cassandra > > > One a

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
------ > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA

Re: How to create spark AMI in AWS

2015-02-09 Thread Franc Carter
cala during building the AMI? > > > Thanks. > > Guodong > -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter
he Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA

Does DecisionTree model in MLlib deal with missing values?

2015-01-10 Thread Carter
Hi, I am new to the MLlib in Spark. Can the DecisionTree model in MLlib deal with missing values? If so, what data structure should I use for the input? Moreover, my data has categorical features, but the LabeledPoint requires "double" data type, in this case what can I do? Thank you very much.
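The missing-values part of the question is not answered in this snippet, but the categorical-feature part can be sketched: in MLlib, categories are encoded as 0.0 .. k-1.0 doubles inside the LabeledPoint and declared through categoricalFeaturesInfo. The feature indices, arity and toy data below are assumptions.

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # feature 0: categorical with 3 categories encoded as 0.0/1.0/2.0; feature 1: numeric
    # sc is the SparkContext provided by the pyspark shell
    data = sc.parallelize([
        LabeledPoint(1.0, [0.0, 12.5]),
        LabeledPoint(0.0, [2.0,  7.1]),
    ])

    model = DecisionTree.trainClassifier(
        data,
        numClasses=2,
        categoricalFeaturesInfo={0: 3},   # {feature index: number of categories}
        impurity="gini",
        maxDepth=5,
    )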

Re: Reading from a centralized store

2015-01-06 Thread Franc Carter
15 at 6:59 AM, Cody Koeninger wrote: > No, most rdds partition input data appropriately. > > On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter > wrote: > >> >> One more question, to be clarify. Will every node pull in all the data ? >> >> thanks >> >&

Re: Reading from a centralized store

2015-01-06 Thread Franc Carter
r implement preferred > locations. You can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark C

Re: Reading from a centralized store

2015-01-05 Thread Franc Carter
can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark Cluster behaves when the data it i

Reading from a centralized store

2015-01-05 Thread Franc Carter
Hi, I'm trying to understand how a Spark Cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS etc). Does every node in the cluster retrieve all the data from the central store? thanks -- *Franc Carter* | Systems Arch

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks for your reply Wei, will try this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Thanks a lot Krishna, this works for me. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compile-a-Spark-project-in-Scala-IDE-for-Eclipse-tp7197p7223.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Carter
Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and clicked "Run" there was an error on this line of code "import org.apache.spark.SparkContext": "object apache is not a member of package org". I guess I need to import the Spark dependency into Scala IDE for Ec

Re: How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Thank you very much Gerard. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-help-or-explanation-for-the-functions-in-Spark-shell-tp7191p7193.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to get the help or explanation for the functions in Spark shell?

2014-06-08 Thread Carter
Hi All, I am new to Spark. In the Spark shell, how can I get the help or explanation for those functions that I can use for a variable or RDD? For example, after I input a RDD's name with a dot (.) at the end, if I press the Tab key, a list of functions that I can use for this RDD will be displa

RE: K-nearest neighbors search in Spark

2014-05-28 Thread Carter
Hi Andrew, Thank you for your info. I will have a look at these links. Thanks, Carter Date: Tue, 27 May 2014 09:06:02 -0700 From: ml-node+s1001560n6436...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Hi Carter, In Spark 1.0 there will be an

RE: K-nearest neighbors search in Spark

2014-05-28 Thread Carter
Hi Krishna, Thank you very much for your code. I will use it as a good start point. Thanks, Carter Date: Tue, 27 May 2014 16:42:39 -0700 From: ml-node+s1001560n6455...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: K-nearest neighbors search in Spark Carter, Just as a quick

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Carter
Any suggestion is very much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

K-nearest neighbors search in Spark

2014-05-26 Thread Carter
very much. Regards, Carter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: "sbt/sbt run" command returns a JVM problem

2014-05-06 Thread Carter
Hi Akhil, Thanks for your reply. I have tried this option with different values, but it still doesn't work. The Java version I am using is jre1.7.0_55; does the Java version matter in this problem? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

RE: "sbt/sbt run" command returns a JVM problem

2014-05-05 Thread Carter
left free? Wouldn't Ubuntu take up quite a big portion of 2G? Just a guess! On Sat, May 3, 2014 at 8:15 PM, Carter <[hidden email]> wrote: Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is: jav

Re: "sbt/sbt run" command returns a JVM problem

2014-05-04 Thread Carter
Hi Michael, The log after I typed "last" is as below: > last scala.tools.nsc.MissingRequirementError: object scala not found. at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655) at scala.tools.nsc.symtab.Definitions$definitions$.getModule(Defi

Re: "sbt/sbt run" command returns a JVM problem

2014-05-03 Thread Carter
Hi Michael, Thank you very much for your reply. Sorry I am not very familiar with sbt. Could you tell me where to set the Java option for the sbt fork for my program? I brought up the sbt console and ran "set javaOptions += "-Xmx1G"" in it, but it returned an error: [error] scala.tools.nsc.Miss

Re: "sbt/sbt run" command returns a JVM problem

2014-05-03 Thread Carter
Hi, thanks for all your help. I tried your setting in the sbt file, but the problem is still there. The Java setting in my sbt file is: java \ -Xmx1200m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=256m \ -jar ${JAR} \ "$@" I have tried to set these 3 parameters bigger and smaller, but no

"sbt/sbt run" command returns a JVM problem

2014-05-01 Thread Carter
Hi, I have a very simple spark program written in Scala: /*** testApp.scala ***/ object testApp { def main(args: Array[String]) { println("Hello! World!") } } Then I use the following command to compile it: $ sbt/sbt package The compilation finished successfully and I got a JAR file. But wh

RE: Need help about how hadoop works.

2014-04-24 Thread Carter
split to each node. Prashant Sharma On Thu, Apr 24, 2014 at 1:36 PM, Carter <[hidden email]> wrote: Thank you very much for your help Prashant. Sorry I still have another question about your answer: "however if the file("/home/scalatest.txt") is present on the same p

Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much for your help Prashant. Sorry I still have another question about your answer: "however if the file("/home/scalatest.txt") is present on the same path on all systems it will be processed on all nodes." When presenting the file to the same path on all nodes, do we just simply c

Re: Need help about how hadoop works.

2014-04-23 Thread Carter
Thanks Mayur. So without Hadoop and any other distributed file systems, by running: val doc = sc.textFile("/home/scalatest.txt",5) doc.count we can only get parallelization within the computer where the file is loaded, but not parallelization across the computers in the cluster (Spar

Need help about how hadoop works.

2014-04-22 Thread Carter
Hi, I am a beginner of Hadoop and Spark, and want some help in understanding how hadoop works. If we have a cluster of 5 computers, and install Spark on the cluster WITHOUT Hadoop. And then we run the code on one computer: val doc = sc.textFile("/home/scalatest.txt",5) doc.count Can the "count" t