Re: NA value handling in sparkR

2016-01-26 Thread Deborah Siegel
setosa, virginica would be created with 0 and 1 as > values. On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel wrote: > Maybe not ideal, but since read.df is inferring all columns from the csv > containing "NA" as type of strings, one could filter them rather than using > dropna()
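A minimal sketch of the filtering approach described in the thread, assuming a SparkR DataFrame df read via read.df whose columns came in as strings (the column name Sepal_Length is an assumption for illustration):

  # Rows holding the literal string "NA" can be filtered out directly;
  # dropna() would miss them because they are not true nulls.
  filtered <- filter(df, df$Sepal_Length != "NA")
  # The surviving column can then be cast back to a numeric type.
  filtered$Sepal_Length <- cast(filtered$Sepal_Length, "double")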

Re: NA value handling in sparkR

2016-01-25 Thread Deborah Siegel
I think the problem is with reading of csv files. read.df is not > considering NAs in the CSV file > > So what would be a workable solution in dealing with NAs in csv files? > > > > On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel > wrote: > >> Hi Devesh, >>
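One possible fix at read time, sketched under the assumption that the spark-csv package is available and that the version in use supports its nullValue option (the file name and sqlContext are placeholders):

  # Map the literal string "NA" to a real null while reading, so that
  # dropna() and isNull() behave as expected afterwards.
  df <- read.df(sqlContext, "iris_na.csv",
                source = "com.databricks.spark.csv",
                header = "true", inferSchema = "true", nullValue = "NA")
  cleaned <- dropna(df)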

SparkR pca?

2015-09-18 Thread Deborah Siegel
Hi, Can PCA be implemented in a SparkR-MLlib integration? Perhaps there are 2 separate issues: 1) Having the methods in SparkRWrapper and RFormula which will send the right input types through the pipeline. MLlib PCA operates either on a RowMatrix, or on the feature vector of an RDD[LabeledPoint]. The label
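Until such an integration exists, one workaround is to do the PCA locally; a sketch, assuming df is a SparkR DataFrame of numeric columns small enough to fit in driver memory:

  # Pull the distributed data back to the driver and use base R's prcomp.
  local_df <- collect(df)
  pca <- prcomp(local_df, center = TRUE, scale. = TRUE)
  summary(pca)  # proportion of variance explained per component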

Re: SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
in-hadoop2.4/bin/spark-submit` exists? The > error message seems to indicate it is trying to pick up Spark from > that location and can't seem to find Spark installed there. > > Thanks > Shivaram > > On Thu, Aug 20, 2015 at 3:30 PM, Deborah Siegel > wrote: > > He
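A quick way to check that suggestion from R; a sketch, with the path taken from the original post as an assumption:

  spark_home <- path.expand("~/software/spark-1.4.1-bin-hadoop2.4")
  # Should print TRUE if spark-submit is where SparkR expects it.
  print(file.exists(file.path(spark_home, "bin", "spark-submit")))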

SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
Hello, I have previously successfully run SparkR in RStudio, with: >Sys.setenv(SPARK_HOME="~/software/spark-1.4.1-bin-hadoop2.4") >.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) >library(SparkR) >sc <- sparkR.init(master="local[2]",appName="SparkR-example") Then I tr
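One hedged guess, since the snippet sets SPARK_HOME to a path starting with "~": the tilde is not always expanded when SparkR shells out to launch the JVM, so expanding it first (or using an absolute path) may help:

  # path.expand resolves "~" before the value reaches the JVM launcher.
  Sys.setenv(SPARK_HOME = path.expand("~/software/spark-1.4.1-bin-hadoop2.4"))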

Re: SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
I think I just answered my own question. The privatization of the RDD API might have resulted in my error, because this worked: > randomMatBr <- SparkR:::broadcast(sc, randomMat) On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel wrote: > Hello, > > In looking at the SparkR codeba
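Put together, the workaround looks roughly like this; a sketch, where SparkR:::value is assumed to be the matching private accessor for reading the broadcast back:

  randomMat <- matrix(nrow = 10, ncol = 10, data = rnorm(100))
  # The ::: operator reaches the now-private broadcast API.
  randomMatBr <- SparkR:::broadcast(sc, randomMat)
  # Inside a function shipped to the workers, the payload would be read
  # with SparkR:::value(randomMatBr).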

SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
Hello, In looking at the SparkR codebase, it seems as if broadcast variables ought to be working based on the tests. I have tried the following in the sparkR shell, and similar code in RStudio, but in both cases got the same message: > randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100)) > randomMa

contributing code - how to test

2015-04-24 Thread Deborah Siegel
Hi, I selected a "starter task" in JIRA, and made changes to my github fork of the current code. I assumed I would be able to build and test. % mvn clean compile was fine, but % mvn package failed: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test

Re: Setting up Spark with YARN on EC2 cluster

2015-03-10 Thread Deborah Siegel
Harika, I think you can modify an existing spark on ec2 cluster to run YARN mapreduce; not sure if this is what you are looking for. To try: 1) log on to the master 2) go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/ and add this to mapred-site.xml:
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

ec2 persistent-hdfs with ebs using spot instances

2015-03-10 Thread Deborah Siegel
Hello, I'm new to ec2. I've set up a spark cluster on ec2 and am using persistent-hdfs with the data nodes mounting ebs. I launched my cluster using spot instances: ./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z us-east-1c --spark-version=1.2.0 --spot-price=.0321 --hadoop-maj

documentation - graphx-programming-guide error?

2015-03-01 Thread Deborah Siegel
Hello, I am running through examples given on http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html The section "Map Reduce Triplets Transition Guide (Legacy)" indicates that one can run the following .aggregateMessages code: val graph: Graph[Int, Float] = ... def msgFun(triplet: Edg

Re: Number of cores per executor on Spark Standalone

2015-03-01 Thread Deborah Siegel
Hi, Someone else will have a better answer. I think that for standalone mode, executors will grab whatever cores they can, based either on configurations on the worker or on application-specific configurations. Could be wrong, but I believe mesos is similar to this, and that YARN is alone in the abil
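For standalone mode specifically, spark.cores.max caps the total cores one application grabs; a sketch from SparkR, where the master URL and the value "4" are assumptions:

  sc <- sparkR.init(master = "spark://master:7077",
                    appName = "core-cap-example",
                    sparkEnvir = list(spark.cores.max = "4"))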

Re: Running spark function on parquet without sql

2015-02-27 Thread Deborah Siegel
Hi Michael, Would you help me understand the apparent difference here? The Spark 1.2.1 programming guide indicates: "Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore sqlContext.cacheTa
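For comparison, the table-based route the guide describes, sketched here in SparkR for illustration (the table name "people" and sqlContext are assumptions):

  # cacheTable stores the table in the in-memory columnar format;
  # calling cache() on the SchemaRDD/DataFrame itself may not.
  registerTempTable(df, "people")
  cacheTable(sqlContext, "people")
  # ... run queries against "people", then release the cache ...
  uncacheTable(sqlContext, "people")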

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Deborah Siegel
Hi Abe, I'm new to Spark as well, so someone else could answer better. A few thoughts, which may or may not be the right line of thinking: 1) Spark properties can be set on the SparkConf, and with flags in spark-submit, but settings on SparkConf take precedence. I think your jars flag for spark-su
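On the jars point, SparkR exposes a similar knob at context creation; a sketch, with the jar path as a placeholder:

  # Comparable in spirit to spark-submit's --jars flag.
  sc <- sparkR.init(master = "local[2]",
                    appName = "jars-example",
                    sparkJars = "/path/to/myLib.jar")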

Re: My first experience with Spark

2015-02-05 Thread Deborah Siegel
Hi Yong, Have you tried increasing your level of parallelism? How many tasks are you getting in the failing stage? 2-3 tasks per CPU core is recommended, though maybe you need more for your shuffle operation. You can configure spark.default.parallelism, or pass in a level of parallelism as a second par
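A sketch of that first suggestion from SparkR, where the value "12" is an assumption following the 2-3 tasks per core guidance on a 4-core box:

  # Raise the default parallelism used for shuffles and reduces.
  sc <- sparkR.init(master = "local[4]",
                    appName = "parallelism-example",
                    sparkEnvir = list(spark.default.parallelism = "12"))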