Adding new Spark workers on AWS EC2 - access error

2015-06-03 Thread barmaley
I have an existing, operating Spark cluster that was launched with the spark-ec2 script. I'm trying to add a new slave by following these instructions: stop the cluster; on the AWS console, "launch more like this" on one of the slaves; start the cluster. Although the new instance is added to the same security group

Re: Adding new Spark workers on AWS EC2 - access error

2015-06-04 Thread barmaley
The issue was that the SSH key generated on the Spark master was not transferred to the new slave. The spark-ec2 script's `start` command omits this step. The solution is to use the `launch` command with the `--resume` option; then the SSH key is transferred to the new slave and everything goes smoothly.
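
For reference, the fix described above would look roughly like this on the command line (keypair and cluster names are illustrative, not from the thread; `--resume` tells spark-ec2 to re-run setup, including SSH key distribution, on the existing instances instead of provisioning new ones):

```shell
# Re-run cluster setup on existing instances so the new slave
# receives the master's SSH key (names are hypothetical).
./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem launch my-cluster --resume
```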

Re: Required settings for permanent HDFS Spark on EC2

2015-06-04 Thread barmaley
Hi - I'm having a similar problem switching from ephemeral to persistent HDFS - it always looks for port 9000 regardless of the options I set for port 9010 for persistent HDFS. Have you figured out a solution? Thanks
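
As a sketch of where this setting usually lives (assuming the standard spark-ec2 layout, not confirmed in this thread): the persistent-HDFS configuration on the master is typically under `/root/persistent-hdfs/conf/`, and the NameNode port comes from `fs.default.name` in `core-site.xml`:

```xml
<!-- /root/persistent-hdfs/conf/core-site.xml (illustrative) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://MASTER_HOSTNAME:9010</value>
</property>
```

If a client still contacts port 9000, it is usually picking up the ephemeral-HDFS `core-site.xml` on its classpath instead of the persistent one.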

Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread barmaley
Launching using the spark-ec2 script results in:

Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
<...>
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond:

takeSample() results in two stages

2015-06-11 Thread barmaley
I've observed interesting behavior in Spark 1.3.1, the reason for which is not clear to me. Doing something as simple as sc.textFile("...").takeSample(...) always results in two stages.
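
A likely explanation (based on the Spark 1.x RDD source, not stated in the thread): `takeSample` first runs an internal `count()` job to determine the sampling fraction, then a second job to collect the sampled elements, so two stages are expected:

```scala
// Sketch: takeSample internally calls count() before sampling,
// which shows up in the Spark UI as a separate stage.
val lines = sc.textFile("data.txt")
val sample = lines.takeSample(withReplacement = false, num = 10)
// Stage 1: the internal count(); Stage 2: collecting the sample.
```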

Akka failures: Driver Disassociated

2015-06-24 Thread barmaley
I'm running Spark 1.3.1 on AWS... I have a long-running application (Spark context) which accepts and completes jobs fine. However, it crashes at seemingly random times (anywhere from 1 hour up to 6 days)... In the latter case, the context ran and finished hundreds of jobs without an issue and then s

[POWERED BY] Please add our organization

2015-09-23 Thread barmaley
Name: Frontline Systems Inc. URL: www.solver.com Description: • We built an interface between Microsoft Excel and Apache Spark - bringing Big Data from the clusters to Excel, enabling tools ranging from simple charts and Power View dashboards to add-ins for machine learning and predictive analytics

Re: Add row IDs column to data frame

2015-04-08 Thread barmaley
Hi Bojan, Could you please expand your idea on how to append to an RDD? I can think of how to append a constant value to each row of an RDD:

// oldRDD: RDD[Array[String]]
val c = "const"
val newRDD = oldRDD.map(r => c +: r)

But how to append a custom column to an RDD? Something like: val colToAppend = sc.ma
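
One standard way to get per-row IDs without building a separate column RDD (a sketch, not from this thread) is `zipWithIndex`, which pairs each element with its index:

```scala
// Sketch: append a row index to each record of an RDD[Array[String]]
// using zipWithIndex. Data and names are illustrative.
val oldRDD = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))

val withIds = oldRDD.zipWithIndex.map { case (row, idx) =>
  row :+ idx.toString // append the index as a new "column"
}
```

Note that `zipWithIndex` triggers a Spark job to compute per-partition counts when the RDD has more than one partition.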

Spark-csv data source: infer data types

2015-04-18 Thread barmaley
I'm experimenting with the Spark-CSV package (https://github.com/databricks/spark-csv) for reading CSV files into Spark DataFrames. Everything works, but all columns are assumed to be of StringType. As shown in the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), fo
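
Later versions of spark-csv added an `inferSchema` option for exactly this; a sketch assuming spark-csv 1.1+ and Spark 1.4+ (the file path is hypothetical):

```scala
// Sketch: with inferSchema = "true", spark-csv scans the data to guess
// column types instead of defaulting every column to StringType.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("cars.csv")
```

Note that schema inference requires an extra pass over the data, so it costs a second read of the file.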

Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread barmaley
Given a table registered from a data frame, I'm able to execute queries like sqlContext.sql("SELECT STDDEV(col1) FROM table") from the Spark shell just fine. However, when I run exactly the same code in a standalone app on a cluster, it throws an exception: "java.util.NoSuchElementException: key not found
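
One plausible cause (an assumption based on typical Spark 1.x behavior, not confirmed in the thread): the plain SQLContext in 1.x does not provide a STDDEV aggregate, while the shell's `sqlContext` is a HiveContext when Spark is built with Hive, which exposes Hive's `stddev` UDAF. Using a HiveContext in the standalone app would then be the fix:

```scala
// Sketch: create a HiveContext in the standalone app so Hive UDAFs
// such as stddev are available (df is assumed to exist already).
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
df.registerTempTable("table")
sqlContext.sql("SELECT STDDEV(col1) FROM table").show()
```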