Re: Spark Application Hung

2015-03-25 Thread Akhil Das
In production, I'd suggest having a high-availability cluster with a minimum of 3 nodes (data nodes in your case). Now let's examine your scenario: - When you suddenly bring down one of the nodes which has 2 executors running on it, what happens is that the node (DN2) will be having your jobs sh

Re: Weird exception in Spark job

2015-03-25 Thread Akhil Das
As it says, you have a jar conflict: java.lang.NoSuchMethodError: org.jboss.netty.channel.socket.nio.NioWorkerPool.(Ljava/util/concurrent/Executor;I)V Verify your classpath and check the Netty versions. Thanks Best Regards On Tue, Mar 24, 2015 at 11:07 PM, nitinkak001 wrote: > Any Ideas o

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
It says: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: localhost/127.0.0.1:51849 I'd suggest changing this property: expo

Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Hi Sparkers, I am trying to load data in Spark with the following command: sqlContext.sql("LOAD DATA LOCAL INPATH '/home/spark12/sandeep/sandeep.txt' INTO TABLE src"); Getting the exception below: Server IPC version 9 cannot communicate with client version 4. Note: I am using Hadoop 2.2 ve

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Saisai Shao
Looks like you have to build Spark against the related Hadoop version, otherwise you will hit the exception mentioned. You could follow this doc: http://spark.apache.org/docs/latest/building-spark.html 2015-03-25 15:22 GMT+08:00 sandeep vura : > Hi Sparkers, > > I am trying to load data in spark with th

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Is your Spark compiled against Hadoop 2.2? If not, download the Spark 1.2 binary built with Hadoop 2.2. Thanks Best Regards On Wed, Mar 25, 2015 at 12:52 PM, sandeep vura wrote: > Hi Sparkers, > > I am trying to load data in spark with the following command > >

Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Spico Florin
Hello! Thanks for your responses. I was afraid that due to partitioning I would lose the logic that the first element is the header. I observe that rdd.first internally calls the rdd.take(1) method. Best regards, Florin On Tue, Mar 24, 2015 at 4:41 PM, Dean Wampler wrote: > Instead of data.zipWit
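
A minimal sketch of the usual approach, assuming the header is the first line of the file (the path is a placeholder):

  val lines = sc.textFile("data.csv")   // placeholder path
  val header = lines.first()            // first() internally calls take(1)
  val data = lines.filter(_ != header)  // drop any line identical to the header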

Serialization Problem in Spark Program

2015-03-25 Thread donhoff_h
Hi experts, I wrote a very simple Spark program to test the Kryo serialization function. The code is as follows: object TestKryoSerialization { def main(args: Array[String]) { val conf = new SparkConf() conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") c
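
For reference, a minimal sketch of enabling Kryo and registering a class (the case class and names are placeholders, not the poster's code):

  import org.apache.spark.{SparkConf, SparkContext}

  case class MyRecord(id: Int, name: String)               // placeholder class to register
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array[Class[_]](classOf[MyRecord]))  // available since Spark 1.2
  val sc = new SparkContext(conf)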

Spark Maven Test error

2015-03-25 Thread zzcclp
I use the following commands to run the unit tests: ./make-distribution.sh --tgz --skip-java-test -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.3.0-cdh5.1.2 -Dhadoop.version=2.3.0-cdh5.1.2 mvn -Pscala-2.10 -Phadoop-2.3 -Phive -Phive-thriftserver -Pyarn -Dyarn.version=2.3.0-cd

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread SLiZn Liu
Hi, I am using Spark SQL to query on my Hive cluster, following Spark SQL and DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError was raised. The expected result of such que

Exception in thread "main" java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;

2015-03-25 Thread Canoe
I compile spark-1.3.0 on Hadoop 2.3.0-cdh5.1.0 with protoc 2.5.0. But when I try to run the examples, it throws: Exception in thread "main" java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$PriorityProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldS

Explanation streaming-cep-engine with example

2015-03-25 Thread Dhimant
Hi, Can someone explain how the Spark streaming CEP engine works, and how to use it with a sample example? http://spark-packages.org/package/Stratio/streaming-cep-engine -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Explanation-streaming-cep-engine-with-example-tp

OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Todd Leo
Hi, I am using Spark SQL to query on my Hive cluster, following Spark SQL and DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError was raised. The expected result of such que

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
Where do I export MAVEN_OPTS, in spark-env.sh or hadoop-env.sh? I am running the below command in the spark/yarn directory where the pom.xml file is available: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Please correct me if I am wrong. On Wed, Mar 25, 2015 at 12:55 PM, S

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Ted Yu
Can you try giving Spark driver more heap ? Cheers > On Mar 25, 2015, at 2:14 AM, Todd Leo wrote: > > Hi, > > I am using Spark SQL to query on my Hive cluster, following Spark SQL and > DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails > and java.lang.OutOfMemoryE

issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread sachin Singh
Hi , when I am submitting spark job in cluster mode getting error as under in hadoop-yarn log, someone has any idea,please suggest, 2015-03-25 23:35:22,467 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1427124496008_0028 State change from FINAL_SAVING to FAILED 2

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Just run: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.2 -DskipTests clean package Thanks Best Regards On Wed, Mar 25, 2015 at 3:08 PM, sandeep vura wrote: > Where do i export MAVEN_OPTS in spark-env.sh or hadoop-env.sh > > I am running the below command in spark/yarn directory where pom.

Spark-sql query got exception.Help

2015-03-25 Thread 李铖
Querying data from a small HDFS file works fine, but if the HDFS file is 152 MB I get this exception. I tried sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256"); the error persists. ``` com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135 at com.e

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Xi Shen
What is your environment? I remember I had similar error when "running spark-shell --master yarn-client" in Windows environment. On Wed, Mar 25, 2015 at 9:07 PM sachin Singh wrote: > Hi , > when I am submitting spark job in cluster mode getting error as under in > hadoop-yarn log, > someone ha

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
I think you meant to use "--files" to deploy the DLLs. I gave it a try, but it did not work. From the Spark UI, Environment tab, I can see spark.yarn.dist.files file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.

Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Hi, I have started implementing a machine learning pipeline using Spark 1.3.0 and the new pipelining API and DataFrames. I got to a point where I have my training data set prepared using a sequence of Transformers, but I am struggling to actually train a model and use it for predictions.

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Sean Owen
NoSuchMethodError in general means that your runtime and compile-time environments are different. I think you need to first make sure you don't have mismatching versions of Spark. On Wed, Mar 25, 2015 at 11:00 AM, wrote: > Hi, > > I have started implementing a machine learning pipeline using Spa

Re: Spark-sql query got exception.Help

2015-03-25 Thread Cheng Lian
Could you please provide the full stack trace? On 3/25/15 6:26 PM, 李铖 wrote: It is ok when I do query data from a small hdfs file. But if the hdfs file is 152m,I got this exception. I try this code .'sc.setSystemProperty("spark.kryoserializer.buffer.mb",'256')'.error still. ``` com.esoterics

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sachin Singh
The OS I am using is Linux. When I run simply with master yarn, it runs fine. Regards Sachin On Wed, Mar 25, 2015 at 4:25 PM, Xi Shen wrote: > What is your environment? I remember I had similar error when "running > spark-shell --master yarn-client" in Windows environment. > > > On Wed, Mar 25,

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Sean, thanks for your response. I am familiar with NoSuchMethodException in general, but I think that is not the case this time. The code actually attempts to get a parameter by name using val m = this.getClass.getMethodName(paramName). This may be a bug, but it is only a side effect caused b

Re: Spark-sql query got exception.Help

2015-03-25 Thread 李铖
Here is the full track 15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.w

How to randomise data on spark

2015-03-25 Thread critikaled
How to randomise data across all partitions and merge them into one. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-randomise-data-on-spark-tp2.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Using ORC input for mllib algorithms

2015-03-25 Thread Zsolt Tóth
Hi, I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class) to use data in ORC format as an RDD. I made some benchmarking on ORC input vs Text input for MLlib and I ran into a few issues with ORC. Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory

How do you write Dataframes to elasticsearch

2015-03-25 Thread yamanoj
It seems that elasticsearch-spark_2.10 does not currently support Spark 1.3. Could you tell me if there is an alternative way to save DataFrames to Elasticsearch? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp2

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
The build failed with the following errors. I executed the following command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package [INFO] [INFO] BUILD FAILURE [INFO]

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
-Dhadoop.version=2.2 Thanks Best Regards On Wed, Mar 25, 2015 at 5:34 PM, sandeep vura wrote: > Build failed with following errors. > > I have executed the below following command. > > * mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean > package* > > > [INFO] >

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread sandeep vura
I am using Hadoop 2.4; should I mention -Dhadoop.version=2.2? $ hadoop version Hadoop 2.4.1 Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1604318 Compiled by jenkins on 2014-06-21T05:43Z Compiled with protoc 2.5.0 From source

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Akhil Das
Oh, in that case you should mention 2.4. If you don't want to compile Spark, you can download the precompiled version from the Downloads page: http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz Thanks Best Regards On Wed, Mar 25, 2015 at

Re: Server IPC version 9 cannot communicate with client version 4

2015-03-25 Thread Sean Owen
Of course, VERSION is supposed to be replaced by a real Hadoop version! On Wed, Mar 25, 2015 at 12:04 PM, sandeep vura wrote: > Build failed with following errors. > > I have executed the below following command. > > mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package > >
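
For example, with the Hadoop 2.4.1 install reported elsewhere in this thread, the placeholder would be filled in roughly like this (a hedged example, not the exact command verified against the poster's setup):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package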

Re: EC2 Having script run at startup

2015-03-25 Thread rahulkumar-aws
You can use the AWS user-data feature. Try this and see if it helps: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html - Software Developer SigmoidAnalytics, Bangalore -- View this message in context: http://

Re: 1.3 Hadoop File System problem

2015-03-25 Thread Jim Carroll
Thanks Patrick and Michael for your responses. For anyone else that runs across this problem prior to 1.3.1 being released, I've been pointed to this Jira ticket that's scheduled for 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 Thanks again. -- View this message in context: http:

NetworkWordCount + Spark standalone

2015-03-25 Thread James King
I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work; the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However, if I use conf.setMaster("local

Re: NetworkWordCount + Spark standalone

2015-03-25 Thread Akhil Das
Spark Streaming requires a minimum of 2 cores, 1 for receiving your data and the other for processing. So when you say local[2] it basically initializes 2 threads on your local machine, 1 for receiving data from the network and the other for your word count processing. Thanks Best Regards On
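
A minimal sketch of what that looks like when constructing the context (assuming a local run; in standalone mode the application must be granted at least 2 cores):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)  // one core receives, the other processes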

Re: Spark-sql query got exception.Help

2015-03-25 Thread Cheng Lian
Oh, just noticed that you were calling sc.setSystemProperty. Actually you need to set this property in SparkConf or in spark-defaults.conf. And there are two configurations related to Kryo buffer size: * spark.kryoserializer.buffer.mb, which is the initial size, and * spark.kryoserializer.b
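
A minimal sketch of setting both in SparkConf, assuming the pre-1.4 property names used by Spark 1.2/1.3 (the values are arbitrary examples):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.kryoserializer.buffer.mb", "256")      // initial Kryo buffer size, in MB
    .set("spark.kryoserializer.buffer.max.mb", "512")  // maximum Kryo buffer size, in MB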

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Anirudha Jadhav
Is there a way to have this dynamically pick the local IP? Static assignment does not work because the workers are dynamically allocated on Mesos. On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das wrote: > It says: > ried to associate with unreachable remote address > [akka.tcp://sparkDriver@localhost:51

Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
Remove SPARK_LOCAL_IP then? Thanks Best Regards On Wed, Mar 25, 2015 at 6:45 PM, Anirudha Jadhav wrote: > is there a way to have this dynamically pick the local IP. > > static assignment does not work cos the workers are dynamically allocated > on mesos > > On Wed, Mar 25, 2015 at 3:04 AM, Akh

Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Yes, I do have other applications already running. Thanks for your explanation. On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das wrote: > It means you are already having 4 applications running on 4040, 4041, > 4042, 4043. And that's why it was able to run on 4044. > > You can do a *netstat -pnat | gr

Re: NetworkWordCount + Spark standalone

2015-03-25 Thread James King
Thanks Akhil, Yes indeed, this is why it works when using local[2], but I'm unclear why it doesn't work when using the standalone daemons. Is there a way to check what cores are being seen when running against standalone daemons? I'm running the master and worker on the same Ubuntu host. The Driver progr

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko
Hi Martin, here are two possibilities to overcome this: 1) Put your logic into the org.apache.spark package in your project - then everything would be accessible. 2) Dirty trick: object SparkVector extends HashingTF { val VectorUDT: DataType = outputDataType } then you can do like this: Struct

JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread ankur.jain
Hi, I am trying to run a Spark on YARN program provided by Spark in the examples directory, using Amazon Kinesis on an EMR cluster. I am using Spark 1.3.0 and EMR AMI 3.5.0. I've set up the credentials: export AWS_ACCESS_KEY_ID=XX export AWS_SECRET_KEY=XXX A) This is the Kinesis Word Count

Re: NetworkWordCount + Spark standalone

2015-03-25 Thread Akhil Das
You can open the Master UI running on 8080 port of your ubuntu machine and after submitting the job, you can see how many cores are being used etc from the UI. Thanks Best Regards On Wed, Mar 25, 2015 at 6:50 PM, James King wrote: > Thanks Akhil, > > Yes indeed this is why it works when using l

foreachRDD execution

2015-03-25 Thread Luis Ángel Vicente Sánchez
I have a simple and probably dumb question about foreachRDD. We are using spark streaming + cassandra to compute concurrent users every 5min. Our batch size is 10secs and our block interval is 2.5secs. At the end of the world we are using foreachRDD to join the data in the RDD with existing data

Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
Did you build for Kinesis using the profile -Pkinesis-asl? On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain wrote: > Hi, > I am trying to run a Spark on YARN program provided by Spark in the > examples > directory using Amazon Kinesis on EMR cluster : > I am using Spark 1.3.0 and EMR AMI: 3.5.0 > > I've

Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin, It appears that the application master is failing. To figure out what's wrong you need to get the logs for the application master. -Sandy On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh wrote: > OS I am using Linux, > when I will run simply as master yarn, its running fine, > > Regar

What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
I have a SparkSQL dataframe with a few billion rows that I need to quickly filter down to a few hundred thousand rows, using an operation like (syntax may not be correct) df = df[ df.filter(lambda x: x.key_col in approved_keys)] I was thinking about serializing the data using parquet and saving

Re: How do you write Dataframes to elasticsearch

2015-03-25 Thread Nick Pentreath
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon:  https://github.com/elastic/elasticsearch-hadoop/issues/400 However in the meantime you could use df.toRDD.saveToEs - though you may have to manipulate the Row object perhaps to extract fields, not sure if it will

Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events at a latency lower than 0.25 ms, meaning I would have to make sure that Spark Streaming gets new events from the messaging layer within that period of time. Has anyone achieved such numbers using a Spark cluster? Or would thi

Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Wang, Ningjun (LNG-NPV)
Hi I ran a spark job and got the following error. Can anybody tell me how to work around this problem? For example how can I increase spark.driver.maxResultSize? Thanks. org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 128 tasks (1029.1 MB)

Re: Spark Streaming - Minimizing batch interval

2015-03-25 Thread Sean Owen
I don't think it's feasible to set a batch interval of 0.25ms. Even at tens of ms the overhead of the framework is a large factor. Do you mean 0.25s = 250ms? Related thoughts, and I don't know if they apply to your case: If you mean, can you just read off the source that quickly? yes. Sometimes

Write Parquet File with spark-streaming with Spark 1.3

2015-03-25 Thread richiesgr
Hi, I've succeeded in writing a Kafka stream to a Parquet file with Spark 1.2, but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't add data to an existing Parquet file. I know that it's possible to stream data directly into Parquet; could you help me by providing

Re: Spark as a service

2015-03-25 Thread Irfan Ahmad
You're welcome. How did it go? *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee < ashish.mukher...@gmail.c

upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using Spark version 1.2.1, and I have an SBT project. Now I want to upgrade to Spark 1.3 and use the new features. Below are the issues. Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
What version of Spark do the other dependencies rely on (ADAM and H2O)? That could be it. Or try sbt clean compile. — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni wrote: > I have a EC2 cluster created using spark version 1.2.1. > And I have a SBT project . > Now I want to upg

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2O and ADA are dependent on 1.2.1, it should be backward compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same error. Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath wrote: > What versio

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be running spark 1.3 on your EC2 cluster in order to run programs built against spark 1.3. You will need to terminate and restart your cluster with spark 1.3  — Sent from Mailbox On Wed, Mar 25, 2015 at 6:39 PM, ron

RE: Date and decimal datatype not working

2015-03-25 Thread BASAK, ANANDA
Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using the following: val MyDataset = sqlContext.sql("my select query") MyDataset.map(t => t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).sav

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

[no subject]

2015-03-25 Thread Himanish Kushary
Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used once. So something like (A,B) (C,D) (E,F). (B,C) is out because B has already been used in (A,B), (A,D)

Recovered state for updateStateByKey and incremental streams processing

2015-03-25 Thread Ravi Reddy
I want to use the "restore from checkpoint" to continue from last accumulated word counts and process new streams of data. This recovery process will keep accurate state of accumulated counters state (calculated by updateStateByKey) after "failure/recovery" or "temp shutdown/upgrade to new code".

python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi guys, I'm running the following function with spark-submit and the OS is killing my process: def getRdd(self,date,provider): path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz' log2= self.sqlContext.jsonFile(path) log2.registerTempTable('log_test') log2.cache() out=self.sqlConte

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section noting that you can modify these properties via the spark-defaults.conf or via SparkConf(). HTH!
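
A minimal sketch of bumping it via SparkConf (the value is an arbitrary example; it can equally be set in spark-defaults.conf):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.driver.maxResultSize", "2g")  // default is 1g; 0 means unlimited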

Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables I modified the Hive query but ran into the same error. (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables) val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("CREAT

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread ๏̯͡๏
I have a YARN cluster where the maximum memory allowed is 16 GB. I set 12 GB for my driver, however I see an OutOfMemory error even for this program http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables . What do you suggest? On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber wrote: > So,

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am facing same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang wrote: > Hi Folks, > > I am trying to run hive context in yarn-cluster mode, but met some error. > Does anybody know what cause the issue. > > I use following cmd to build the distribution: >

Re: OutOfMemory : Java heap space error

2015-03-25 Thread ๏̯͡๏
I am facing same issue, posted a new thread. Please respond. On Wed, Jul 9, 2014 at 1:56 AM, Rahul Bhojwani wrote: > Hi, > > My code was running properly but then it suddenly gave this error. Can you > just put some light on it. > > ### > 0 KB, free: 38.7 MB) > 14/07/09 01:46

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
Unfortunately you are now hitting a bug (that is fixed in master and will be released in 1.3.1 hopefully next week). However, even with that your query is still ambiguous and you will need to use aliases: val df_1 = df.filter( df("event") === 0) . select("country", "cnt").as("a"

Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver: -XX:MaxPermSize=512m Thanks. Zhan Zhang On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) mailto:deepuj...@gmail.com>> wrote: I am facing same issue, posted a new thread. Please respond. On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang mailt
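
A hedged example of how that flag is typically passed: since it affects the driver JVM it usually has to be supplied at launch time, e.g. via spark-submit's --driver-java-options in client mode, or through the corresponding configuration property in cluster mode:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=512m")  // picked up when the driver is launched by the cluster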

Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu wrote: > Can you try giving Spark driver more heap ? > > Cheers > > > > On Mar 25, 2015, at 2:14 AM, Todd Leo wrote: > > Hi, > > I am using *Spark SQL* to query on my *Hive cluster*, f

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do "in" using python currently is to use the string based filter API (where you pass us an expression as a string, and we parse it using our SQL parser). from pyspark.sql import Row from pyspark.sql.functions import * df = sc.parallelize([Row(name="test")]).toDF() df.filter("name

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running? There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3, [1] https://issues.apache.org/jira/browse/SPARK-6055 On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa wrote: > Hi Guys, I running the following function with spark-submmit and de SO is

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
Thanks for the response, I was using IN as an example of the type of operation I need to do. Is there another way to do this that lines up more naturally with the way things are supposed to be done in SparkSQL? On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust wrote: > The only way to do "in" us

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread Matt Silvey
This is a different kind of error. Thomas' OOM error was specific to the kernel refusing to create another thread/process for his application. Matthew On Wed, Mar 25, 2015 at 10:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have a YARN cluster where the max memory allowed is 16GB. I set 12G for > my driver,

Re: python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Davies, I'm running 1.1.0. Now I'm following this thread, which recommends using the batchsize parameter = 1: http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html If this does not work I will install 1.2.1 or 1.3. Regards On Wed, Mar 25, 2015 at 3:39 PM, Davies Li

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi, Thanks for your response. I am not clear about why the query is ambiguous. val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer") I thought df_2("country")===df_1("country") indicates that the country field in the 2 dataframes should match and df_2("country") is the equ

Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
With batchSize = 1, I think it will become even worse. I'd suggest going with 1.3 and having a taste of the new DataFrame API. On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa wrote: > Hi Davies, I running 1.1.0. > > Now I'm following this thread that recommend use batchsize parameter = 1 > > > http:/

Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Stuart Layton
I'm trying to save a dataframe to s3 as a parquet file but I'm getting Wrong FS errors >>> df.saveAsParquetFile(parquetFile) 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called with curMem=82744, maxMem=278302556 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 s

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick. So, I removed ADAM and H2O from my SBT file as I was not using them. I got the code to compile - only to fail while running with: SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache

Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary wrote: > Hi, > > I have a RDD of pairs of strings like below : > > (A,B) > (B,C) > (C,D) > (A,D) > (E,F) > (B,F) > > I need to transform/filter this into a RDD of pairs that does

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
That's a good question. In this particular example, it is really only internal implementation details that make it ambiguous. However, fixing this was a very large change, so we have deferred it to Spark 1.4 and instead print a warning now when you construct trivially equal expressions. I can try t

Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Khandeshi, Ami
I am seeing the same behavior. I have enough resources. How do I resolve it? Thanks, Ami

Re:

2015-03-25 Thread Himanish Kushary
It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be (A,B),(A,C),(A,D),(B,C),(B,D),(C,D). On successful filtering using the original condition it will transform to (A,B) and (C,D). On Wed, Mar 25, 2015 at 3:00 PM
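
A minimal greedy sketch of that filtering (assuming the pairs fit on the driver; the result depends on the order in which pairs are visited):

  val pairs = Seq(("A","B"), ("B","C"), ("C","D"), ("A","D"), ("E","F"), ("B","F"))
  val (used, kept) = pairs.foldLeft((Set.empty[String], List.empty[(String, String)])) {
    case ((seen, acc), (a, b)) =>
      if (seen(a) || seen(b)) (seen, acc)   // skip pairs that reuse an already-used string
      else (seen + a + b, (a, b) :: acc)    // keep the pair and mark both strings as used
  }
  kept.reverse                              // List((A,B), (C,D), (E,F))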

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
This will be fixed in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 and is fixed in master/branch-1.3 if you want to compile from source On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton wrote: > I'm trying to save a dataframe to s3 as a parquet file but I'm getting > Wrong FS err

Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
Until then you can try sql("SET spark.sql.parquet.useDataSourceApi=false") On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust wrote: > This will be fixed in Spark 1.3.1: > https://issues.apache.org/jira/browse/SPARK-6351 > > and is fixed in master/branch-1.3 if you want to compile from source >

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Marcelo Vanzin
That probably means there are not enough free resources in your cluster to run the AM for the Spark job. Check your RM's web UI to see the resources you have available. On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami wrote: > I am seeing the same behavior. I have enough resources….. How do I re

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
My example is a totally reasonable way to do it; it just requires constructing strings. In many cases you can also do it with column objects: df[df.name == "test"].collect() Out[15]: [Row(name=u'test')] You should check out: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Weird. Are you running using SBT console? It should have the spark-core jar on the classpath. Similarly, spark-shell or spark-submit should work, but be sure you're using the same version of Spark when running as when compiling. Also, you might need to add spark-sql to your SBT dependencies, but th
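
A hedged SBT snippet along those lines, assuming Spark 1.3.0 artifacts and that the cluster provides Spark at runtime:

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
    "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided"
  )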

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread EcoMotto Inc.
Hello Jeremy, Sorry for the delayed reply! The first issue was resolved; I believe it was just a production/consumption rate problem. Regarding the second question, I am streaming the data from the file and there are about 38k records. I am sending the streams in the same sequence as I am reading

Re: newbie quesiton - spark with mesos

2015-03-25 Thread Dean Wampler
I think the problem is the use the loopback address: export SPARK_LOCAL_IP=127.0.0.1 In the stack trace from the slave, you see this: ... Reason: Connection refused: localhost/127.0.0.1:51849 akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:5

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi, Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's because the file is not closed, i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code: myStream.foreachRDD(rdd => { rdd.foreach(x => {

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on spark 1.2 and in SBT I am using 1.3. So probably it is compiling with 1.3 but running with 1.2 ? On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler wrote: > Weird. Are you running using SBT console? It should have the spark-core > jar on the classpath. Similarly, spark-shell o

Re: Date and decimal datatype not working

2015-03-25 Thread Dean Wampler
Recall that the input isn't actually read until you do something that forces evaluation, like calling saveAsTextFile. You didn't show the whole stack trace here, but it probably occurred while parsing an input line where one of your long fields is actually an empty string. Because this is such a commo
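
A minimal sketch of guarding such a field (assuming an empty value should fall back to a default rather than abort the job):

  def toLongOrElse(s: String, default: Long = 0L): Long =
    if (s == null || s.trim.isEmpty) default else s.trim.toLong
  // e.g. call toLongOrElse on the numeric columns produced by splitting each "|"-delimited line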

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

trouble with jdbc df in python

2015-03-25 Thread elliott cordo
If I run the following: db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx", dbtables="mstr.d_customer") I get the error: py4j.protocol.Py4JJavaError: An error occurred while calling o28.load. : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does not exist Seem

Exception Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration

2015-03-25 Thread varvind
Hi, I am running Spark on Mesos and getting this error. Can anyone help me resolve this? Thanks 15/03/25 21:05:00 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Sour

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
Hi Arunkumar, I think L-BFGS will not work since L-BFGS algorithm assumes that the objective function will be always the same (i.e., the data is the same) for entire optimization process to construct the approximated Hessian matrix. In the streaming case, the data will be changing, so it will caus
