spark stages in parallel

2016-02-17 Thread Shushant Arora
Can two stages of a single job run in parallel in Spark? E.g. one stage is a map transformation and another is a repartition on the mapped RDD: rdd.map(function,100).repartition(30). Can it happen that, while the map transformation is running 100 tasks, after a few of them (say 10) are finished, spark starte

Re: Running multiple foreach loops

2016-02-17 Thread radoburansky
Why would you expect performance degradation? On Wed, Feb 17, 2016 at 10:30 PM, Daniel Imberman [via Apache Spark User List] wrote: > Hi all, > > So I'm currently figuring out how to accumulate three separate > accumulators: > > val a:Accumulator > val b:Accumulator > val c:Accumulator > > I hav

Re: New line lost in streaming output file

2016-02-17 Thread Ashutosh Kumar
This works fine with Kafka and Flink. However, when I do it with Spark, the newline feed gets removed. On Tue, Feb 16, 2016 at 4:29 PM, UMESH CHAUDHARY wrote: > Try to print RDD before writing to validate that you are getting '\n' from > Kafka. > > On Tue, Feb 16, 2016 at 4:19 PM, Ashutosh Kumar >

Re: Running multiple foreach loops

2016-02-17 Thread Sabarish Sasidharan
I don't think that's a good idea. Even if it wasn't in Spark. I am trying to understand the benefits you gain by separating. I would rather use a Composite pattern approach wherein adding to the composite cascades the additive operations to the children. Thereby your foreach code doesn't have to k
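As a rough illustration of the composite idea (not code from this thread; the names and the Int value type are assumptions), the wrapper cascades a single add to every child accumulator, so the foreach only ever touches one object:

    import org.apache.spark.Accumulator

    // Hypothetical composite that forwards one addition to every child accumulator.
    class CompositeAccumulator(children: Seq[Accumulator[Int]]) extends Serializable {
      def add(v: Int): Unit = children.foreach(_ += v)
    }

    // val composite = new CompositeAccumulator(Seq(a, b, c))
    // r.foreach(thing => composite.add(thing))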

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Ted Yu
HBase 1.2.0 is being voted upon. It should come out soon. 1.3.0 may carry backports of features from master branch: MOB, Spark module. Looks like 1.3.0 would come out early summer. On Wed, Feb 17, 2016 at 2:57 PM, Benjamin Kim wrote: > Ted, > > Any idea as to when this will be released? > > Tha

adding a split and union to a streaming application cause big performance hit

2016-02-17 Thread ramach1776
We have a streaming application containing approximately 12 jobs every batch, running in streaming mode (4 sec batches). Each job has several transformations and 1 action (output to cassandra) which causes the execution of the job (DAG) For example the first job, job 1 ---> receive Stream A -->

Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Saisai Shao
IIUC, if for example you want to set the environment variable FOO=bar on the executor side, you could use "spark.executorEnv.FOO=bar" in the conf file; the AM will pick up this configuration and set it as an environment variable when launching containers. Just list all the envs you want to set on the executor side like spark.executorEnv.xx
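For reference, a minimal sketch of the property form described above (the variable name and value are just examples):

    // in code, via SparkConf; the equivalent conf-file line is: spark.executorEnv.FOO  bar
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executorEnv.FOO", "bar")   // exported as FOO=bar in each executor's environment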

Opaque error in Spark - Windows

2016-02-17 Thread KhajaAsmath Mohammed
Hi, I am new to Spark and started learning by executing some programs on Windows. I am getting the below error when executing spark-submit, but the program always runs with the desired output. The worker log always complains with the below error and status FAILED. Can you please help me out with this? j

Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Soumya Simanta
Can you give some examples of what variables you are trying to set ? On Thu, Feb 18, 2016 at 1:01 AM, Lin Zhao wrote: > I've been trying to set some environment variables for the spark executors > but haven't had much like. I tried editting conf/spark-env.sh but it > doesn't get through to the

Re: Memory issues on spark

2016-02-17 Thread Jonathan Kelly
(I'm not 100% sure, but...) I think the SPARK_EXECUTOR_* environment variables are intended to be used with Spark Standalone. Even if not, I'd recommend setting the corresponding properties in spark-defaults.conf rather than in spark-env.sh. For example, you may use the following Configuration obj
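A sketch of what the corresponding spark-defaults.conf entries might look like (the property values here are illustrative, not recommendations from this thread):

    spark.executor.memory     4g
    spark.executor.cores      2
    spark.executor.instances  16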

Re: Memory issues on spark

2016-02-17 Thread Zhang, Jingyu
May set "maximizeResourceAllocation", then EMR will do the best config for you. http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html Jingyu On 18 February 2016 at 12:02, wrote: > Hi All, > > I have been facing memory issues in spark. im using spark-sql on AWS

Re: Memory issues on spark

2016-02-17 Thread Kuchekar
Given your config, the cluster manager can only give 2 executors. Looking at m3.2xlarge: it has 30 GB memory. You have 3 m3.2xlarge instances, which means you have a total of 3 * 30 GB memory for executors. 15 GB for 16 executors would require 15 * 16 = 240 GB. Also check the number of executor cores you are s

Memory issues on spark

2016-02-17 Thread ARUN.BONGALE
Hi All, I have been facing memory issues in Spark. I'm using spark-sql on AWS EMR. I have an around 50 GB file in AWS S3. I want to read this file in a BI tool connected to spark-sql on the thrift-server over ODBC. I'm executing select * from table in the BI tool (QlikView, Tableau). I run into an OOM error someti

pyspark take function error while count() and collect() are working fine

2016-02-17 Thread Msr Msr
Hi, Please help with the below issue: I am reading a large file and trying to display the first few lines, but it is only able to display take(1) data. If it is take(2) or more, it gives the below error message. The error is with the take function. Both count() and collect() are working fine. The input f

Re: Importing csv files into Hive ORC target table

2016-02-17 Thread Alex Dzhagriev
Hi Mich, You can use data frames ( http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes) to achieve that. val sqlContext = new HiveContext(sc) var rdd = sc.textFile("/data/stg/table2") //... //perform your business logic, cleanups, etc. //... sqlContext.createDataFrame(resu
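A possible completion of the snippet above (a sketch only: the five-column layout is taken from the stg_t2 table elsewhere in this digest, and the target table name t2_orc is hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    case class Invoice(invoicenumber: String, paymentdate: String, net: String, vat: String, total: String)

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val rdd = sc.textFile("/data/stg/table2")               // bzip2 csv files are decompressed transparently
    val parsed = rdd.map(_.split(","))
                    .map(a => Invoice(a(0), a(1), a(2), a(3), a(4)))

    parsed.toDF().write.format("orc").saveAsTable("t2_orc")  // ORC-backed Hive table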

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Benjamin Kim
Ted, Any idea as to when this will be released? Thanks, Ben > On Feb 17, 2016, at 2:53 PM, Ted Yu wrote: > > The HBASE JIRA below is for HBase 2.0 > > HBase Spark module would be back ported to hbase 1.3.0 > > FYI > > On Feb 17, 2016, at 1:13 PM, Chandeep Singh

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Ted Yu
The HBASE JIRA below is for HBase 2.0 HBase Spark module would be back ported to hbase 1.3.0 FYI > On Feb 17, 2016, at 1:13 PM, Chandeep Singh wrote: > > HBase-Spark module was added in 1.3 > > https://issues.apache.org/jira/browse/HBASE-13992 > > http://blog.cloudera.com/blog/2015/08/apach

Re: listening to recursive folder structures in s3 using pyspark streaming (textFileStream)

2016-02-17 Thread Shixiong(Ryan) Zhu
textFileStream doesn't support that. It only supports monitoring one folder. On Wed, Feb 17, 2016 at 7:20 AM, in4maniac wrote: > Hi all, > > I am new to pyspark streaming and I was following a tutorial I saw in the > internet > ( > https://github.com/apache/spark/blob/master/examples/src/main/py
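One workaround that is often suggested (not from this thread) is to create one stream per folder and union them; a minimal sketch in Scala, with placeholder bucket/folder names (the pyspark equivalent is analogous):

    val folders = Seq("s3n://bucket/folder1/", "s3n://bucket/folder2/")
    val streams = folders.map(ssc.textFileStream)
    val lines = ssc.union(streams)   // behaves like one DStream over all the folders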

Importing csv files into Hive ORC target table

2016-02-17 Thread Mich Talebzadeh
Hi, We put csv files that are zipped using bzip into a staging area on HDFS. In Hive an external table is created as below: DROP TABLE IF EXISTS stg_t2; CREATE EXTERNAL TABLE stg_t2 ( INVOICENUMBER string ,PAYMENTDATE string ,NET string ,VAT string ,TOTAL string ) COMMENT 'from csv file fro

Re: Running multiple foreach loops

2016-02-17 Thread Daniel Imberman
Thank you Ted! On Wed, Feb 17, 2016 at 2:12 PM Ted Yu wrote: > If the Accumulators are updated at the same time, calling foreach() once > seems to have better performance. > > > On Feb 17, 2016, at 4:30 PM, Daniel Imberman > wrote: > > > > Hi all, > > > > So I'm currently figuring out how to ac

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
So suppose I have a bunch of userIds and I need to save them as parquet in database. I also need to load them back and need to be able to do a join on userId. My idea is to partition by userId hashcode first and then on userId. So that I don't have to deal with any performance issues because of a n
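One way to get close to this layout with the existing DataFrameWriter API (Spark 1.4+) is to add the hash bucket as a column and use partitionBy when writing; a sketch with hypothetical column names, bucket count, and path:

    import org.apache.spark.sql.functions.{col, udf}

    val bucket = udf((userId: String) => math.abs(userId.hashCode % 100))   // 100 buckets, assumed
    val withBucket = usersDF.withColumn("userIdBucket", bucket(col("userId")))

    // Files land under userIdBucket=NN/userId=.../part-*.parquet
    withBucket.write.partitionBy("userIdBucket", "userId").parquet("/data/users")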

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
So suppose I have a bunch of userIds and I need to save them as parquet in database. I also need to load them back and need to be able to do a join on userId. My idea is to partition by userId hashcode first and then on userId. On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust wrote: > Can yo

Re: Running multiple foreach loops

2016-02-17 Thread Ted Yu
If the Accumulators are updated at the same time, calling foreach() once seems to have better performance. > On Feb 17, 2016, at 4:30 PM, Daniel Imberman > wrote: > > Hi all, > > So I'm currently figuring out how to accumulate three separate accumulators: > > val a:Accumulator > val b:Accum

Problem mixing MESOS Cluster Mode and Docker task execution

2016-02-17 Thread g.eynard.bonte...@gmail.com
Hi everybody, I am testing the use of Docker for executing Spark algorithms on MESOS. I managed to execute Spark in client mode with executors inside Docker, but I wanted to go further and have also my Driver running into a Docker Container. Here I ran into a behavior that I'm not sure is normal,

Re: Streaming with broadcast joins

2016-02-17 Thread Sebastian Piu
You should be able to broadcast that data frame using sc.broadcast and join against it. On Wed, 17 Feb 2016, 21:13 Srikanth wrote: > Hello, > > I have a streaming use case where I plan to keep a dataset broadcasted and > cached on each executor. > Every micro batch in streaming will create a DF
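A minimal sketch of keeping a small reference DataFrame on the executors and joining each micro batch against it, using the broadcast() hint from org.apache.spark.sql.functions (a slightly different mechanism than sc.broadcast, available since 1.5); names, formats, and the join key are placeholders:

    import org.apache.spark.sql.functions.broadcast

    // Build and cache the small reference DataFrame once, outside the streaming loop.
    val refDF = sqlContext.read.parquet("/data/reference").cache()

    stream.foreachRDD { rdd =>
      val batchDF = sqlContext.createDataFrame(rdd)          // assumes an RDD of a case class
      val joined  = batchDF.join(broadcast(refDF), "key")    // map-side (broadcast) join
      // write joined out, e.g. to Cassandra or HDFS
    }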

Running multiple foreach loops

2016-02-17 Thread Daniel Imberman
Hi all, So I'm currently figuring out how to accumulate three separate accumulators: val a:Accumulator val b:Accumulator val c:Accumulator I have an r:RDD[thing] and the code currently reads r.foreach{ thing => a += thing b += thing c += thing } Idea

Streaming with broadcast joins

2016-02-17 Thread Srikanth
Hello, I have a streaming use case where I plan to keep a dataset broadcasted and cached on each executor. Every micro batch in streaming will create a DF out of the RDD and join the batch. The below code will perform the broadcast operation for each RDD. Is there a way to broadcast it just once?

Re: trouble using Aggregator with DataFrame

2016-02-17 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-13363 On Wed, Feb 17, 2016 at 3:02 PM, Michael Armbrust wrote: > Glad you like it :) > > This sounds like a bug, and we should fix it as we merge DataFrame / > Dataset for 2.0. Could you open JIRA targeted at 2.0? > > On Wed, Feb 17, 2016 at 2:22 PM,

Re: trouble using Aggregator with DataFrame

2016-02-17 Thread Michael Armbrust
Glad you like it :) This sounds like a bug, and we should fix it as we merge DataFrame / Dataset for 2.0. Could you open JIRA targeted at 2.0? On Wed, Feb 17, 2016 at 2:22 PM, Koert Kuipers wrote: > first of all i wanted to say that i am very happy to see > org.apache.spark.sql.expressions.Agg

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread Michael Armbrust
Can you describe what you are trying to accomplish? What would the custom partitioner be? On Tue, Feb 16, 2016 at 1:21 PM, SRK wrote: > Hi, > > How do I use a custom partitioner when I do a saveAsTable in a dataframe. > > > Thanks, > Swetha > > > > -- > View this message in context: > http://ap

Re: cartesian with Dataset

2016-02-17 Thread Michael Armbrust
You will get a cartesian if you do a join/joinWith using lit(true) as the condition. We could consider adding an API for doing that more concisely. On Wed, Feb 17, 2016 at 4:08 AM, Alex Dzhagriev wrote: > Hello all, > > Is anybody aware of any plans to support cartesian for Datasets? Are there
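In code, the suggestion looks roughly like this (ds1 and ds2 are placeholder Datasets):

    import org.apache.spark.sql.functions.lit

    // Cartesian product as a Dataset[(A, B)]: every row of ds1 paired with every row of ds2.
    val cartesian = ds1.joinWith(ds2, lit(true))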

Yarn client mode: Setting environment variables

2016-02-17 Thread Lin Zhao
I've been trying to set some environment variables for the spark executors but haven't had much luck. I tried editing conf/spark-env.sh but it doesn't get through to the executors. I'm running 1.6.0 and yarn, any pointer is appreciated. Thanks, Lin

Re: SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-17 Thread Shixiong(Ryan) Zhu
`onApplicationEnd` is posted when SparkContext is stopping, and you cannot submit any job to a stopping SparkContext. In general, SparkListener is used to monitor the job progress and collect job information, and you should not submit jobs there. Why not submit your jobs in the main thread? On Wed,

trouble using Aggregator with DataFrame

2016-02-17 Thread Koert Kuipers
First of all I wanted to say that I am very happy to see org.apache.spark.sql.expressions.Aggregator; it is a neat API, especially when compared to the UDAF/AggregateFunction stuff. Its doc/comments say: A base class for user-defined aggregations, which can be used in [[DataFrame]] and [[Dataset]

Re: Spark Streaming with Kafka DirectStream

2016-02-17 Thread Cody Koeninger
You can print whatever you want wherever you want, it's just a question of whether it's going to show up on the driver or the various executors logs On Wed, Feb 17, 2016 at 5:50 AM, Cyril Scetbon wrote: > I don't think we can print an integer value in a spark streaming process > As opposed to a

Re: Spark Streaming with Kafka DirectStream

2016-02-17 Thread Cyril Scetbon
I don't think we can print an integer value in a Spark streaming process, as opposed to a Spark job. I think I can print the content of an RDD but not debug messages. Am I wrong? Cyril Scetbon > On Feb 17, 2016, at 12:51 AM, ayan guha wrote: > > Hi > > You can always use RDD properties, whi

Getting out of memory error during coalesce

2016-02-17 Thread Anubhav Agarwal
Hi, We have a very small 12 MB file that we join with other data. The job runs fine and saves the data as a parquet file. But if we use coalesce(1) we get the following error: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAnd

Calender Obj to java.util.date conversion issue

2016-02-17 Thread satish chandra j
Hi All, Please find the below snippet referring to a UDF which subtracts a day from the given date value. Snippet: val sub_a_day = udf((d:Date) => { cal.setTime(d); cal.add(Calendar.DATE, -1); cal.getTime() }) Error: Exception in thread "main" java.lang.UnsupportedOperati
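The UnsupportedOperationException here is typically because Calendar.getTime() returns a java.util.Date, for which Spark SQL has no schema mapping; returning a java.sql.Date usually resolves it. A hedged sketch of the corrected UDF (the Calendar is created inside the function rather than captured from outside):

    import java.sql.Date
    import java.util.Calendar
    import org.apache.spark.sql.functions.udf

    val sub_a_day = udf { (d: Date) =>
      val cal = Calendar.getInstance()
      cal.setTime(d)
      cal.add(Calendar.DATE, -1)
      new Date(cal.getTimeInMillis)   // java.sql.Date maps to Spark SQL's DateType
    }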

Why no computations run on workers/slaves in cluster mode?

2016-02-17 Thread Junjie Qian
Hi all, I am new to Spark, and have one problem that, no computations run on workers/slave_servers in the standalone cluster mode. The Spark version is 1.6.0, and environment is CentOS. I run the example codes, e.g. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/s

Re: SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-17 Thread Sumona Routh
Can anyone provide some insight into the flow of SparkListeners, specifically onApplicationEnd? I'm having issues with the SparkContext being stopped before my final processing can complete. Thanks! Sumona On Mon, Feb 15, 2016 at 8:59 AM Sumona Routh wrote: > Hi there, > I am trying to implemen

Re: Error when executing Spark application on YARN

2016-02-17 Thread nsalian
Hi, Thanks for the question. I do see this in the bottom: 16/02/17 15:31:02 ERROR SparkContext: Error initializing SparkContext. Some questions to help get more understanding: 1) Does this happen to any other jobs? 2) Any changes to the Spark setup in recent time? 3) Could you open the trackin

Spark Streaming with Kafka Use Case

2016-02-17 Thread Abhishek Anand
I have a spark streaming application running in production. I am trying to find a solution for a particular use case when my application has a downtime of, say, 5 hours and is restarted. Now, when I start my streaming application after 5 hours, there would be a considerable amount of data then in the Ka

Re: Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-17 Thread radoburansky
Hi Jerry, How do you know that only 100 messages are inserted? What is the primary key of the "tableOfTopicA" Cassandra table? Isn't it possible that you map more messages to the same primary key and therefore they overwrite each other in Cassandra? Regards Rado On Tue, Feb 16, 2016 at 10:29

cartesian with Dataset

2016-02-17 Thread Alex Dzhagriev
Hello all, Is anybody aware of any plans to support cartesian for Datasets? Are there any ways to work around this issue without switching to RDDs? Thanks, Alex.

Re: data type transform when creating an RDD object

2016-02-17 Thread Daniel Siegmann
This should do it (for the implementation of your parse method, Google should easily provide information - SimpleDateFormat is probably what you want): def parseDate(s: String): java.sql.Date = { ... } val people = sc.textFile("examples/src/main/resources/people.txt") .map(_.spli
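A possible completion of the snippet above, assuming a comma-separated file with name,DOB and dates formatted as yyyy-MM-dd (both assumptions):

    import java.text.SimpleDateFormat

    def parseDate(s: String): java.sql.Date = {
      val fmt = new SimpleDateFormat("yyyy-MM-dd")          // assumed date format
      new java.sql.Date(fmt.parse(s).getTime)
    }

    case class Person(name: String, DOB: java.sql.Date)

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0).trim, parseDate(p(1).trim)))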

Re: Saving Kafka Offsets to Cassandra at begining of each batch in Spark Streaming

2016-02-17 Thread Cody Koeninger
Todd's withSessionDo suggestion seems like a better idea. On Wed, Feb 17, 2016 at 12:25 AM, Abhishek Anand wrote: > Hi Cody, > > I am able to do using this piece of code > > kafkaStreamRdd.foreachRDD((rdd,batchMilliSec) -> { > Date currentBatchTime = new Date(); > currentBatchTime.setTime(batchM
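For reference, the offsets being saved each batch typically come from HasOffsetRanges on the direct stream's RDDs; a minimal sketch (the Cassandra write itself is omitted and would go inside the withSessionDo block mentioned above):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directKafkaStream.foreachRDD { rdd =>
      // Capture the offset ranges at the start of the batch, before any transformation.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        // persist (o.topic, o.partition, o.fromOffset, o.untilOffset) to Cassandra here
      }
      // ... then process rdd as usual ...
    }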

Re: How to update data saved as parquet in hdfs using Dataframes

2016-02-17 Thread Arkadiusz Bicz
Hi, HDFS is append-only, so you need to modify the data as you read it and write it to another place. On Wed, Feb 17, 2016 at 2:45 AM, SRK wrote: > Hi, > > How do I update data saved as Parquet in hdfs using dataframes? If I use > SaveMode.Append, it just seems to append the data but does not seem to > upda

Re: Spark Streaming with Kafka Use Case

2016-02-17 Thread Cody Koeninger
Just use a kafka rdd in a batch job or two, then start your streaming job. On Wed, Feb 17, 2016 at 12:57 AM, Abhishek Anand wrote: > I have a spark streaming application running in production. I am trying to > find a solution for a particular use case when my application has a > downtime of say
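A rough sketch of what "a kafka rdd in a batch job" could look like with the direct Kafka integration; the broker, topic, and offsets are placeholders and would come from wherever offsets were last saved:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams  = Map("metadata.broker.list" -> "broker1:9092")
    val offsetRanges = Array(OffsetRange("topicA", 0, 1000L, 5000L))   // covers the downtime window

    val backlog = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    // process the backlog here, then start the StreamingContext as usual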

Re: Worker's BlockManager Folder not getting cleared

2016-02-17 Thread Abhishek Anand
Looking for an answer to this. Is it safe to delete the older files using find . -type f -cmin +200 -name "shuffle*" -exec rm -rf {} \; For a window duration of 2 hours, how old must files be before we can delete them? Thanks. On Sun, Feb 14, 2016 at 12:34 PM, Abhishek Anand wrote: > Hi All, > > Any ideas on thi

Re: AFTSurvivalRegression Prediction and QuantilProbabilities

2016-02-17 Thread Xiangrui Meng
You can get the response, and quantiles if you enable quantilesCol. You can change the quantile probabilities as well. There is some example code in the user guide: http://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression. -Xiangrui On Mon, Feb 1, 2016 at 9:09 AM Chris
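The relevant calls, roughly as in the linked user-guide example (the training DataFrame and probability values are illustrative):

    import org.apache.spark.ml.regression.AFTSurvivalRegression

    val aft = new AFTSurvivalRegression()
      .setQuantileProbabilities(Array(0.3, 0.6))   // change the quantile probabilities here
      .setQuantilesCol("quantiles")                // enables quantile output next to the prediction
    val model = aft.fit(training)                  // training: DataFrame with label, censor, features
    model.transform(training).show(false)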

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread Rishi Mishra
Unfortunately there is not any, at least till 1.5. I have not gone through the new Dataset of 1.6. There is some basic support for Parquet, like partitioning by column. If you want to partition your dataset in a certain way, you have to use an RDD to partition & convert that into a DataFrame before stori

Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Chandeep Singh
HBase-Spark module was added in 1.3 https://issues.apache.org/jira/browse/HBASE-13992 http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

Error when executing Spark application on YARN

2016-02-17 Thread alvarobrandon
Hello: I'm trying to launch an application in a yarn cluster with the following command /opt/spark/bin/spark-submit --class com.abrandon.upm.GenerateKMeansData --master yarn --deploy-mode client /opt/spark/BenchMark-1.0-SNAPSHOT.jar kMeans 5 4 5 0.9 8 The last bit after the jar file are

Re: Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-17 Thread Jerry
Rado, Yes, you are correct. A lot of messages are created at almost the same time (even using milliseconds). I changed to use "UUID.randomUUID()" with which all messages can be inserted into the Cassandra table without a time lag. Thank you very much! Jerry Wong On Wed, Feb 17, 2016 at 1:50 AM, radob

Re: Spark SQL step with many tasks takes a long time to begin processing

2016-02-17 Thread Teng Qiu
For 1) yes, I believe changing to HDFS will solve the problem, but I would say that changing storage is already a problem... :D 2) You use EMR, right? So Spark is running on YARN, I believe; then spark.executor.cores is 1 by default. If you want to change this, you may need to change YARN settings as

SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Divya Gehlot
Hi, SparkonHBase is integrated with which version of Spark and HBase ? Thanks, Divya

data type transform when creating an RDD object

2016-02-17 Thread Lin, Hao
Hi, A quick question on data type transformation when creating an RDD object. I want to create a Person object with "name" and DOB (date of birth): case class Person(name: String, DOB: java.sql.Date). Then I want to create an RDD from a text file without the header, e.g. "name" and "DOB". I have pr

Re: Spark null pointer exception and task failure

2016-02-17 Thread bijuna
The program does not fail consistently. We are executing the Spark driver as a standalone Java program. If the cause were a null value in a field, I would assume the program would fail every time we execute it. But it fails with a NullPointerException and task failure only on certain executions

Re: How to update data saved as parquet in hdfs using Dataframes

2016-02-17 Thread Takeshi Yamamuro
Hi, Even if you update only a few rows, you need to read the whole data from Parquet, update it, and then save all the data as other new files. On Tue, Feb 16, 2016 at 9:45 PM, SRK wrote: > Hi, > > How do I update data saved as Parquet in hdfs using dataframes? If I use > SaveMode.Append, it just seems

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-17 Thread Steve Loughran
On 17 Feb 2016, at 01:29, Nirav Patel wrote: I think you are not getting my question. I know how to tune executor memory settings and parallelism. That's not an issue. It's a specific question about what happens when the physical memory limit of a given executor is r

listening to recursive folder structures in s3 using pyspark streaming (textFileStream)

2016-02-17 Thread in4maniac
Hi all, I am new to pyspark streaming and I was following a tutorial I saw on the internet (https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py). But I replaced the data input with an s3 directory path as: lines = ssc.textFileStream("s3n://bucket/f