Detecting application restart when running in supervised cluster mode

2016-04-04 Thread Rafael Barreto
Hello, I have a driver deployed using `spark-submit` in supervised cluster mode. Sometimes my application dies due to some transient problem, and the restart works perfectly. However, it would be useful to get alerted when that happens. Is there any out-of-the-box way of doing that? Perhaps a hoo

Re: multiple splits fails

2016-04-04 Thread Sachin Aggarwal
Hi, instead of using print() directly on the DStream, I suggest you use foreachRDD if you want to materialize all rows; an example is shown here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams dstream.foreachRDD(rdd => { val connection =
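A minimal sketch of the foreachRDD pattern from the linked guide; `dstream`, `createNewConnection`, and `connection.send` are placeholders for the poster's own stream and connection code, not part of the original message:

    // Scala sketch: create one connection per partition on the executors,
    // rather than printing on the driver.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createNewConnection()            // hypothetical connection factory
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close()
      }
    }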

Timeout in mapWithState

2016-04-04 Thread Abhishek Anand
What exactly is the timeout in mapWithState? I want the keys to get removed from memory if there is no data received on a key for 10 minutes. How can I achieve this with mapWithState? Regards, Abhi
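One way to get that behaviour (a sketch based on the Spark 1.6 StateSpec API; the stream, key/value types, and mapping function are assumptions, not from the thread) is to attach a 10-minute idle timeout to the StateSpec; when a key receives no data for that long, the mapping function is invoked one last time with isTimingOut set and the state is then dropped:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // Hypothetical running-count state over a DStream[(String, Int)] named keyedStream.
    def trackState(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
      if (state.isTimingOut()) {
        None                                   // key idle for 10 minutes; its state is being removed
      } else {
        val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
        state.update(sum)
        Some((key, sum))
      }
    }

    val spec = StateSpec.function(trackState _).timeout(Minutes(10))
    val stateStream = keyedStream.mapWithState(spec)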

Unable to access temp table via jdbc client

2016-04-04 Thread ram kumar
Hi, I started a Hive thrift server from the Hive home directory, ./bin/hiveserver2, opened the JDBC client, ./bin/beeline, and connected to the thrift server: 0: > show tables; ++--+ |tab_name| ++--+ | check | | people | +-

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Koert Kuipers
can you try: spark.shuffle.reduceLocality.enabled=false On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote: > Dear all, > > Thank you for your responses. > > Michael Slavitch: > > Just to be sure: Has spark-env.sh and spark-defaults.conf been > correctly propagated to all nodes?
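For reference, a sketch of one way to apply the suggested setting (the application name is a placeholder): either pass --conf spark.shuffle.reduceLocality.enabled=false to spark-submit, or set it on the SparkConf before creating the context:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("PartitionBalanceTest")                       // placeholder name
      .set("spark.shuffle.reduceLocality.enabled", "false")     // the setting suggested above
    val sc = new SparkContext(conf)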

Re: --packages configuration equivalent item name?

2016-04-04 Thread Saisai Shao
spark.jars.ivy, spark.jars.packages, and spark.jars.excludes are the configurations you can use. Thanks Saisai On Sun, Apr 3, 2016 at 1:59 AM, Russell Jurney wrote: > Thanks, Andy! > > On Mon, Mar 28, 2016 at 8:44 AM, Andy Davidson < > a...@santacruzintegration.com> wrote: > >> Hi Russell >> >> I us
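A sketch of how the entries named above might look in spark-defaults.conf; the coordinates and path below are placeholders, not from the thread:

    spark.jars.packages   org.apache.spark:spark-streaming-kafka_2.10:1.6.1
    spark.jars.excludes   com.example:unwanted-dep
    spark.jars.ivy        /home/spark/.ivy2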

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
Dear all, Thank you for your responses. Michael Slavitch: > Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly > propagated to all nodes? Are they identical? Yes; these files are stored on a shared memory directory accessible to all nodes. Koert Kuipers: > we ran into si

Standard Deviation in Hive 2 is still incorrect

2016-04-04 Thread Mich Talebzadeh
Hi, I reported back in April 2015 that what Hive calls the standard deviation function, STDDEV, is a pointer to STDDEV_POP. This is incorrect and has not been rectified in Hive 2. Both Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP. I also did tests with Spark 1.6, and Spark correc
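For reference, the two estimators differ only in the denominator: STDDEV_SAMP uses the sample form with n - 1, while STDDEV_POP uses the population form with n:

    \sigma_{\text{pop}} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2},
    \qquad
    s_{\text{samp}} = \sqrt{\tfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}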

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
The only way I've found to make it work now is by using the current Spark context and changing its configuration using spark-shell options, which is really different from pyspark, where you can't instantiate a new one, initialize it, etc. > On Apr 4, 2016, at 18:16, Cyril Scetbon wrote: > > It

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread Ted Yu
bq. I'm on version 2.10 for spark The above is the Scala version. Can you give us the Spark version? Thanks On Mon, Apr 4, 2016 at 2:36 PM, mpawashe wrote: > Hi all, > > I am using Spark Streaming API (I'm on version 2.10 for spark and > streaming), and I am running into a function serialization

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
It doesn't, as you can see: http://pastebin.com/nKcMCtGb I don't need to set the master as I'm using YARN and I'm on one of the YARN nodes. When I instantiate the Spark Streaming Context with the spark conf, it tries to create a new Spark Context, but even with .set("spark.driver.allowMultipleCo
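If the goal is simply to reuse the SparkContext that spark-shell provides, the streaming context can be built on top of it rather than from a fresh SparkConf; a minimal sketch (not from the original thread; the batch interval is a placeholder):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // In spark-shell the SparkContext `sc` already exists; constructing the
    // StreamingContext from it avoids trying to create a second SparkContext.
    val ssc = new StreamingContext(sc, Seconds(10))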

Spark Streaming and Performance

2016-04-04 Thread Mich Talebzadeh
Hi, Having started using Spark Streaming with Kafka, it seems that it offers a number of good opportunities. I am considering using it for Complex Event Processing (CEP) by building CEP adaptors and Spark transformers. Anyway, before going there I would like to see if anyone has done benchmarks on Spa

Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread mpawashe
Hi all, I am using Spark Streaming API (I'm on version 2.10 for spark and streaming), and I am running into a function serialization issue that I do not run into when using Spark in batch (non-streaming) mode. If I wrote code like this: def run(): Unit = { val newStream = stream.map(x => { x
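A common cause (an assumption, since the code above is truncated) is that the closure passed to map captures the enclosing non-serializable class through `this`; copying the needed fields to local vals before the map usually avoids it:

    import org.apache.spark.streaming.dstream.DStream

    class Job(stream: DStream[String]) {       // hypothetical enclosing class
      val suffix = "-processed"

      def run(): Unit = {
        // Referencing `suffix` inside the closure would drag `this` into it and
        // fail serialization if Job is not Serializable; use a local copy instead.
        val localSuffix = suffix
        val newStream = stream.map(x => x + localSuffix)
        newStream.print()
      }
    }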

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-04-04 Thread Andy Davidson
Hi Jeff, Sorry I did not respond sooner; I was out of town. Here is the code I use to initialize the HiveContext: # load data set from pyspark.sql import HiveContext #,SQLContext, Row # window functions require HiveContext (spark 2.x will not require hive) #sqlContext = SQLContext(sc) hiveSqlConte

Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
Hi, Yes, I believe people do that. I also believe that SparkML is able to figure out when to cache some internal RDD itself; that's definitely true for the random forest algorithm. It doesn't harm to cache the same RDD twice, either. But it's not clear what you'd want to know... -- Be well! Jean Morozov On S

Re: spark-shell failing but pyspark works

2016-04-04 Thread Mich Talebzadeh
Hi Cyril, You can connect to the Spark shell from any node. The connection is made to the master through --master and its IP address, like below: spark-shell --master spark://50.140.197.217:7077 Now in the Scala code you can specify something like below: val sparkConf = new SparkConf(). setAppName(
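The SparkConf above is cut off; a completed sketch, reusing the app name and master values that appear elsewhere in this thread (any extra settings are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sparkConf = new SparkConf()
      .setAppName("StreamTest")
      .setMaster("spark://50.140.197.217:7077")    // or "yarn-client" when running on YARN
      .set("spark.cores.max", "2")                 // placeholder setting
    val sc = new SparkContext(sparkConf)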

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
I suppose it doesn't work using spark-shell either? Can you confirm? Thanks > On Apr 3, 2016, at 03:39, Mich Talebzadeh wrote: > > This works fine for me > > val sparkConf = new SparkConf(). > setAppName("StreamTest"). > setMaster("yarn-client"). > set("sp

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-04 Thread Mich Talebzadeh
Actually this may not be a bug. It is just that the optimizer decides to do a nested loop join over a hash join when more than two OR conditions are involved. With one equality predicate a hash join is chosen: 4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2 5> WHERE SALES.CUST_ID = SALES2.CUST_ID 6> go QUERY PLAN
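A common workaround (not from the original thread) when an OR-of-equalities join falls back to a nested loop is to rewrite it as a union of equality joins, each of which can use a hash join; a sketch in Spark SQL, assuming registered tables t1 and t2 with join keys a and b (all hypothetical names):

    // UNION removes the exact duplicates produced when a row pair satisfies
    // both conditions, matching the OR join as long as the inputs are distinct.
    val joined = sqlContext.sql(
      """SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.a = t2.a
        |UNION
        |SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.b = t2.b""".stripMargin)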

Re: Support for time column type?

2016-04-04 Thread Philip Weaver
Hmm, yeah, it looks like I could use that to represent time since start of day. I'm porting existing large SQL queries from Postgres to Spark SQL for a quick POC, so I'd prefer not to have to make many changes to them. I'm not sure if the CalendarIntervalType can be used as a drop-in replacement (i.e.

Re: All inclusive uber-jar

2016-04-04 Thread Haroon Rasheed
To add to Mich, I put the build.sbt under the MyProject root folder: MyProject/build.sbt, and the assembly.sbt is placed in the folder called "project" under the MyProject folder: MyProject/project/assembly.sbt. Also, the first line in build.sbt imports the assembly keys as below: import A
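A minimal sketch of the layout described above; the plugin version, project name, and Scala version are placeholders, and newer sbt-assembly releases (0.14+) no longer need the AssemblyKeys import:

    // MyProject/project/assembly.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // MyProject/build.sbt
    import AssemblyKeys._        // the "import A..." line referenced above

    assemblySettings

    name := "MyProject"
    scalaVersion := "2.10.5"

    // Spark itself marked as provided so it is not bundled into the uber jar,
    // per Marco's note elsewhere in this thread.
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"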

Re: All inclusive uber-jar

2016-04-04 Thread Mich Talebzadeh
Thanks all. Actually I have a generic shell script that, for a given Scala program, creates a jar file using Maven or SBT. I modified that one to create an uber jar file as well using SBT assembly. The root directory structure for the uber build has one more sub-directory. In addition, SBT relies on a single sbt

Spark SQL (Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I am using this code snippet to exec
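If the truncated snippet is running hiveContext.sql(...), one workaround (an assumption, not from the thread) is to repartition the resulting DataFrame explicitly when 31 partitions is too few:

    // Sketch: `hiveContext`, the query, and the partition count are placeholders.
    val df = hiveContext.sql("SELECT * FROM some_table")
    val repartitioned = df.repartition(200)   // pick a count suited to your cluster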

SparkDriver throwing java.lang.OutOfMemoryError: Java heap space

2016-04-04 Thread Nirav Patel
Hi, We are using Spark 1.5.2 and recently started hitting this issue after our dataset grew from 140GB to 160GB. The error is thrown during shuffle fetch on the reduce side, which should all happen on executors, and the executors should report it! However it gets reported only on the driver. SparkContext gets shut down fr

Fwd: All inclusive uber-jar

2016-04-04 Thread vetal king
-- Forwarded message -- From: vetal king Date: Mon, Apr 4, 2016 at 8:59 PM Subject: Re: All inclusive uber-jar To: Mich Talebzadeh Not sure how to create an uber jar using sbt, but this is how you can do it using Maven: org.apache.maven.plugins

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there ? I ask because master branch changes very fast. It would be good to narrow the scope where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Michael Slavitch
Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly propagated to all nodes? Are they identical? > On Apr 4, 2016, at 9:12 AM, Mike Hynes <91m...@gmail.com> wrote: > > [ CC'ing dev list since nearly identical questions have occurred in > user list recently w/o resolution;

Re: All inclusive uber-jar

2016-04-04 Thread Marco Mistroni
Hi, you can use SBT assembly to create the uber jar. You should set the Spark libraries as 'provided' in your SBT build. Hth Marco Ps apologies if by any chance I'm telling you something you already know On 4 Apr 2016 2:36 pm, "Mich Talebzadeh" wrote: > Hi, > > > When one builds a project for Spark in this case Spark str

All inclusive uber-jar

2016-04-04 Thread Mich Talebzadeh
Hi, When one builds a project for Spark in this case Spark streaming with SBT, as usual I add dependencies as follows: libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1" However when I submit it

RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
[ CC'ing dev list since nearly identical questions have occurred in user list recently w/o resolution; c.f.: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-sing

Failed to get broadcast_1_piece0 of broadcast_1

2016-04-04 Thread Akhilesh Pathodia
Hi, I am running Spark jobs on YARN in cluster mode. The job gets messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time, it runs fine without any issue. If I kill the job and restart it, it throws the below exception in exec
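One pattern that often avoids this (an assumption about the cause, since the exception is truncated above): broadcast variables are not recovered from a checkpoint, so re-create them lazily in a singleton instead of capturing the original instance in the checkpointed DStream graph, along the lines of the broadcast/accumulator example in the streaming guide. Contents and type parameters below are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Lazily (re)built singleton so the broadcast exists again after a restart
    // from checkpoint.
    object LookupTable {
      @volatile private var instance: Broadcast[Map[String, String]] = _

      def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Map("key" -> "value"))   // placeholder contents
            }
          }
        }
        instance
      }
    }

    // Inside the job, fetch it per batch instead of closing over a stale broadcast:
    // dstream.foreachRDD { rdd =>
    //   val lookup = LookupTable.getInstance(rdd.sparkContext)
    //   rdd.foreach(record => lookup.value.get(record))
    // }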

Spark 1.6.1 binary pre-built for Hadoop 2.6 may be broken

2016-04-04 Thread Kousuke Saruta
Hi all, I noticed the binary pre-built for Hadoop 2.6, which we can download from spark.apache.org/downloads.html (Direct Download), may be broken. I couldn't decompress at least the following 4 tgzs with the "tar xfzv" command, and the md5 checksums didn't match. * spark-1.6.1-bin-hadoop2.6.tgz * spark-1.6.1-bin

Re:

2016-04-04 Thread Akhil Das
1 core with 4 partitions means it executes them one by one, not in parallel. For the Kafka question, if you don't have a high data volume then you may not need 40 partitions. Thanks Best Regards On Sat, Apr 2, 2016 at 7:35 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > As per

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
I didn't know you had a parquet file containing JSON data. Thanks Best Regards On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V wrote: > Hi Akhil, > > Thanks for your help. Why did you use "," as the separator? > > I have a parquet file which contains only JSON in each line. > > *Thanks*, >

Re: Read Parquet in Java Spark

2016-04-04 Thread Ramkumar V
Hi Akhil, Thanks for your help. Why did you use "," as the separator? I have a parquet file which contains only JSON in each line. *Thanks*, On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das wrote: > Something like this (in scala): > > val rdd = parquetFile.java

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
Something like this (in Scala): val rdd = parquetFile.javaRDD().map(row => row.mkString(",")) You can create a map operation over your javaRDD to convert the org.apache.spark.sql.Row to String (the Row.mkString() operati
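A slightly fuller sketch of the same idea (the path and context names are placeholders): read the Parquet file and turn each Row into a delimited string, which here gives back the JSON text stored in each row:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)          // assumes an existing SparkContext `sc`
    val parquetFile = sqlContext.read.parquet("/path/to/data.parquet")

    // DataFrame.map on Spark 1.x yields an RDD; mkString joins the Row's columns.
    val lines = parquetFile.map(row => row.mkString(","))
    lines.take(5).foreach(println)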

Re: How many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC

2016-04-04 Thread Mich Talebzadeh
This all depends on whether you provide information to the driver about the underlying RDBMS table; assuming there is a unique ID on the underlying table, you can use it to partition the load. Have a look at this: http://metricbrew.com/get-data-from-databases-with-apache-spark-jdbc/ HTH Dr Mich Talebzadeh
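A sketch of the partitioned JDBC read (URL, table, credentials, and bounds are placeholders): each partition opens its own connection, so with numPartitions = 10 up to 10 concurrent reads hit the database:

    import java.util.Properties
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)        // assumes an existing SparkContext `sc`
    val props = new Properties()
    props.setProperty("user", "spark")
    props.setProperty("password", "secret")

    val df = sqlContext.read.jdbc(
      "jdbc:mysql://dbhost:3306/mydb",   // url
      "orders",                          // table
      "id",                              // numeric column used to split the range
      1L,                                // lowerBound
      1000000L,                          // upperBound
      10,                                // numPartitions
      props)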

How many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC

2016-04-04 Thread Zhang, Jingyu
Hi All, I want to read MySQL from Spark. Please let me know how many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC. What is the best practice for reading a large dataset from an RDBMS into Spark? Thanks, Jingyu

Re: Where to set properties for the retainedJobs/Stages?

2016-04-04 Thread Max Schmidt
Okay, I put the props in spark-defaults.conf, but they are not recognized, as they don't appear in the 'Environment' tab during an application execution; spark.eventLog.enabled, for example. On 01.04.2016 at 21:22, Ted Yu wrote: > Please > read > https://spark.apache.org/docs/latest/configuration.h
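For reference, a sketch of the settings the thread is about as they would appear in spark-defaults.conf (values are placeholders); they only take effect if they reach the driver's configuration before the SparkContext starts, which is also when they show up in the Environment tab:

    spark.eventLog.enabled     true
    spark.eventLog.dir         hdfs:///spark-events
    spark.ui.retainedJobs      1000
    spark.ui.retainedStages    1000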