Detecting application restart when running in supervised cluster mode

2016-04-04 Thread Rafael Barreto
Hello, I have a driver deployed using `spark-submit` in supervised cluster mode. Sometimes my application dies due to some transient problem, and the restart works perfectly. However, it would be useful to get alerted when that happens. Is there any out-of-the-box way of doing that? Perhaps a hoo

Re: multiple splits fails

2016-04-04 Thread Sachin Aggarwal
Hi, instead of using print() directly on the DStream, I suggest you use foreachRDD if you want to materialize all rows; an example is shown here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams dstream.foreachRDD(rdd => { val connection =
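A minimal sketch of the foreachRDD pattern from the linked guide; `dstream`, `createNewConnection`, and `connection.send` are placeholders for the poster's own stream and connection code, not part of the original message:

    // Scala sketch: create one connection per partition on the executors,
    // rather than printing on the driver.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createNewConnection()            // hypothetical connection factory
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close()
      }
    }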

Timeout in mapWithState

2016-04-04 Thread Abhishek Anand
What exactly is the timeout in mapWithState? I want the keys to get removed from memory if there is no data received on a key for 10 minutes. How can I achieve this with mapWithState? Regards, Abhi
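One way to get that behaviour (a sketch based on the Spark 1.6 StateSpec API; the stream, key/value types, and mapping function are assumptions, not from the thread) is to attach a 10-minute idle timeout to the StateSpec; when a key receives no data for that long, the mapping function is invoked one last time with isTimingOut set and the state is then dropped:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // Hypothetical running-count state over a DStream[(String, Int)] named keyedStream.
    def trackState(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
      if (state.isTimingOut()) {
        None                                   // key idle for 10 minutes; its state is being removed
      } else {
        val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
        state.update(sum)
        Some((key, sum))
      }
    }

    val spec = StateSpec.function(trackState _).timeout(Minutes(10))
    val stateStream = keyedStream.mapWithState(spec)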

Unable to access temp table via jdbc client

2016-04-04 Thread ram kumar
Hi, I started a Hive thrift server from the Hive home directory, ./bin/hiveserver2, opened the JDBC client, ./bin/beeline, and connected to the thrift server: 0: > show tables; ++--+ |tab_name| ++--+ | check | | people | +-

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Koert Kuipers
can you try: spark.shuffle.reduceLocality.enabled=false On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote: > Dear all, > > Thank you for your responses. > > Michael Slavitch: > > Just to be sure: Has spark-env.sh and spark-defaults.conf been > correctly propagated to all nodes?
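For reference, a sketch of one way to apply the suggested setting (the application name is a placeholder): either pass --conf spark.shuffle.reduceLocality.enabled=false to spark-submit, or set it on the SparkConf before creating the context:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("PartitionBalanceTest")                       // placeholder name
      .set("spark.shuffle.reduceLocality.enabled", "false")     // the setting suggested above
    val sc = new SparkContext(conf)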

Re: --packages configuration equivalent item name?

2016-04-04 Thread Saisai Shao
spark.jars.ivy, spark.jars.packages, and spark.jars.excludes are the configurations you can use. Thanks Saisai On Sun, Apr 3, 2016 at 1:59 AM, Russell Jurney wrote: > Thanks, Andy! > > On Mon, Mar 28, 2016 at 8:44 AM, Andy Davidson < > a...@santacruzintegration.com> wrote: > >> Hi Russell >> >> I us
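A sketch of how the entries named above might look in spark-defaults.conf; the coordinates and path below are placeholders, not from the thread:

    spark.jars.packages   org.apache.spark:spark-streaming-kafka_2.10:1.6.1
    spark.jars.excludes   com.example:unwanted-dep
    spark.jars.ivy        /home/spark/.ivy2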

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
Dear all, Thank you for your responses. Michael Slavitch: > Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly > propagated to all nodes? Are they identical? Yes; these files are stored on a shared memory directory accessible to all nodes. Koert Kuipers: > we ran into si

Standard Deviation in Hive 2 is still incorrect

2016-04-04 Thread Mich Talebzadeh
Hi, I reported back in April 2015 that what Hive calls the standard deviation function, STDDEV, is a pointer to STDDEV_POP. This is incorrect and has not been rectified in Hive 2. Both Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP. I also did tests with Spark 1.6, and Spark correc
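For reference, the two estimators differ only in the denominator: STDDEV_SAMP uses the sample form with n - 1, while STDDEV_POP uses the population form with n:

    \sigma_{\text{pop}} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2},
    \qquad
    s_{\text{samp}} = \sqrt{\tfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}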

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
The only way I've found to make it work now is by using the current Spark context and changing its configuration using spark-shell options, which is really different from pyspark, where you can't instantiate a new one, initialize it, etc. > On Apr 4, 2016, at 18:16, Cyril Scetbon wrote: > > It

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread Ted Yu
bq. I'm on version 2.10 for spark The above is the Scala version. Can you give us the Spark version? Thanks On Mon, Apr 4, 2016 at 2:36 PM, mpawashe wrote: > Hi all, > > I am using Spark Streaming API (I'm on version 2.10 for spark and > streaming), and I am running into a function serialization

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
It doesn't, as you can see: http://pastebin.com/nKcMCtGb I don't need to set the master as I'm using YARN and I'm on one of the YARN nodes. When I instantiate the Spark Streaming Context with the spark conf, it tries to create a new Spark Context, but even with .set("spark.driver.allowMultipleCo
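If the goal is simply to reuse the SparkContext that spark-shell provides, the streaming context can be built on top of it rather than from a fresh SparkConf; a minimal sketch (not from the original thread; the batch interval is a placeholder):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // In spark-shell the SparkContext `sc` already exists; constructing the
    // StreamingContext from it avoids trying to create a second SparkContext.
    val ssc = new StreamingContext(sc, Seconds(10))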

Spark Streaming and Performance

2016-04-04 Thread Mich Talebzadeh
Hi, Having started using Spark Streaming with Kafka, it seems that it offers a number of good opportunities. I am considering using it for Complex Event Processing (CEP) by building CEP adaptors and Spark transformers. Anyway, before going there I would like to see if anyone has done benchmarks on Spa

Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread mpawashe
Hi all, I am using Spark Streaming API (I'm on version 2.10 for spark and streaming), and I am running into a function serialization issue that I do not run into when using Spark in batch (non-streaming) mode. If I wrote code like this: def run(): Unit = { val newStream = stream.map(x => { x
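A common cause (an assumption, since the code above is truncated) is that the closure passed to map captures the enclosing non-serializable class through `this`; copying the needed fields to local vals before the map usually avoids it:

    import org.apache.spark.streaming.dstream.DStream

    class Job(stream: DStream[String]) {       // hypothetical enclosing class
      val suffix = "-processed"

      def run(): Unit = {
        // Referencing `suffix` inside the closure would drag `this` into it and
        // fail serialization if Job is not Serializable; use a local copy instead.
        val localSuffix = suffix
        val newStream = stream.map(x => x + localSuffix)
        newStream.print()
      }
    }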

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-04-04 Thread Andy Davidson
Hi Jeff, Sorry I did not respond sooner; I was out of town. Here is the code I use to initialize the HiveContext: # load data set from pyspark.sql import HiveContext #,SQLContext, Row # window functions require HiveContext (spark 2.x will not require hive) #sqlContext = SQLContext(sc) hiveSqlConte

Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
Hi, Yes, I believe people do that. I also believe that SparkML is able to figure out when to cache some internal RDD itself; that's definitely true for the random forest algorithm. It doesn't harm to cache the same RDD twice, either. But it's not clear what you'd want to know... -- Be well! Jean Morozov On S

Re: spark-shell failing but pyspark works

2016-04-04 Thread Mich Talebzadeh
Hi Cyril, You can connect to the Spark shell from any node. The connection is made to the master through --master and its IP address, like below: spark-shell --master spark://50.140.197.217:7077 Now in the Scala code you can specify something like below: val sparkConf = new SparkConf(). setAppName(
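The SparkConf above is cut off; a completed sketch, reusing the app name and master values that appear elsewhere in this thread (any extra settings are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sparkConf = new SparkConf()
      .setAppName("StreamTest")
      .setMaster("spark://50.140.197.217:7077")    // or "yarn-client" when running on YARN
      .set("spark.cores.max", "2")                 // placeholder setting
    val sc = new SparkContext(sparkConf)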

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
I suppose it doesn't work using spark-shell either? Can you confirm? Thanks > On Apr 3, 2016, at 03:39, Mich Talebzadeh wrote: > > This works fine for me > > val sparkConf = new SparkConf(). > setAppName("StreamTest"). > setMaster("yarn-client"). > set("sp

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-04 Thread Mich Talebzadeh
Actually this may not be a bug. It is just that the optimizer decides to do a nested loop join over a hash join when more than two OR conditions are involved. With one equality predicate a hash join is chosen: 4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2 5> WHERE SALES.CUST_ID = SALES2.CUST_ID 6> go QUERY PLAN
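A common workaround (not from the original thread) when an OR-of-equalities join falls back to a nested loop is to rewrite it as a union of equality joins, each of which can use a hash join; a sketch in Spark SQL, assuming registered tables t1 and t2 with join keys a and b (all hypothetical names):

    // UNION removes the exact duplicates produced when a row pair satisfies
    // both conditions, matching the OR join as long as the inputs are distinct.
    val joined = sqlContext.sql(
      """SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.a = t2.a
        |UNION
        |SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.b = t2.b""".stripMargin)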

Re: Support for time column type?

2016-04-04 Thread Philip Weaver
Hmm, yeah, it looks like I could use that to represent time since start of day. I'm porting existing large SQL queries from Postgres to Spark SQL for a quick POC, so I'd prefer not to have to make many changes to them. I'm not sure if the CalendarIntervalType can be used as a drop-in replacement (i.e.

Re: All inclusive uber-jar

2016-04-04 Thread Haroon Rasheed
To add to Mich, I put the build.sbt under the MyProject root folder: MyProject/build.sbt, and the assembly.sbt is placed in the folder called "project" under the MyProject folder: MyProject/project/assembly.sbt. Also, the first line in build.sbt imports the assembly keys as below: import A
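A minimal sketch of the layout described above; the plugin version, project name, and Scala version are placeholders, and newer sbt-assembly releases (0.14+) no longer need the AssemblyKeys import:

    // MyProject/project/assembly.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // MyProject/build.sbt
    import AssemblyKeys._        // the "import A..." line referenced above

    assemblySettings

    name := "MyProject"
    scalaVersion := "2.10.5"

    // Spark itself marked as provided so it is not bundled into the uber jar,
    // per Marco's note elsewhere in this thread.
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"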

Re: All inclusive uber-jar

2016-04-04 Thread Mich Talebzadeh
Thanks all. Actually I have a generic shell script that, for a given Scala program, creates a jar file using Maven or SBT. I modified that one to create an uber jar file as well using SBT assembly. The root directory structure for the uber build has one more sub-directory. In addition, SBT relies on a single sbt

Spark SQL (Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I am using this code snippet to exec
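If the truncated snippet is running hiveContext.sql(...), one workaround (an assumption, not from the thread) is to repartition the resulting DataFrame explicitly when 31 partitions is too few:

    // Sketch: `hiveContext`, the query, and the partition count are placeholders.
    val df = hiveContext.sql("SELECT * FROM some_table")
    val repartitioned = df.repartition(200)   // pick a count suited to your cluster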

SparkDriver throwing java.lang.OutOfMemoryError: Java heap space

2016-04-04 Thread Nirav Patel
Hi, We are using Spark 1.5.2 and recently started hitting this issue after our dataset grew from 140GB to 160GB. The error is thrown during shuffle fetch on the reduce side, which should all happen on executors, and the executors should report it! However it gets reported only on the driver. SparkContext gets shut down fr

Fwd: All inclusive uber-jar

2016-04-04 Thread vetal king
-- Forwarded message -- From: vetal king Date: Mon, Apr 4, 2016 at 8:59 PM Subject: Re: All inclusive uber-jar To: Mich Talebzadeh Not sure how to create an uber jar using sbt, but this is how you can do it using Maven: org.apache.maven.plugins

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there ? I ask because master branch changes very fast. It would be good to narrow the scope where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Michael Slavitch
Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly propagated to all nodes? Are they identical? > On Apr 4, 2016, at 9:12 AM, Mike Hynes <91m...@gmail.com> wrote: > > [ CC'ing dev list since nearly identical questions have occurred in > user list recently w/o resolution;

Re: All inclusive uber-jar

2016-04-04 Thread Marco Mistroni
Hi, you can use SBT assembly to create the uber jar. You should set the Spark libraries as 'provided' in your SBT build. Hth Marco Ps apologies if by any chance I'm telling you something you already know On 4 Apr 2016 2:36 pm, "Mich Talebzadeh" wrote: > Hi, > > > When one builds a project for Spark in this case Spark str

All inclusive uber-jar

2016-04-04 Thread Mich Talebzadeh
Hi, When one builds a project for Spark in this case Spark streaming with SBT, as usual I add dependencies as follows: libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1" However when I submit it

RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
[ CC'ing dev list since nearly identical questions have occurred in user list recently w/o resolution; c.f.: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-sing

Failed to get broadcast_1_piece0 of broadcast_1

2016-04-04 Thread Akhilesh Pathodia
Hi, I am running Spark jobs on YARN in cluster mode. The job gets messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time, it runs fine without any issue. If I kill the job and restart it, it throws the below exception in exec
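One pattern that often avoids this (an assumption about the cause, since the exception is truncated above): broadcast variables are not recovered from a checkpoint, so re-create them lazily in a singleton instead of capturing the original instance in the checkpointed DStream graph, along the lines of the broadcast/accumulator example in the streaming guide. Contents and type parameters below are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Lazily (re)built singleton so the broadcast exists again after a restart
    // from checkpoint.
    object LookupTable {
      @volatile private var instance: Broadcast[Map[String, String]] = _

      def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Map("key" -> "value"))   // placeholder contents
            }
          }
        }
        instance
      }
    }

    // Inside the job, fetch it per batch instead of closing over a stale broadcast:
    // dstream.foreachRDD { rdd =>
    //   val lookup = LookupTable.getInstance(rdd.sparkContext)
    //   rdd.foreach(record => lookup.value.get(record))
    // }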

Spark 1.6.1 binary pre-built for Hadoop 2.6 may be broken

2016-04-04 Thread Kousuke Saruta
Hi all, I noticed the binary pre-built for Hadoop 2.6, which we can download from spark.apache.org/downloads.html (Direct Download), may be broken. I couldn't decompress at least the following 4 tgzs with the "tar xfzv" command, and the md5 checksums didn't match. * spark-1.6.1-bin-hadoop2.6.tgz * spark-1.6.1-bin

Re:

2016-04-04 Thread Akhil Das
1 core with 4 partitions means it executes them one by one, not in parallel. For the Kafka question, if you don't have a high data volume then you may not need 40 partitions. Thanks Best Regards On Sat, Apr 2, 2016 at 7:35 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > As per

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
I didn't know you had a parquet file containing JSON data. Thanks Best Regards On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V wrote: > Hi Akhil, > > Thanks for your help. Why did you use "," as the separator? > > I have a parquet file which contains only JSON in each line. > > *Thanks*, >

Re: Read Parquet in Java Spark

2016-04-04 Thread Ramkumar V
Hi Akhil, Thanks for your help. Why did you use "," as the separator? I have a parquet file which contains only JSON in each line. *Thanks*, On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das wrote: > Something like this (in scala): > > val rdd = parquetFile.java

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
Something like this (in Scala): val rdd = parquetFile.javaRDD().map(row => row.mkString(",")) You can create a map operation over your javaRDD to convert the org.apache.spark.sql.Row to String (the Row.mkString() operati
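A slightly fuller sketch of the same idea (the path and context names are placeholders): read the Parquet file and turn each Row into a delimited string, which here gives back the JSON text stored in each row:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)          // assumes an existing SparkContext `sc`
    val parquetFile = sqlContext.read.parquet("/path/to/data.parquet")

    // DataFrame.map on Spark 1.x yields an RDD; mkString joins the Row's columns.
    val lines = parquetFile.map(row => row.mkString(","))
    lines.take(5).foreach(println)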

Re: How many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC

2016-04-04 Thread Mich Talebzadeh
This all depends on whether you provide information to the driver about the underlying RDBMS table; assuming there is a unique ID on the underlying table, you can use it to partition the load. Have a look at this: http://metricbrew.com/get-data-from-databases-with-apache-spark-jdbc/ HTH Dr Mich Talebzadeh
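A sketch of the partitioned JDBC read (URL, table, credentials, and bounds are placeholders): each partition opens its own connection, so with numPartitions = 10 up to 10 concurrent reads hit the database:

    import java.util.Properties
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)        // assumes an existing SparkContext `sc`
    val props = new Properties()
    props.setProperty("user", "spark")
    props.setProperty("password", "secret")

    val df = sqlContext.read.jdbc(
      "jdbc:mysql://dbhost:3306/mydb",   // url
      "orders",                          // table
      "id",                              // numeric column used to split the range
      1L,                                // lowerBound
      1000000L,                          // upperBound
      10,                                // numPartitions
      props)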

How many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC

2016-04-04 Thread Zhang, Jingyu
Hi All, I want to read MySQL from Spark. Please let me know how many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC. What is the best practice for reading a large dataset from an RDBMS into Spark? Thanks, Jingyu

Re: Where to set properties for the retainedJobs/Stages?

2016-04-04 Thread Max Schmidt
Okay, I put the props in spark-defaults.conf, but they are not recognized, as they don't appear in the 'Environment' tab during an application execution; spark.eventLog.enabled, for example. On 01.04.2016 at 21:22, Ted Yu wrote: > Please > read > https://spark.apache.org/docs/latest/configuration.h
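For reference, a sketch of the settings the thread is about as they would appear in spark-defaults.conf (values are placeholders); they only take effect if they reach the driver's configuration before the SparkContext starts, which is also when they show up in the Environment tab:

    spark.eventLog.enabled     true
    spark.eventLog.dir         hdfs:///spark-events
    spark.ui.retainedJobs      1000
    spark.ui.retainedStages    1000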