Re: help/suggestions to setup spark cluster

2017-04-26 Thread vincent gromakowski
Spark standalone is not multi-tenant; you need one cluster per job. Maybe you can try fair scheduling and use one cluster, but I doubt it will be prod ready... On 27 Apr 2017 5:28 AM, "anna stax" wrote: > Thanks Cody, > > As I already mentioned, I am running Spark Streaming on an EC2 cluster in >

Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Thanks Cody, As I already mentioned, I am running Spark Streaming on an EC2 cluster in standalone mode. Now, in addition to streaming, I want to be able to run a Spark batch job hourly and ad hoc queries using Zeppelin. Can you please confirm that a standalone cluster is OK for this? Please provide me so

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Cody Koeninger
The standalone cluster manager is fine for production. Don't use YARN or Mesos unless you already have another need for them. On Wed, Apr 26, 2017 at 4:53 PM, anna stax wrote: > Hi Sam, > > Thank you for the reply. > > What do you mean by > I doubt people run Spark in a single EC2 instance, certa

Re: How to create SparkSession using SparkConf?

2017-04-26 Thread kant kodali
I am using Spark 2.1 BTW. On Wed, Apr 26, 2017 at 3:22 PM, kant kodali wrote: > Hi All, > > I am wondering how to create a SparkSession using a SparkConf object? Although > I can see that most of the key-value pairs we set in SparkConf we can also > set in SparkSession or SparkSession.Builder, howev
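
For reference, a minimal sketch of the pattern being asked about, assuming Spark 2.1: SparkSession.Builder accepts an existing SparkConf via config(conf), so setJars can still be called on the SparkConf before the session is built. The app name, master, and jar path below are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SessionFromConf {
  public static void main(String[] args) {
    // Build the SparkConf as before; setJars is still available here.
    SparkConf conf = new SparkConf()
        .setAppName("session-from-conf")
        .setMaster("local[*]")                      // assumption: local run for illustration
        .setJars(new String[]{"/path/to/app.jar"}); // hypothetical jar to ship to executors

    // SparkSession.Builder.config(SparkConf) carries the whole conf into the session.
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    spark.range(5).show();
    spark.stop();
  }
}

Since setJars just sets the spark.jars property, the same thing can also be expressed as config("spark.jars", "...") directly on the builder.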

10th Spark Summit 2017 at Moscone Center

2017-04-26 Thread Jules Damji
Fellow Spark users, The Spark Summit Program Committee requested that I share with this Spark user group a few sessions and events they have added this year: a Hackathon, 1-day and 2-day training courses, 3 new tracks: Technical Deep Dive, Streaming and Machine Learning, and more… If you are planning to at

How to create SparkSession using SparkConf?

2017-04-26 Thread kant kodali
Hi All, I am wondering how to create a SparkSession using a SparkConf object? Although I can see that most of the key-value pairs we set in SparkConf can also be set in SparkSession or SparkSession.Builder, I don't see sparkConf.setJars, which is required, right? Because we want the driver jar t

Re: help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
Hi Sam, Thank you for the reply. What do you mean by "I doubt people run Spark in a single EC2 instance, certainly not in production I don't think"? What is wrong with having a data pipeline on EC2 that reads data from Kafka, processes it using Spark, and outputs to Cassandra? Please explain. Thanks -A

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Sam Elamin
Hi Anna, There are a variety of options for launching Spark clusters. I doubt people run Spark in a single EC2 instance, certainly not in production I don't think. I don't have enough information about what you are trying to do, but if you are just trying to set things up from scratch then I think you

help/suggestions to setup spark cluster

2017-04-26 Thread anna stax
I need to set up a Spark cluster for Spark Streaming, scheduled batch jobs, and ad hoc queries. Please give me some suggestions. Can this be done in standalone mode? Right now we have a Spark cluster in standalone mode on AWS EC2 running a Spark Streaming application. Can we run Spark batch jobs and

Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi, Good progress! Can you remove the metastore_db directory and start ./bin/pyspark over? I don't think starting from ~ is necessary. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
Have you read http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#kafka-itself ? On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric wrote: > The reason why I want to obtain this information, i.e. <topic, partition, offset, timestamp> tuples, is to relate the consumption with the production rates > using
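
For context, the linked section covers committing offset ranges back to Kafka itself after output. A rough Java sketch of that pattern, with placeholder broker address, group id, and topic name, might look like:

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;

public class CommitOffsetsExample {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("commit-offsets").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "example-group");            // placeholder group id
    kafkaParams.put("auto.offset.reset", "latest");
    kafkaParams.put("enable.auto.commit", false);            // commits happen via commitAsync below

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("events"), kafkaParams)); // placeholder topic

    stream.foreachRDD(rdd -> {
      OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
      // ... write the batch's results somewhere, then commit the processed ranges
      // back to Kafka's __consumer_offsets via the stream itself.
      ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

With enable.auto.commit set to false, offsets only reach __consumer_offsets after the batch's output has been written, which is the semantics the linked section describes.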

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
Sorry about that, Hangouts On Air broke in the first one :( On Wed, Apr 26, 2017 at 8:41 AM, Marco Mistroni wrote: > Uh, I stayed online in the other link but nobody joined. Will follow > transcript > Kr > > On 26 Apr 2017 9:35 am, "Holden Karau" wrote: > >> And the recording of our discussion

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
The reason why I want to obtain this information, i.e. <topic, partition, offset, timestamp> tuples, is to relate the consumption with the production rates using the __consumer_offsets Kafka internal topic. Interestingly, Spark's KafkaConsumer implementation does not auto-commit the offsets upon offset commit expiration, because

Re: weird error message

2017-04-26 Thread Afshin, Bardia
Kicking off the process from the ~ directory makes the message go away. I guess the metastore_db created is relative to the path where it's executed. FIX: kick off from the ~ directory: ./spark-2.1.0-bin-hadoop2.7/bin/pyspark From: "Afshin, Bardia" Date: Wednesday, April 26, 2017 at 9:47 AM To: Jacek Lasko

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
What is it you're actually trying to accomplish? You can get topic, partition, and offset bounds from an offset range like http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets Timestamp isn't really a meaningful idea for a range of offsets. On Tue, Apr 25
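
A small sketch of what the linked "obtaining offsets" section describes, assuming `stream` is the JavaInputDStream returned by KafkaUtils.createDirectStream: each batch's underlying RDD can be cast to HasOffsetRanges to read the topic, partition, and from/until offset bounds.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class OffsetBounds {
  // `stream` is assumed to come from KafkaUtils.createDirectStream (kafka-0-10 integration).
  static void logOffsetBounds(JavaInputDStream<ConsumerRecord<String, String>> stream) {
    stream.foreachRDD(rdd -> {
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
      for (OffsetRange o : ranges) {
        // topic, partition, and the [fromOffset, untilOffset) bounds for this batch
        System.out.println(o.topic() + " " + o.partition()
            + " " + o.fromOffset() + " " + o.untilOffset());
      }
    });
  }
}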

Calculate mode separately for multiple columns in row

2017-04-26 Thread Everett Anderson
Hi, One common situation I run across is that I want to compact my data and select the mode (most frequent value) in several columns for each group. Even calculating mode for one column in SQL is a bit tricky. The ways I've seen usually involve a nested sub-select with a group by + count and then
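
For what it's worth, one way to express a per-group mode in the DataFrame API is to count each (group, value) pair and then keep the top-ranked value per group with a window function. The sketch below uses made-up columns grp and val on toy data; extending it to several columns means repeating the count-and-rank step per column.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class ModePerGroup {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("mode-per-group").master("local[*]").getOrCreate();

    // Hypothetical toy data: group key `grp` and a single value column `val`.
    StructType schema = new StructType()
        .add("grp", DataTypes.StringType)
        .add("val", DataTypes.StringType);
    List<Row> rows = Arrays.asList(
        RowFactory.create("a", "x"), RowFactory.create("a", "x"),
        RowFactory.create("a", "y"), RowFactory.create("b", "z"));
    Dataset<Row> df = spark.createDataFrame(rows, schema);

    // Count each (group, value) pair, then keep the most frequent value per group.
    WindowSpec byGroup = Window.partitionBy("grp")
        .orderBy(col("cnt").desc(), col("val")); // tie-break on the value for determinism
    Dataset<Row> modes = df.groupBy("grp", "val").agg(count(lit(1)).alias("cnt"))
        .withColumn("rn", row_number().over(byGroup))
        .where(col("rn").equalTo(1))
        .select(col("grp"), col("val").alias("mode_val"));

    modes.show();
    spark.stop();
  }
}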

Re: weird error message

2017-04-26 Thread Afshin, Bardia
Thanks for the hint, I don't think so. I thought it's a permission issue, that it cannot read or write to ~/metastore_db, but the directory is definitely there: drwxrwx--- 5 ubuntu ubuntu 4.0K Apr 25 23:27 metastore_db Just re-ran the command from within the root Spark folder (./bin/pyspark) and the same

Last chance: ApacheCon is just three weeks away

2017-04-26 Thread Rich Bowen
ApacheCon is just three weeks away, in Miami, Florida, May 15th - 18th. http://apachecon.com/ There's still time to register and attend. ApacheCon is the best place to find out about tomorrow's software, today. ApacheCon is the official convention of The Apache Software Foundation, and includes t

Re: Spark Testing Library Discussion

2017-04-26 Thread Marco Mistroni
Uh, I stayed online in the other link but nobody joined. Will follow the transcript. Kr On 26 Apr 2017 9:35 am, "Holden Karau" wrote: > And the recording of our discussion is at https://www.youtube.com/watch?v=2q0uAldCQ8M > A few of us have follow-up things and we will try and do another meeting

Re: Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Subhash Sriram
Hi Devender, I have always gone with the 2nd approach, only so I don't have to chain a bunch of option() calls together. You should be able to use either. Thanks, Subhash Sent from my iPhone > On Apr 26, 2017, at 3:26 AM, Devender Yadav > wrote: > > Hi All, > > > I am using Spark 1.6.2

Re: Spark declines Mesos offers

2017-04-26 Thread Pavel Plotnikov
Michael Gummelt, Thanks!!! I forgot about debug logging! On Mon, Apr 24, 2017 at 9:30 PM Michael Gummelt wrote: > Have you run with debug logging? There are some hints in the debug logs: > https://github.com/apache/spark/blob/branch-2.1/mesos/src/main/scala/org/apache/spark/scheduler/cluster/

Re: Spark Testing Library Discussion

2017-04-26 Thread Holden Karau
And the recording of our discussion is at https://www.youtube.com/watch?v=2q0uAldCQ8M A few of us have follow up things and we will try and do another meeting in about a month or two :) On Tue, Apr 25, 2017 at 1:04 PM, Holden Karau wrote: > Urgh hangouts did something frustrating, updated link >

Re: Spark-SQL Query Optimization: overlapping ranges

2017-04-26 Thread Jacek Laskowski
Explain it and you'll know what happens under the covers, i.e. use explain on the Dataset. Jacek On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" wrote: > Hello Spark Users! > > Does the Spark optimization engine reduce overlapping column ranges? > If so, should it push this down to a Data Sourc
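
In case it helps, a tiny illustration of that suggestion (the data and predicate are made up): explain(true) prints the parsed, analyzed, optimized, and physical plans, so you can see whether overlapping ranges get collapsed and what would be pushed down to a data source.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainOverlap {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("explain-overlap").master("local[*]").getOrCreate();

    // Made-up data with a single numeric column `x`.
    Dataset<Row> df = spark.range(0, 100).toDF("x");

    // Two overlapping range predicates on the same column.
    Dataset<Row> filtered = df.where("x > 5 AND x > 3");

    // true => print parsed, analyzed, and optimized logical plans plus the physical plan;
    // the optimized plan shows how the filter ends up after Catalyst's rules.
    filtered.explain(true);

    spark.stop();
  }
}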

Re: weird error message

2017-04-26 Thread Jacek Laskowski
Hi, You've got two Spark sessions up and running (and given Spark SQL uses a Derby-managed Hive metastore, hence the issue). Please don't start spark-submit from inside bin; rather, bin/spark-submit... Jacek On 26 Apr 2017 1:57 a.m., "Afshin, Bardia" wrote: I'm having issues when I fire up pyspar

Create dataframe from RDBMS table using JDBC

2017-04-26 Thread Devender Yadav
Hi All, I am using Spark 1.6.2. Which is the suitable way to create a DataFrame from an RDBMS table? DataFrame df = sqlContext.read().format("jdbc").options(options).load(); or DataFrame df = sqlContext.read().jdbc(url, table, properties); Regards, Devender
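
For reference, a minimal sketch of both forms against a hypothetical PostgreSQL table, using the Spark 1.6 API from the question (URL, table name, and credentials are placeholders, and the matching JDBC driver jar needs to be on the classpath):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class JdbcReadExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("jdbc-read").setMaster("local[*]");
    SparkContext sc = new SparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    String url = "jdbc:postgresql://dbhost:5432/mydb"; // placeholder URL
    String table = "public.customers";                 // placeholder table

    // Option 1: generic reader with an options map.
    Map<String, String> options = new HashMap<>();
    options.put("url", url);
    options.put("dbtable", table);
    options.put("user", "spark");      // placeholder credentials
    options.put("password", "secret");
    DataFrame viaOptions = sqlContext.read().format("jdbc").options(options).load();

    // Option 2: the jdbc(...) shortcut with a Properties object; same result.
    Properties props = new Properties();
    props.setProperty("user", "spark");
    props.setProperty("password", "secret");
    DataFrame viaJdbc = sqlContext.read().jdbc(url, table, props);

    viaOptions.printSchema();
    viaJdbc.printSchema();

    sc.stop();
  }
}

Both end up at the same JDBC data source; the jdbc(...) form is essentially a shortcut that fills in the url and dbtable options for you.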