Re: Worker node timeout exception

2015-10-01 Thread Mark Luk
Here is the log file from the worker node: 15/09/30 23:49:37 INFO Worker: Executor app-20150930233113-/8 finished with state EXITED message Command exited with code 1 exitStatus 1 15/09/30 23:49:37 INFO Worker: Asked to launch executor app-20150930233113-/9 for PythonPi 15/09/30 23:49:37

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-10-01 Thread Hemant Bhanawat
As I understand, you don't need a merge of your historical data RDD with your RDD_inc; what you need is a merge of the computation results of your historical RDD with RDD_inc, and so on. IMO, you should consider having an external row store to hold your computations. I say this because you need to

calling persist would cause java.util.NoSuchElementException: key not found:

2015-10-01 Thread Eyad Sibai
Hi, I am trying to call .persist() on a dataframe, but once I execute the next line I get java.util.NoSuchElementException: key not found: …. I tried persisting to disk as well, with the same result. I am using: pyspark with python3, spark 1.5. Thanks! EYAD SIBAI Risk Engineer iZettle ® –

Re: Submitting with --deploy-mode cluster: uploading the jar

2015-10-01 Thread Christophe Schmitz
I am using standalone deployment with spark 1.4.1. When I submit the job, I get no error at the submission terminal. Then I check the webui; I can find the driver section which has my driver submission, with this error: java.io.FileNotFoundException ... which points to the full path of my jar as it

Clarification on Spark-Hadoop Configuration

2015-10-01 Thread Vinoth Sankar
Hi, I'm new to Spark. For my application I need to overwrite Hadoop configurations (I can't change the configurations in Hadoop itself as it might affect my regular HDFS), so that Namenode IPs get automatically resolved. What are the ways to do so? I tried giving "spark.hadoop.dfs.ha.namenodes.nn", "spark.hado

Re: Clarification on Spark-Hadoop Configuration

2015-10-01 Thread Sabarish Sasidharan
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh Regards Sab On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote: > Hi, > > I'm new to Spark. For my application I need to overwrite Hadoop > configurations (Can't change Configurations in Hadoop as it might affect my > regular HDFS), so

How to connect HadoopHA from spark

2015-10-01 Thread Vinoth Sankar
Hi, how do I connect to Hadoop HA from Spark? I tried overwriting the Hadoop configurations from SparkConf, but I'm still getting an UnknownHostException with the following trace: java.lang.IllegalArgumentException: java.net.UnknownHostException: ABC at org.apache.hadoop.security.SecurityUtil.buildTokenService(S

Re: How to connect HadoopHA from spark

2015-10-01 Thread Adam McElwee
Do you have all of the required HDFS HA config options in your override? I think these are the minimum required for HA: dfs.nameservices dfs.ha.namenodes.{nameservice ID} dfs.namenode.rpc-address.{nameservice ID}.{name node ID} On Thu, Oct 1, 2015 at 7:22 AM, Vinoth Sankar wrote: > Hi, > > How
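
A rough sketch of how these HA properties could be passed straight through SparkConf using the spark.hadoop.* prefix, which Spark copies into the Hadoop Configuration it builds; the nameservice ID, namenode IDs and host names below are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: supply the HDFS HA client settings without editing the cluster's hdfs-site.xml.
    // "mycluster", "nn1"/"nn2" and the host names are placeholders.
    val conf = new SparkConf()
      .setAppName("hdfs-ha-example")
      .set("spark.hadoop.dfs.nameservices", "mycluster")
      .set("spark.hadoop.dfs.ha.namenodes.mycluster", "nn1,nn2")
      .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020")
      .set("spark.hadoop.dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020")
      .set("spark.hadoop.dfs.client.failover.proxy.provider.mycluster",
           "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://mycluster/path/to/data")  // resolved via the HA nameservice

The failover proxy provider line is an addition on my part; the HDFS client usually needs it alongside the three properties listed above.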

automatic start of streaming job on failure on YARN

2015-10-01 Thread Jeetendra Gangele
We have a streaming application running on YARN and we would like to ensure that it is up and running 24/7. Is there a way to tell YARN to automatically restart a specific application on failure? There is the property yarn.resourcemanager.am.max-attempts, which defaults to 2; setting it to a bigger value is

Re: How to connect HadoopHA from spark

2015-10-01 Thread Ted Yu
Have you setup HADOOP_CONF_DIR in spark-env.sh correctly ? Cheers On Thu, Oct 1, 2015 at 5:22 AM, Vinoth Sankar wrote: > Hi, > > How do i connect HadoopHA from SPARK. I tried overwriting hadoop > configurations from sparkCong. But Still I'm getting UnknownHostException > with following trace >

Re: Kafka Direct Stream

2015-10-01 Thread Cody Koeninger
You can get the topic for a given partition from the offset range. You can either filter using that; or just have a single rdd and match on topic when doing mapPartitions or foreachPartition (which I think is a better idea) http://spark.apache.org/docs/latest/streaming-kafka-integration.html#appr
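
A minimal sketch of the single-RDD approach described above, matching on topic inside foreachPartition; the broker address, topic names and per-topic handling are hypothetical, and ssc is an existing StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // One direct stream over several topics.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")      // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("orders", "clicks"))                         // placeholder topics

    stream.foreachRDD { rdd =>
      // The offset ranges carry the topic/partition each RDD partition came from.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { records =>
        val range = offsetRanges(TaskContext.get.partitionId)
        range.topic match {
          case "orders" => records.foreach { case (_, v) => println(s"order: $v") }  // real logic here
          case "clicks" => records.foreach { case (_, v) => println(s"click: $v") }
          case _        => // other topics ignored in this sketch
        }
      }
    }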

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Cody Koeninger
Did you check your kafka broker logs to see what was going on during that time? The direct stream will handle normal leader loss / rebalance by retrying tasks. But the exception you got indicates that something with kafka was wrong, such that offsets were being re-used, i.e. your job already proce

Decision Tree Model

2015-10-01 Thread hishamm
Hi, I am using Spark 1.4.0, Python and Decision Trees to perform machine learning classification. I test it by creating the predictions and zipping them with the test data, as follows: predictions = tree_model.predict(test_data.map(lambda a: a.features)) labels = test_data.map(lambda a: a.label).z

Re: Lost leader exception in Kafka Direct for Streaming

2015-10-01 Thread Adrian Tanase
This also happened to me in extreme recovery scenarios – e.g. Killing 4 out of a 7 machine cluster. I’d put my money on recovering from an out of sync replica, although I haven’t done extensive testing around it. -adrian From: Cody Koeninger Date: Thursday, October 1, 2015 at 5:18 PM To: swet

Re: Kafka Direct Stream

2015-10-01 Thread Adrian Tanase
On top of that you could make the topic part of the key (e.g. keyBy in .transform or manually emitting a tuple) and use one of the .xxxByKey operators for the processing. If you have a stable, domain specific list of topics (e.g. 3-5 named topics) and the processing is really different, I would

Re: Deploying spark-streaming application on production

2015-10-01 Thread Jeetendra Gangele
Yes. Also I think I need to enable checkpointing and, rather than building up the lineage DAG, store the RDD data in HDFS. On 23 September 2015 at 01:04, Adrian Tanase wrote: > btw I re-read the docs and I want to clarify that reliable receiver + WAL > gives you at least once, not exactly

Re: automatic start of streaming job on failure on YARN

2015-10-01 Thread Adrian Tanase
This happens automatically as long as you submit with cluster mode instead of client mode. (e.g. ./spark-submit --master yarn-cluster …) The property you mention would help right after that, although you will need to set it to a large value (e.g. 1000?) - as there is no “infinite” support. -adri

Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread YaoPau
I'm trying to add multiple SerDe jars to my pyspark session. I got the first one working by changing my PYSPARK_SUBMIT_ARGS to: "--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory 3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar" But when I tried to add a s

Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Ted Yu
In your second command, have you tried changing the comma to colon ? Cheers On Thu, Oct 1, 2015 at 8:56 AM, YaoPau wrote: > I'm trying to add multiple SerDe jars to my pyspark session. > > I got the first one working by changing my PYSPARK_SUBMIT_ARGS to: > > "--master yarn-client --executor-co

Re: Pyspark: "Error: No main class set in JAR; please specify one with --class"

2015-10-01 Thread Marcelo Vanzin
How are you running the actual application? I find it slightly odd that you're setting PYSPARK_SUBMIT_ARGS directly; that's supposed to be an internal env variable used by Spark. You'd normally pass those parameters in the spark-submit (or pyspark) command line. On Thu, Oct 1, 2015 at 8:56 AM, Ya

How to access lost executor log file

2015-10-01 Thread Lan Jiang
Hi, there When running a Spark job on YARN, 2 executors somehow got lost during the execution. The message on the history server GUI is “CANNOT find address”. Two extra executors were launched by YARN and eventually finished the job. Usually I go to the “Executors” tab on the UI to check the e

Getting spark application driver ID programmatically

2015-10-01 Thread Snehal Nagmote
Hi , I have use case where we need to automate start/stop of spark streaming application. To stop spark job, we need driver/application id of the job . For example : /app/spark-master/bin/spark-class org.apache.spark.deploy.Client kill spark://10.65.169.242:7077 $driver_id I am thinking to get

Re: How to access lost executor log file

2015-10-01 Thread Ted Yu
Can you go to YARN RM UI to find all the attempts for this Spark Job ? The two lost executors should be found there. On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang wrote: > Hi, there > > When running a Spark job on YARN, 2 executors somehow got lost during the > execution. The message on the histor

Accumulator of rows?

2015-10-01 Thread Saif.A.Ellafi
Hi all, I need to repeat a couple of rows from a dataframe n times each. To do so, I plan to create a new DataFrame, but I am unable to find a way to accumulate "Rows" somewhere; as this might get huge, I can't accumulate into a mutable Array, I think. Thanks, Saif
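
One way to avoid accumulating rows on the driver at all is to expand each row inside a transformation and rebuild the DataFrame from the result; a minimal sketch, assuming a fixed repeat count n and an existing DataFrame df:

    // Repeat every row of df n times without collecting anything into driver memory.
    val n = 5  // hypothetical repeat count; it could also be computed per row inside the flatMap
    val repeatedRdd = df.rdd.flatMap(row => Seq.fill(n)(row))
    val repeatedDf = sqlContext.createDataFrame(repeatedRdd, df.schema)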

Re: How to access lost executor log file

2015-10-01 Thread Lan Jiang
Ted, Thanks for your reply. First of all, after sending the email to the mailing list, I used yarn logs -applicationId to retrieve the aggregated log successfully. I found the exceptions I was looking for. Now as to your suggestion, when I go to the YARN RM UI, I can only see the "Tracking URL" in t

OOM error in Spark worker

2015-10-01 Thread varun sharma
My workers are going OOM over time. I am running a streaming job in spark 1.4.0. Here is the heap dump of workers. *16,802 instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by "sun.misc.Launcher$AppClassLoader @ 0xdff94088" occupy 488,249,688 (95.80%) bytes. These instance

Re: Problem understanding spark word count execution

2015-10-01 Thread Kartik Mathur
Thanks Nicolae. So in my case all executors are sending results back to the driver, and "shuffle is just sending out the textFile to distribute the partitions"; could you please elaborate on this? What exactly is in this file? On Wed, Sep 30, 2015 at 9:57 PM, Nicolae Marasoiu < nicolae

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Hi Pala, Can you add the full stacktrace of the exception? For now, can you use create temporary function to workaround the issue? Thanks, Yin On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia < mchett...@rocketfuelinc.com.invalid> wrote: > +user list > > On Tue, Sep 29, 2015 at 3:43 PM, Pala M

Shuffle Write v/s Shuffle Read

2015-10-01 Thread Kartik Mathur
Hi, I am trying to better understand shuffle in Spark. Based on my understanding thus far: Shuffle Write writes the output of an intermediate stage to local disk if memory is not sufficient. For example, if each worker has 200 MB of memory for intermediate results and the results are 300 MB, then

Re: Problem understanding spark word count execution

2015-10-01 Thread Nicolae Marasoiu
Hi, So you say "sc.textFile -> flatMap -> Map". My understanding is like this: the first step is that a number of partitions is determined, say p of them; you can give a hint on this. Then the nodes which will load the p partitions are chosen, that is n nodes (where n<=p). Relatively at the same time or not, the n nodes s
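
For reference, a minimal Scala word count along the lines of the pipeline being discussed; the shuffle happens at reduceByKey, where each node writes its partial (word, count) output (shuffle write) and the reducers fetch it (shuffle read). The paths are placeholders:

    // The map side runs on the nodes holding the input partitions; reduceByKey triggers the shuffle.
    val counts = sc.textFile("hdfs:///path/to/input")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///path/to/output")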

"java.io.IOException: Filesystem closed" on executors

2015-10-01 Thread Lan Jiang
Hi there, Here is the problem I ran into when executing a Spark job (Spark 1.3). The spark job loads a bunch of avro files using the Spark SQL spark-avro 1.0.0 library. Then it does some filter/map transformations, repartitions to 1 partition and then writes to HDFS. It creates 2 stages. The total H

Re: Kafka Direct Stream

2015-10-01 Thread Nicolae Marasoiu
Hi, If you just need processing per topic, why not generate N different kafka direct streams? When creating a kafka direct stream you give a list of topics - just give one. Then the reusable part of your computations should be extractable as transformations/functions and reused between the st
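
A short sketch of that per-topic variant, with the shared logic factored into an ordinary function; the broker, topics and the transformation itself are placeholders, and ssc is an existing StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Shared processing reused across all per-topic streams (placeholder logic).
    def process(stream: DStream[(String, String)]): DStream[String] =
      stream.map { case (_, value) => value.toUpperCase }

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker
    val perTopic = Seq("orders", "clicks").map { topic =>             // placeholder topics
      topic -> process(KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic)))
    }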

Java REST custom receiver

2015-10-01 Thread Pavol Loffay
Hello, is it possible to implement a custom receiver [1] which will receive messages from REST calls? As REST classes in Java (JAX-RS) are defined declaratively and instantiated by the application server, I'm not sure if it is possible. I have tried to implement a custom receiver which is injected into the REST clas

Re: Problem understanding spark word count execution

2015-10-01 Thread Kartik Mathur
Hi Nicolae, Thanks for the reply. To further clarify things - sc.textFile is reading from HDFS; now shouldn't the file be read in a way such that EACH executor works on only the local copy of the file part available? In this case it's a ~4.64 GB file and the block size is 256MB, so approx 19 partitions w

spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Sourabh Chandak
Hi, I am writing a spark streaming job using the direct stream method for kafka and wanted to handle the case of checkpoint failure, when we'll have to reprocess the entire data from the start. By default, for every new checkpoint it tries to load everything from each partition, and that takes a lot o

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Cody Koeninger
That depends on your job, your cluster resources, the number of seconds per batch... You'll need to do some empirical work to figure out how many messages per batch a given executor can handle. Divide that by the number of seconds per batch. On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak wro
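
As a purely hypothetical worked example of that estimate: if one executor is measured to comfortably process about 60,000 messages from a single partition per 30-second batch, that is roughly 60,000 / 30 = 2,000 messages per partition per second, which would be a starting value for spark.streaming.kafka.maxRatePerPartition.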

python version in spark-submit

2015-10-01 Thread roy
Hi, We have python2.6 (default) on the cluster and we have also installed python2.7. I was looking for a way to set the python version in spark-submit. Does anyone know how to do this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/python-version-in-spark-submi

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Pala M Muthaia
Thanks for getting back, Yin. I have copied the stack below. The associated query is just this: "hc.sql("select murmurhash3('abc') from dual")". The UDF murmurhash3 is already available in our hive metastore. Regarding the temporary function, can I create a temp function with existing Hive UDF code, no

Re: python version in spark-submit

2015-10-01 Thread Ted Yu
PYSPARK_PYTHON determines what the worker uses. PYSPARK_DRIVER_PYTHON is for driver. See the comment at the beginning of bin/pyspark FYI On Thu, Oct 1, 2015 at 1:56 PM, roy wrote: > Hi, > > We have python2.6 (default) on cluster and also we have installed > python2.7. > > I was looking a way

SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-01 Thread haridass saisriram
Hi, I am trying to find a simple example of reading a data file on HDFS. The file has the following format: a, b, c, yyyy, mm a1,b1,c1,2015,09 a2,b2,c2,2014,08 I would like to read this file and store it in HDFS partitioned by year and month, something like this: /path/to/hdfs/yyyy/mm I want to
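
A minimal sketch of one way to do this with DataFrames, assuming Spark 1.4+ and the spark-csv package; the paths are placeholders and the yyyy/mm column names mirror the example above. Note that partitionBy produces a yyyy=2015/mm=09 style directory layout rather than bare yyyy/mm:

    // Read the comma-separated file, then let DataFrameWriter lay out the output by partition columns.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///path/to/input.csv")          // placeholder path

    df.write
      .partitionBy("yyyy", "mm")                  // partition columns from the header
      .format("parquet")                          // or another supported output format
      .save("hdfs:///path/to/output")             // placeholder path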

Re: Java REST custom receiver

2015-10-01 Thread Silvio Fiorito
When you say “receive messages” you mean acting as a REST endpoint, right? If so, it might be better to use the JMS (or Kafka) option for a few reasons: The receiver will be deployed to any of the available executors, so your REST clients will need to be made aware of the IP where the receiver is ru

Call Site - Spark Context

2015-10-01 Thread Sandip Mehta
Hi, I wanted to understand what is the purpose of Call Site in Spark Context? Regards SM

How to save DataFrame as a Table in Hbase?

2015-10-01 Thread unk1102
Hi, has anybody tried to save a DataFrame to HBase? I have processed data in a DataFrame which I need to store in HBase so that my web UI can access it from HBase. Please guide. Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-DataFrame-

Re: Standalone Scala Project

2015-10-01 Thread Robineast
I've eyeballed the sbt file and it looks ok to me. Try sbt clean package; that should sort it out. If not, please supply the full code you are running - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-act

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Yes. You can use create temporary function to create a function based on a Hive UDF ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction ). Regarding the error, I think the problem is that starting from Spark 1.4, we have two separate H
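
A sketch of that workaround from the HiveContext used in the thread; the fully qualified UDF class name is a placeholder for the real implementation class:

    // Register the existing Hive UDF class under a temporary name for this session, then use it.
    hc.sql("CREATE TEMPORARY FUNCTION murmurhash3 AS 'com.example.hive.udf.MurmurHash3'")
    hc.sql("SELECT murmurhash3('abc') FROM dual").show()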

How to Set Retry Policy in Spark

2015-10-01 Thread Renxia Wang
Hi guys, I know there is a way to set the number of retries for failed tasks, using spark.task.maxFailures. What is the default retry policy for failed tasks? Is it exponential backoff? My tasks sometimes fail because of socket connection timeout/reset; even with retry, some of the tasks will f

Re: Call Site - Spark Context

2015-10-01 Thread Sandip Mehta
Thanks Robin. Regards SM > On 01-Oct-2015, at 3:15 pm, Robin East wrote: > > From the comments in the code: > > When called inside a class in the spark package, returns the name of the user > code class (outside the spark package) that called into Spark, as well as > which Spark method they

Spark cluster - use machine name in WorkerID, not IP address

2015-10-01 Thread markluk
I'm running a standalone Spark cluster of 1 master and 2 slaves. My slaves file under /conf lists the fully qualified domain names of the 2 slave machines. When I look at the Spark web page (on :8080), I see my 2 workers, but the worker ID uses the IP address, like worker-20151001153012-172.31.51.1

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-01 Thread Utkarsh Sengar
Bumping it up; it's not really a blocking issue. But fine grained mode eats up an uncertain number of resources in mesos and launches tons of tasks, so I would prefer using the coarse grained mode if only it didn't run out of memory. Thanks, -Utkarsh On Mon, Sep 28, 2015 at 2:24 PM, Utkarsh Sengar wro

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-01 Thread Tim Chen
Hi Utkarsh, I replied earlier asking what your task assignment looks like with fine vs coarse grain mode. Tim On Thu, Oct 1, 2015 at 4:05 PM, Utkarsh Sengar wrote: > Bumping it up, its not really a blocking issue. > But fine grain mode eats up uncertain number of resources in mesos and

Re: How to Set Retry Policy in Spark

2015-10-01 Thread Renxia Wang
Additional Info: I am running Spark on YARN. 2015-10-01 15:42 GMT-07:00 Renxia Wang : > Hi guys, > > I know there is a way to set the number of retry of failed tasks, using > spark.task.maxFailures. what is the default policy for the failed tasks > retry? Is it exponential backoff? My tasks somet

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-01 Thread Utkarsh Sengar
Not sure what you mean by that; I shared the data which I see in the Spark UI. Can you point me to a location where I can precisely get the data you need? When I run the job in fine grained mode, I see tons of tasks created and destroyed under a mesos "framework". I have about 80k spark tasks which I

RE: Problem understanding spark word count execution

2015-10-01 Thread java8964
I am not sure about the original explanation of shuffle write. In the word count example, the shuffle is needed, as Spark has to group by the word (reduceByKey is more accurate here). Imagine that you have 2 mappers to read the data; then each mapper will generate the (word, count) tuple output in segment

Re: Spark streaming job filling a lot of data in local spark nodes

2015-10-01 Thread swetha kasireddy
We have limited disk space. So, can we use spark.cleaner.ttl to clean up the files? Or is there any setting that can clean up old temp files? On Mon, Sep 28, 2015 at 7:02 PM, Shixiong Zhu wrote: > These files are created by shuffle and just some temp files. They are not > necessary for checkpoin

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.0 users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1 http://spark.apach

Re: How to access lost executor log file

2015-10-01 Thread Ted Yu
Looks like the spark history server should take the lost executors into account by analyzing the output from the 'yarn logs -applicationId' command. Cheers On Thu, Oct 1, 2015 at 11:46 AM, Lan Jiang wrote: > Ted, > > Thanks for your reply. > > First of all, after sending email to the mailing list,

spark-submit --packages using different resolver

2015-10-01 Thread Jerry Lam
Hi spark users and developers, I'm trying to use spark-submit --packages against a private s3 repository. With sbt, I'm using fm-sbt-s3-resolver with proper aws s3 credentials. I wonder how I can add this resolver to spark-submit such that --packages can resolve dependencies from the private repo? Th

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Nicolae Marasoiu
Hi, Set 10ms and spark.streaming.backpressure.enabled=true This should automatically delay the next batch until the current one is processed, or at least create that balance over a few batches/periods between the consume/process rate vs ingestion rate. Nicu
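
As a sketch, both settings go on the SparkConf (or spark-defaults) used to build the streaming context; the rate cap and batch interval below are hypothetical values:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Backpressure (Spark 1.5+) adapts the ingestion rate to the processing rate;
    // maxRatePerPartition remains as a hard safety cap. The values below are placeholders.
    val conf = new SparkConf()
      .setAppName("kafka-direct-backpressure")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
    val ssc = new StreamingContext(conf, Seconds(10))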

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-01 Thread Sourabh Chandak
Thanks Cody, will try to do some estimation. Thanks Nicolae, will try out this config. Thanks, Sourabh On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > Set 10ms and spark.streaming.backpressure.enabled=true > > > This should automatically dela

Checkpointing is super slow

2015-10-01 Thread Sourabh Chandak
Hi, I have a receiverless kafka streaming job which was started yesterday evening and was running fine till 4 PM today. Suddenly after that, writing of the checkpoint has slowed down and it is now not able to catch up with the incoming data. We are using the DSE stack with Spark 1.2 and Cassandra for ch

Re: How to save DataFrame as a Table in Hbase?

2015-10-01 Thread Nicolae Marasoiu
Hi, Phoenix, an SQL coprocessor for HBase, has ingestion integration with dataframes in its 4.x versions. For HBase and RDDs in general there are multiple solutions: the hbase-spark module by Cloudera, which will be part of a future HBase release, hbase-rdd by unicredit, and many others. I am not sure if t
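
For the Phoenix route, a rough sketch of what the DataFrame save looks like with the phoenix-spark integration (Phoenix 4.4+); the table name and ZooKeeper quorum are placeholders, and the exact options should be checked against the Phoenix docs for your version:

    // Assumes the phoenix-spark dependency is on the classpath and the target table
    // already exists in Phoenix. Table name and zkUrl below are hypothetical.
    df.write
      .format("org.apache.phoenix.spark")
      .mode("overwrite")
      .option("table", "MY_OUTPUT_TABLE")
      .option("zkUrl", "zkhost1,zkhost2,zkhost3:2181")
      .save()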

Re: calling persist would cause java.util.NoSuchElementException: key not found:

2015-10-01 Thread Shixiong Zhu
Do you have the full stack trace? Could you check if it's same as https://issues.apache.org/jira/browse/SPARK-10422 Best Regards, Shixiong Zhu 2015-10-01 17:05 GMT+08:00 Eyad Sibai : > Hi > > I am trying to call .persist() on a dataframe but once I execute the next > line I am getting > java.uti

Addition of Meetup Group - Sydney, Melbourne Australia

2015-10-01 Thread Andy Huang
Hi, Could someone please help with adding the following Spark Meetup Groups to the Meetups section of http://spark.apache.org/community.html Sydney Spark Meetup Group: http://www.meetup.com/Sydney-Apache-Spark-User-Group/ Melbourne Spark Meetup Group: http://www.meetup.com/Melbourne-Apache-Spark-