Dealing with failures

2016-06-07 Thread Mohit Anchlia
I am looking to write an ETL job using spark that reads data from the source, performs transformations, and inserts it into the destination. I am trying to understand how spark deals with failures; I can't seem to find the documentation. I am interested in learning about the following scenarios: 1. Source be

Re: Dealing with failures

2016-06-08 Thread Mohit Anchlia
On Wed, Jun 8, 2016 at 3:42 AM, Jacek Laskowski wrote: > On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia > wrote: > > I am looking to write an ETL job using spark that reads data from the > > source, perform transformation and insert it into the destination. > > Is this

Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Is something similar to spark.streaming.receiver.writeAheadLog.enable available on SparkContext? It looks like it only works for spark streaming.
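
For reference, the streaming write-ahead log is turned on with a configuration flag plus a checkpoint directory; as the thread notes, there is no SparkContext equivalent. A minimal sketch, assuming Spark Streaming 1.x and an illustrative checkpoint path:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Sketch: the WAL only applies to receiver-based streaming, and it
    // requires a checkpoint directory to store the log segments.
    SparkConf conf = new SparkConf()
        .setAppName("wal-demo")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
    jssc.checkpoint("hdfs:///tmp/wal-checkpoint"); // assumed path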

Re: Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia > wrote: > >> Is something similar to spark.streaming.receiver.writeAheadLog.enable >> available on SparkContext? It looks like it only works for spark streaming. >> > >

SparkR

2015-07-27 Thread Mohit Anchlia
Does SparkR support all the algorithms that the R library supports?

Checkpoint Dir Error in Yarn

2015-08-07 Thread Mohit Anchlia
I am running in yarn-client mode and trying to execute network word count example. When I connect through nc I see the following in spark app logs: Exception in thread "main" java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use StreamingContext.checkpoi
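
The assertion is raised when a stateful operation (or the WAL) is used without a checkpoint directory; setting one on the StreamingContext before the job starts clears it. A sketch with an assumed HDFS path:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf().setAppName("network-word-count");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
    // Must be set before any operation that needs checkpointing runs:
    jssc.checkpoint("hdfs:///tmp/streaming-checkpoint"); // assumed path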

Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am trying to run this program as a yarn-client. The job seems to submit successfully; however, I don't see any process listening on this host on port https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.j

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
are running on a cluster, the listening is occurring on one of the > executors, not in the driver. > > On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia > wrote: > >> I am trying to run this program as a yarn-client. The job seems to be >> submitting successfully however I don&

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
park-submit. The executors are running in the YARN cluster nodes, and the > socket receiver listening on port is running in one of the executors. > > On Mon, Aug 10, 2015 at 11:43 AM, Mohit Anchlia > wrote: > >> I am running as a yarn-client which probably means that the progr

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
submit your app in client mode on that to see whether you are seeing the > process listening on or not. > > On Mon, Aug 10, 2015 at 12:43 PM, Mohit Anchlia > wrote: > >> I've verified all the executors and I don't see a process listening on >> the port. Ho

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I do see this message: 15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia wrote: > > I am using the same exac

ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
I am seeing the following error. I think it's not able to find some other associated classes, as I see "$2" in the exception, but I am not sure what I am missing. 15/08/11 16:00:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 50, ip-10-241-251-141.us-west-2.compute.internal): java.lang.Cl

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
. Please note I am using it in yarn On Tue, Aug 11, 2015 at 1:52 PM, Mohit Anchlia wrote: > I am seeing following error. I think it's not able to find some other > associated classes as I see "$2" in the exception, but not sure what I am > missing. > > > 15/08

Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
How does partitioning in spark work when it comes to streaming? What's the best way to partition time series data grouped by a certain tag, like product categories (video, music, etc.)?
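
One common pattern is to key each record by its tag so the hash partitioner groups a category's data together; a sketch, where events is the input DStream and extractCategory() is a hypothetical parser for the incoming records:

    import scala.Tuple2;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Sketch: key time-series records by category (video, music, ...).
    JavaPairDStream<String, String> byCategory =
        events.mapToPair(e -> new Tuple2<>(extractCategory(e), e)); // hypothetical helper
    // The partitioning takes effect on the next shuffle, e.g. a groupByKey:
    JavaPairDStream<String, Iterable<String>> grouped = byCategory.groupByKey();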

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
After changing to '--deploy_mode client' the program seems to work; however, it looks like there is a bug in spark when using --deploy_mode 'yarn'. Should I open a bug? On Tue, Aug 11, 2015 at 3:02 PM, Mohit Anchlia wrote: > I see the following line in the log &

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
partition too much. > > Best > Ayan > > On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia > wrote: > >> How does partitioning in spark work when it comes to streaming? What's >> the best way to partition a time series data grouped by a certain tag like >> cate

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
generated during the > batchInterval are partitions of the RDD. > > Now if you want to repartition based on a key, a shuffle is needed. > > On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia > wrote: > >> How does partitioning in spark work when it comes to streaming? What's &g

Unit Testing

2015-08-12 Thread Mohit Anchlia
Is there a way to run spark streaming methods in a standalone Eclipse environment to test out the functionality?
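
One way that works without a cluster is to build the StreamingContext against local[2] and feed batches through queueStream, so no external source is needed; a minimal sketch under those assumptions:

    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.Queue;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Sketch: drive a streaming job from a plain JUnit/Eclipse run.
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("test");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
    Queue<JavaRDD<String>> queue = new LinkedList<>();
    queue.add(jssc.sparkContext().parallelize(Arrays.asList("a", "b", "a")));
    JavaDStream<String> input = jssc.queueStream(queue);
    input.print();
    jssc.start();
    jssc.awaitTerminationOrTimeout(5000); // stop after one short run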

Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I have this call trying to save to hdfs 2.6: wordCounts.saveAsNewAPIHadoopFiles("prefix", "txt"); but I am getting the following: java.lang.RuntimeException: class scala.runtime.Nothing$ not org.apache.hadoop.mapreduce.OutputFormat
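
The Nothing$ message usually means the overload without an OutputFormat class was used, so Scala infers Nothing for the format type; supplying the key, value, and format classes explicitly is the usual fix. A sketch, assuming wordCounts has been converted to Text/IntWritable pairs and an illustrative output path:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Sketch: explicit classes so no type is left for Scala to infer as Nothing$.
    wordCounts.saveAsNewAPIHadoopFiles(
        "hdfs:///tmp/wordcount/part", "txt",                 // assumed prefix/suffix
        Text.class, IntWritable.class, TextOutputFormat.class);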

Re: Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I was able to get this working by using an alternative method; however, I only see 0-byte files in hadoop. I've verified that the output does exist in the logs, but it's missing from hdfs. On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia wrote: > I have this call trying to sav

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
ee the data: $ ls -ltr !$ ls -ltr /tmp/out -rw-r--r-- 1 yarn yarn 5230 Aug 13 15:45 /tmp/out On Fri, Aug 14, 2015 at 6:15 AM, Ted Yu wrote: > Which Spark release are you using ? > > Can you show us snippet of your code ? > > Have you checked namenode log ? > > Thanks >

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
prefix: String, > suffix: String, > keyClass: Class[_], > valueClass: Class[_], > outputFormatClass: Class[F]) { > > Did you intend to use outputPath as prefix ? > > Cheers > > > On Fri, Aug 14, 2015 at 1:36 PM, Mohit Anchlia > wrote: > &

Executors on multiple nodes

2015-08-14 Thread Mohit Anchlia
I am running on Yarn and have a question about how spark runs executors on different data nodes. Is that primarily decided based on the number of receivers? What do I need to do to ensure that multiple nodes are being used for data processing?

Too many files/dirs in hdfs

2015-08-14 Thread Mohit Anchlia
Spark stream seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here: 1) Extra unnecessary files are being created from the output 2) Hadoop doesn't work really well with too many files and I see that it is creating a directory with a timestamp every 1 second. Is
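
One way to cut the number of per-second directories is to widen the write interval with a window, so output is materialized once per window rather than once per batch; a sketch assuming a 60-second tumbling window over 1-second batches (empty windows can still produce empty files, which then need a separate cleanup):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.spark.streaming.Duration;

    // Sketch: one output directory per minute instead of one per second.
    // counts is assumed to be a JavaPairDStream<Text, IntWritable>.
    counts.window(new Duration(60000), new Duration(60000))
          .saveAsNewAPIHadoopFiles("hdfs:///tmp/out/part", "txt",   // assumed path
              Text.class, IntWritable.class, TextOutputFormat.class);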

Re: Too many files/dirs in hdfs

2015-08-18 Thread Mohit Anchlia
er-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167> >> or have a separate program which will do the clean up for you. >> >> Thanks >> Best Regards >> >> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia >> wr

Re: Too many files/dirs in hdfs

2015-08-19 Thread Mohit Anchlia
applying your filters > 3) Write this StringBuilder to File when you want to write (The duration > can be defined as a condition) > > On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia > wrote: > >> Is there a way to store all the results in one file and keep the file >

Re: Too many files/dirs in hdfs

2015-08-24 Thread Mohit Anchlia
Any help would be appreciated On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia wrote: > My question was how to do this in Hadoop? Could somebody point me to some > examples? > > On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY > wrote: > >> Of course, Java or Scala can

Re: Too many files/dirs in hdfs

2015-08-25 Thread Mohit Anchlia
file, which seems like an extra overhead from a maintenance and IO perspective. On Mon, Aug 24, 2015 at 2:51 PM, Mohit Anchlia wrote: > Any help would be appreciated > > On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia > wrote: > >> My question was how to do this in Hadoop?

Spark directory partition name

2017-10-16 Thread Mohit Anchlia
When spark writes a partition it writes the directory in the format <key name>=<key value>. Is there a way to have spark write only the <key value>?
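
For context, the <key name>=<key value> directory layout is what the DataFrame writer's partitionBy produces, and to my knowledge there is no writer option to emit the value alone; any renaming would have to happen outside Spark. A sketch of the default behavior (Spark 2.x; path and column name are illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("partition-demo").getOrCreate();
    Dataset<Row> df = spark.read().json("hdfs:///tmp/events.json"); // assumed input
    // Produces directories like /tmp/out/category=video/, /tmp/out/category=music/, ...
    df.write().partitionBy("category").parquet("hdfs:///tmp/out");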

Null array of cols

2017-10-24 Thread Mohit Anchlia
I am trying to understand the best way to handle the scenario where a null array "[]" is passed. Can somebody suggest if there is a way to filter out such records? I've tried numerous things, including dataframe.head().isEmpty, but pyspark doesn't recognize isEmpty even though I see it in the API
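
One approach that sidesteps head().isEmpty is to filter on the array's size. A sketch against the DataFrame API, where the column name arr is an assumption; in pyspark the equivalent would be df.filter(size(col("arr")) > 0):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.size;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Sketch: size() is 0 for an empty array and -1 for null (Spark 2.x default),
    // so gt(0) drops both empty and null arrays.
    Dataset<Row> nonEmpty = df.filter(size(col("arr")).gt(0));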

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
ect classpath. > > On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia > wrote: > > I am trying out streaming example as documented and I am using spark > 1.2.1 > > streaming from maven for Java. > > > > When I add this code I get compilation error on and eclipse is not

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
is indeed insufficient and you have to > download all the recursive dependencies. May be you should create a Maven > project inside Eclipse? > > TD > > On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia > wrote: > >> How do I do that? I haven't used Scala bef

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
Scala deps transitively and can import scala.* > classes. However, it would be a little bit more correct to depend > directly on the scala library classes, but in practice, easiest not to > in simple use cases. > > If you're still having trouble look at the output of "mvn depende

Re: Compilation error

2015-03-10 Thread Mohit Anchlia
f I should delete that file from local repo and re-sync On Tue, Mar 10, 2015 at 1:08 PM, Mohit Anchlia wrote: > I ran the dependency command and see the following dependencies: > > I only see org.scala-lang. > > [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT > > [

Compilation error on JavaPairDStream

2015-03-10 Thread Mohit Anchlia
I am getting the following error. When I look at the sources it seems to be a scala source, but I am not sure why it's complaining about it. The method map(Function) in the type JavaDStream is not applicable for the arguments (new PairFunction(){}) And my code has been taken from the spark examples site

Hadoop Map vs Spark stream Map

2015-03-10 Thread Mohit Anchlia
Hi, I am trying to understand the Hadoop Map method compared to the spark Map, and I noticed that spark Map only receives 3 arguments: 1) input value 2) output key 3) output value, whereas hadoop map has 4 values: 1) input key 2) input value 3) output key 4) output value. Is there any reason it was des
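
The asymmetry is because Spark's map is simply element-to-result: there is no separate input key, since any key travels inside the element itself, and key/value output is produced by mapToPair rather than by the map signature. A sketch on an RDD (sc is an assumed JavaSparkContext, the path is illustrative):

    import scala.Tuple2;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt"); // assumed path
    // map: one element in, one result out; no input key in the signature.
    JavaRDD<Integer> lengths = lines.map(s -> s.length());
    // Key/value output, when needed, comes from mapToPair:
    JavaPairRDD<String, Integer> pairs =
        lines.mapToPair(s -> new Tuple2<>(s, s.length()));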

Re: Compilation error on JavaPairDStream

2015-03-10 Thread Mohit Anchlia
works now. I should have checked :) On Tue, Mar 10, 2015 at 1:44 PM, Sean Owen wrote: > Ah, that's a typo in the example: use words.mapToPair > I can make a little PR to fix that. > > On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia > wrote: > > I am getting following
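
For reference, a sketch of the corrected fragment from this thread, in the pre-lambda Java style of the 1.2 examples, with the scala.Tuple2 import that also trips Eclipse up; words is the JavaDStream<String> from the example:

    import scala.Tuple2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // map(Function) cannot produce pairs in the Java API; mapToPair is required.
    JavaPairDStream<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
          }
        });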

SQL with Spark Streaming

2015-03-10 Thread Mohit Anchlia
Does Spark Streaming also support SQL? Something like how Esper does CEP.
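
DStreams have no SQL surface of their own in that era, but each micro-batch's RDD can be registered as a table inside foreachRDD; a sketch assuming Spark 1.3+, where Event is a hypothetical JavaBean and the query is illustrative:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Sketch: run SQL once per micro-batch.
    events.foreachRDD(new Function<JavaRDD<Event>, Void>() {
      public Void call(JavaRDD<Event> rdd) {
        SQLContext sqlContext = new SQLContext(rdd.context());
        DataFrame df = sqlContext.createDataFrame(rdd, Event.class);
        df.registerTempTable("events"); // visible to this batch's queries
        sqlContext.sql("SELECT category, COUNT(*) FROM events GROUP BY category").show();
        return null;
      }
    });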

Architecture Documentation

2015-03-11 Thread Mohit Anchlia
Is there a good architecture doc that gives a sufficient overview of the high-level and low-level details of spark, with some good diagrams?

Compilation error

2015-03-12 Thread Mohit Anchlia
I am trying out the streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get a compilation error, and eclipse is not able to recognize Tuple2. I also don't see any "import scala.Tuple2" class. http://spark.apache.org/docs/1.2.0/streaming-pr

Re: Compilation error

2015-03-12 Thread Mohit Anchlia
It works after sync, thanks for the pointers On Tue, Mar 10, 2015 at 1:22 PM, Mohit Anchlia wrote: > I navigated to maven dependency and found scala library. I also found > Tuple2.class and when I click on it in eclipse I get "invalid LOC header > (bad signature)" > > j

Partitioning

2015-03-13 Thread Mohit Anchlia
I am trying to look for documentation on partitioning, which I can't seem to find. I am looking at spark streaming and was wondering how it partitions RDDs in a multi-node environment. Where are the keys defined that are used for partitioning? For instance, in the below example keys seem to be impli

Unable to connect

2015-03-13 Thread Mohit Anchlia
I am running spark streaming standalone in ec2 and I am trying to run the wordcount example from my desktop. The program is unable to connect to the master; in the logs I see the following, which seems to be an issue with the hostname. 15/03/13 17:37:44 ERROR EndpointWriter: dropping message [class akka.actor.ActorSele
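
With the Akka transport in Spark 1.x, the master URL the driver uses has to match, character for character, the URL the master bound to (the spark://... string shown at the top of the master web UI); using an IP where the master registered a hostname, or vice versa, gets messages dropped exactly like this. A sketch of setting it explicitly:

    import org.apache.spark.SparkConf;

    // Sketch: copy the spark://host:port value verbatim from the master UI.
    SparkConf conf = new SparkConf()
        .setAppName("wordcount")
        .setMaster("spark://ip-10-241-251-232:7077");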

Re: Partitioning

2015-03-13 Thread Mohit Anchlia
docs). But use it at your own risk. If you modify the > keys, and yet preserve partitioning, the partitioning would not make sense > any more as the hash of the keys have changed. > > TD > > > > On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia > wrote: > >> I am try

Load balancing

2015-03-19 Thread Mohit Anchlia
I am trying to understand how to load balance incoming data to multiple spark streaming workers. Could somebody help me understand how I can distribute my incoming data from various sources such that it goes to multiple spark streaming nodes? Is it done by the spark client with help
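
A common pattern is to create several receivers, one per input stream, and union them so ingestion is spread over multiple executors; a sketch with two socket receivers (host names and port are illustrative, jssc is the streaming context):

    import org.apache.spark.streaming.api.java.JavaDStream;

    // Sketch: each socketTextStream gets its own receiver, placed on an executor.
    JavaDStream<String> s1 = jssc.socketTextStream("source-host-1", 9999);
    JavaDStream<String> s2 = jssc.socketTextStream("source-host-2", 9999);
    JavaDStream<String> all = s1.union(s2); // downstream processing sees one stream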

Spark streaming alerting

2015-03-21 Thread Mohit Anchlia
Is there a module in spark streaming that lets you listen to the alerts/conditions as they happen in the streaming module? Generally spark streaming components will execute on a large set of clusters, like hdfs or Cassandra; however, when it comes to alerting you generally can't send it directly from t

Re: Load balancing

2015-03-22 Thread Mohit Anchlia
incoming rate, they > may vary. > > I do not know exactly what the life cycle of the receiver is, but I don't > think something actually happens when you create the DStream. My guess would be > that the receiver is allocated when you call > StreamingContext#startStreams(), >

Re: Spark streaming alerting

2015-03-23 Thread Mohit Anchlia
h which you could do: > > val data = ssc.textFileStream("sigmoid/") > val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd => > alert("Errors :" + rdd.count())) > > And the alert() function could be anything triggering an email or

akka.version error

2015-03-24 Thread Mohit Anchlia
I am facing the same issue as listed here: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-td5615.html Solution mentioned is here: https://gist.github.com/prb/d776a47bd164f704eecb However, I don't understand a few things: 1) Why are jars being split

WordCount example

2015-03-26 Thread Mohit Anchlia
I am trying to run the word count example but for some reason it's not working as expected. I start the "nc" server on port and then submit the spark job to the cluster. The Spark job gets submitted successfully but I never see any connection from spark getting established. I also tried to type words

Re: WordCount example

2015-03-26 Thread Mohit Anchlia
What's the best way to troubleshoot inside spark to see why Spark is not connecting to nc on port ? I don't see any errors either. On Thu, Mar 26, 2015 at 2:38 PM, Mohit Anchlia wrote: > I am trying to run the word count example but for some reason it's not > working

Re: WordCount example

2015-03-27 Thread Mohit Anchlia
or netstat to make sure Spark executor backend > is connected to the nc process. Also grep the executor's log to see if > there's log like "Connecting to " and "Connected to > " which shows that receiver is correctly connected to nc process. > > Thanks

Re: WordCount example

2015-03-30 Thread Mohit Anchlia
I tried to file a bug in the git repo; however, I don't see a link to "open issues". On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia wrote: > I checked the ports using netstat and don't see any connections > established on that port. Logs show only this: > > 15/03/27 1

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
t; 15/03/27 13:50:53 IN > > > > On Thu, Mar 26, 2015 at 6:50 PM, Saisai Shao > wrote: > >> Hi, >> >> Did you run the word count example in Spark local mode or other mode, in >> local mode you have to set Local[n], where n >=2. For other mode, make sure

Re: WordCount example

2015-04-03 Thread Mohit Anchlia
any cores are present in the works allocated to the standalone > cluster spark://ip-10-241-251-232:7077 ? > > > On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia > wrote: > >> If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this >> seems to work. I don'

Re: WordCount example

2015-04-06 Thread Mohit Anchlia
Interesting, I see 0 cores in the UI: *Cores:* 0 Total, 0 Used On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das wrote: > What does the Spark Standalone UI at port 8080 say about number of cores? > > On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia > wrote: > >> [ec2-u

start-slave.sh not starting

2015-04-08 Thread Mohit Anchlia
I am trying to start the worker by: sbin/start-slave.sh spark://ip-10-241-251-232:7077 In the logs it's complaining about: Master must be a URL of the form spark://hostname:port I also have this in spark-defaults.conf spark.master spark://ip-10-241-251-232:7077 Did I miss
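
One thing worth checking, as an assumption about the 1.x scripts rather than anything confirmed in the thread: in Spark releases of that era, start-slave.sh expected a worker number before the master URL, e.g.:

    sbin/start-slave.sh 1 spark://ip-10-241-251-232:7077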

Class incompatible error

2015-04-08 Thread Mohit Anchlia
I am seeing the following, is this because of my maven version? 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-241-251-232.us-west-2.compute.internal): java.io.InvalidClassException: org.apache.spark.Aggregator; local class incompatible: stream classdesc serialVers

Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
Finally got it working by increasing the spark version in maven to 1.2.1 On Thu, Apr 9, 2015 at 12:30 PM, Mohit Anchlia wrote: > I changed the JDK to Oracle but I still get this error. Not sure what it > means by "Stream class is incompatible with local class". I am using the &

Re: Class incompatible error

2015-04-09 Thread Mohit Anchlia
> I don't have experience with mixed JDK's. > Can you try with using single JDK ? > > Cheers > > On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia > wrote: > >> For the build I am using java version "1.7.0_65" which seems to be the >> same as

Re: Class incompatible error

2015-04-08 Thread Mohit Anchlia
> Cheers > > On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia > wrote: > >> I am seeing the following, is this because of my maven version? >> >> 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, >> ip-10-241-251-232.us-wes