Re: automatic start of streaming job on failure on YARN

2015-10-02 Thread Ashish Rangole
Are you running the job in yarn-cluster mode? On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" wrote: > We have a streaming application running on YARN and we would like to ensure > that it is up and running 24/7. > > Is there a way to tell YARN to automatically restart a specific > application on failure?
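For reference, in yarn-cluster mode YARN can re-attempt a failed application master. A minimal sketch of a submit command, assuming a Spark version that supports the spark.yarn.maxAppAttempts setting (the class and jar names are placeholders):

    spark-submit \
      --master yarn-cluster \
      --conf spark.yarn.maxAppAttempts=4 \
      --class com.example.StreamingApp \
      my-streaming-app.jar

Note that the effective number of attempts is also capped by YARN's own yarn.resourcemanager.am.max-attempts setting.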

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ashish Rangole
I suggest taking a heap dump of the driver process using jmap, then opening that dump in a tool like VisualVM to see which object(s) are taking up heap space. It is easy to do. We did this and found out that in our case it was the data structure that stores info about stages, jobs and tasks. There can be
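A sketch of the jmap invocation described above (the process id and output file name are placeholders):

    jmap -dump:live,format=b,file=driver-heap.hprof <driver-pid>

The resulting .hprof file can then be opened in VisualVM or a similar heap-analysis tool to see which objects dominate the heap.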

Re: Worker Machine running out of disk for Long running Streaming process

2015-08-22 Thread Ashish Rangole
Interesting. TD, can you please throw some light on why this is and point to the relevant code in the Spark repo? It will help in better understanding the things that can affect a long-running streaming job. On Aug 21, 2015 1:44 PM, "Tathagata Das" wrote: > Could you periodically (say every 10 mins

Re: Query REST web service with Spark?

2015-03-31 Thread Ashish Rangole
All you need is a client to the target REST service in your Spark task. It could be as simple as an HttpClient. Most likely that client won't be serializable, in which case you initialize it lazily. There are useful examples in the Spark knowledge base gitbook that you can look at. On Mar 31, 2015 1:48 P
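A minimal Scala sketch of that lazy-initialization pattern, assuming Apache HttpClient 4.x is on the classpath; inputRdd, the endpoint URL and the lookup scheme are placeholders:

    import org.apache.http.client.methods.HttpGet
    import org.apache.http.impl.client.HttpClients
    import org.apache.http.util.EntityUtils

    val results = inputRdd.mapPartitions { ids =>
      // The client is created on the executor, once per partition, so the
      // non-serializable client object never travels through the task closure.
      val client = HttpClients.createDefault()
      ids.map { id =>
        val response = client.execute(
          new HttpGet("http://rest-service.example.com/lookup/" + id))
        try {
          (id, EntityUtils.toString(response.getEntity))
        } finally {
          response.close()
        }
      }
    }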

Re: randomSplit instead of a huge map & reduce ?

2015-02-20 Thread Ashish Rangole
Is there a check you can put in place to avoid creating pairs that aren't in your set of 20M pairs? Additionally, once you have your arrays converted to pairs, you can do aggregateByKey with each pair being the key. On Feb 20, 2015 1:57 PM, "shlomib" wrote: > Hi, > > I am new to Spark and I think I mi
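A rough Scala illustration of the aggregateByKey suggestion; pairsRdd and the aggregation itself (a simple count per pair) are placeholders for whatever the real job computes:

    // pairsRdd: RDD[((String, String), ...)] built from the converted arrays (hypothetical)
    val countsPerPair = pairsRdd
      .map(pair => (pair, 1L))
      .aggregateByKey(0L)(
        (acc, v) => acc + v,   // fold a value into the per-partition accumulator
        (a, b)   => a + b      // merge accumulators across partitions
      )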

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Ashish Rangole
By default, the files will be created under the path provided as the argument to saveAsTextFile. This argument is treated as a folder in the bucket, and the actual files are created in it with the naming convention part-n, where n is the output partition number. On Mon, Jan 26, 2015 at 9
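A small sketch of what that looks like; the bucket and folder names are placeholders:

    // The path is treated as a folder; Spark writes one file per output partition under it.
    rdd.saveAsTextFile("s3n://your-bucket-name/output-folder")

    // Resulting layout, roughly:
    //   s3n://your-bucket-name/output-folder/part-00000
    //   s3n://your-bucket-name/output-folder/part-00001
    //   ...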

Re: Why so many tasks?

2014-12-16 Thread Ashish Rangole
Take a look at CombineFileInputFormat. Repartition or coalesce could introduce shuffle I/O overhead. On Dec 16, 2014 7:09 AM, "bethesda" wrote: > Thank you! I had known about the small-files problem in HDFS but didn't > realize that it affected sc.textFile().
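A sketch of reading many small files through CombineTextInputFormat, assuming Hadoop 2.x classes on the classpath; the input path and split size are placeholders:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // Cap each combined split at ~128 MB so many small files are packed
    // into far fewer tasks than one task per file.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize", (128L * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/many-small-files/*",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map { case (_, text) => text.toString }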

Re: spark-submit on YARN is slow

2014-12-05 Thread Ashish Rangole
Likely this is not the case here, yet one thing to point out with YARN parameters like --num-executors is that they should be specified *before* the app jar and app args on the spark-submit command line; otherwise the app only gets the default number of containers, which is 2. On Dec 5, 2014 12:22 PM, "Sandy Ryz
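For illustration, the intended ordering, with placeholder class, jar and argument names:

    spark-submit \
      --master yarn-cluster \
      --num-executors 10 \
      --executor-memory 4g \
      --class com.example.MyApp \
      my-app.jar \
      appArg1 appArg2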

Re: Loading RDDs in a streaming fashion

2014-12-02 Thread Ashish Rangole
This is a common use case, and it is how the Hadoop APIs for reading data work: they return an Iterator[YourRecord] instead of reading every record in at once. On Dec 1, 2014 9:43 PM, "Andy Twigg" wrote: > You may be able to construct RDDs directly from an iterator - not sure > - you may have to s

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ashish Rangole
This being a very broad topic, a discussion can quickly get subjective. I'll try not to deviate from my experiences and observations, to keep this thread useful to those looking for answers. I have used Hadoop MR (with Hive, the MR Java APIs, Cascading and Scalding) as well as Spark (since v0.6) in Sc

Re: Spark S3 Performance

2014-11-22 Thread Ashish Rangole
What makes you think that each executor is reading the whole file? If that were the case, the count value returned to the driver would be the actual count multiplied by the number of executors. Is that what you see when you compare against the actual number of lines in the input file? If the count returned is the same as the actual count then you probably don't hav

Re: How to read a multipart s3 file?

2014-08-07 Thread Ashish Rangole
Specify a folder instead of a file name for your input and output paths, as in: Output: s3n://your-bucket-name/your-data-folder Input (when consuming the above output): s3n://your-bucket-name/your-data-folder/* On May 6, 2014 5:19 PM, "kamatsuoka" wrote: > I have a Spark app that writes out a file,
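A short Scala sketch of reading such multipart output back; the bucket and folder names are placeholders:

    // Glob over the part files that saveAsTextFile wrote into the folder.
    val data = sc.textFile("s3n://your-bucket-name/your-data-folder/*")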

Re: Problem in Spark Streaming

2014-06-10 Thread Ashish Rangole
Have you considered the garbage collection impact and whether it coincides with your latency spikes? You can enable GC logging by changing the Spark configuration for your job. Hi, when I search for the keyword "Total delay" in the console log, the delay keeps increasing. I am not sure what this "total del
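One way to enable executor GC logging at submit time, as a sketch; the flags shown are the standard pre-Java-9 JVM GC-logging options, so adjust for your environment:

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      <rest of your usual submit arguments>

The GC output then appears in each executor's stdout/stderr log, where it can be lined up against the streaming batch delays.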