Are you running the job in yarn cluster mode?
On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" wrote:
> We have a streaming application running on YARN and we would like to ensure
> that it is up and running 24/7.
>
> Is there a way to tell yarn to automatically restart a specific
> application on failure?
>
>
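For YARN cluster mode there is a relevant setting: spark.yarn.maxAppAttempts
controls how many times YARN will re-launch the application master on failure,
bounded by the cluster-wide yarn.resourcemanager.am.max-attempts. A minimal
sketch of a submit command (class and jar names below are hypothetical
placeholders):

  spark-submit --master yarn-cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --class com.example.StreamingApp \
    streaming-app.jar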
I suggest taking a heap dump of the driver process using jmap. Then open that
dump in a tool like VisualVM to see which object(s) are taking up heap
space. It is easy to do. We did this and found out that in our case it was
the data structure that stores info about stages, jobs and tasks. There can
be
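For reference, a sketch of the jmap step mentioned above (the PID is a
placeholder for your driver process id):

  jmap -dump:live,format=b,file=driver-heap.hprof <driver-pid>

Then open driver-heap.hprof in VisualVM (or Eclipse MAT) and look at which
classes dominate the heap.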
Interesting. TD, can you please shed some light on why this is and point
to the relevant code in the Spark repo? It would help in better understanding
the things that can affect a long-running streaming job.
On Aug 21, 2015 1:44 PM, "Tathagata Das" wrote:
> Could you periodically (say every 10 mins
All you need is a client to the target REST service in your Spark task. It
could be as simple as an HttpClient. Most likely that client won't be
serializable, in which case you should initialize it lazily. There are useful
examples in the Spark knowledge base gitbook that you can look at.
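A minimal sketch of the lazy-initialization pattern, using plain java.net so
there is no extra dependency (the endpoint and RDD names are hypothetical;
with a real client library you would hold the client in a lazy val inside the
object in the same way):

  import java.net.{HttpURLConnection, URL}

  // Lives in an object, so it is initialized lazily on each executor JVM
  // and never has to be serialized with the task closure.
  object RestClient {
    private val endpoint = "http://example.com/api"   // hypothetical endpoint
    def post(payload: String): Int = {
      val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      out.write(payload.getBytes("UTF-8"))
      out.close()
      val status = conn.getResponseCode   // fire the request
      conn.disconnect()
      status
    }
  }

  records.foreachPartition { iter =>
    iter.foreach(r => RestClient.post(r.toString))
  }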
On Mar 31, 2015 1:48 P
Is there a check you can put in place to avoid creating pairs that aren't in
your set of 20M pairs? Additionally, once you have your arrays converted to
pairs, you can do aggregateByKey with each pair being the key.
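A rough sketch of that combination (all names hypothetical; validPairs is the
20M-pair whitelist, broadcast so the filter can check membership cheaply, and
the values here are simply summed per key):

  // Broadcast the allowed pairs once instead of shipping them with every task.
  val allowed = sc.broadcast(validPairs)          // validPairs: Set[(String, String)]

  val counts = pairRdd                            // RDD[((String, String), Int)]
    .filter { case (pair, _) => allowed.value.contains(pair) }
    .aggregateByKey(0)(_ + _, _ + _)              // per-pair sum within and across partitions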
On Feb 20, 2015 1:57 PM, "shlomib" wrote:
> Hi,
>
> I am new to Spark and I think I mi
By default, the files will be created under the path provided as the
argument to saveAsTextFile. This argument is treated as a folder in the
bucket, and the actual files are created in it with the naming convention
part-n, where n is the output partition number.
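For example (bucket and prefix names are hypothetical):

  rdd.saveAsTextFile("s3n://your-bucket/output")
  // creates something like:
  //   s3n://your-bucket/output/part-00000
  //   s3n://your-bucket/output/part-00001
  //   ... one file per output partition, plus a _SUCCESS marker on success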
On Mon, Jan 26, 2015 at 9
Take a look at CombineFileInputFormat. Repartition or coalesce could
introduce shuffle I/O overhead.
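A rough sketch with the new Hadoop API, assuming the small files are plain
text (the path and split size are placeholders):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

  // Pack many small files into fewer, larger splits (~64 MB here), so you
  // don't get one tiny partition per file and don't need a shuffle to fix it.
  sc.hadoopConfiguration.set(
    "mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

  val lines = sc.newAPIHadoopFile(
      "hdfs:///data/small-files",
      classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text])
    .map { case (_, text) => text.toString }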
On Dec 16, 2014 7:09 AM, "bethesda" wrote:
> Thank you! I had known about the small-files problem in HDFS but didn't
> realize that it affected sc.textFile().
Likely this is not the case here, but one thing to point out with YARN
parameters like --num-executors is that they should be specified *before* the
app jar and app args on the spark-submit command line; otherwise the app only
gets the default number of containers, which is 2.
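In other words (app names hypothetical), this picks up --num-executors:

  spark-submit --master yarn --num-executors 10 --class com.example.MyApp my-app.jar arg1 arg2

while putting --num-executors 10 after my-app.jar would just pass it through
to the application as ordinary arguments.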
On Dec 5, 2014 12:22 PM, "Sandy Ryz
This is a common use case, and it is how the Hadoop APIs for reading data
work: they return an Iterator[YourRecord] instead of reading every record
in at once.
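The same iterator-to-iterator idea is what you get inside mapPartitions; a
minimal sketch (parseRecord is a hypothetical parser):

  val parsed = rdd.mapPartitions { records =>
    // records is an Iterator over the partition's elements; map over it
    // lazily instead of materializing the whole partition in memory.
    records.map(parseRecord)
  }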
On Dec 1, 2014 9:43 PM, "Andy Twigg" wrote:
> You may be able to construct RDDs directly from an iterator - not sure
> - you may have to s
This being a very broad topic, a discussion can quickly get subjective.
I'll try not to deviate from my experiences and observations, to keep this
thread useful to those looking for answers.
I have used Hadoop MR (with Hive, the MR Java APIs, Cascading and Scalding) as
well as Spark (since v0.6) in Sc
What makes you think that each executor is reading the whole file? If that
were the case, then the count value returned to the driver would be actual ×
numOfExecutors. Is that what you see when compared with the actual number of
lines in the input file? If the count returned is the same as the actual one,
then you probably don't hav
Specify a folder instead of a file name for your input and output paths, as in:
Output:
s3n://your-bucket-name/your-data-folder
Input: (when consuming the above output)
s3n://your-bucket-name/your-data-folder/*
On May 6, 2014 5:19 PM, "kamatsuoka" wrote:
> I have a Spark app that writes out a file,
Have you considered the garbage collection impact, and whether it coincides
with your latency spikes? You can enable GC logging by changing the Spark
configuration for your job.
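For example, GC logging on the executors can be turned on with something like
the following (the flags are the usual JDK 7/8 ones; adjust to taste), and the
output then shows up in each executor's logs:

  spark-submit ... \
    --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"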
Hi, when I search for the keyword "Total delay" in the console log, the delay
keeps increasing. I am not sure what this "total del