... end and having too few in the beginning. By coalescing as late as
possible and avoiding too few partitions in the beginning, the problems seem
to decrease. Also, after increasing spark.akka.askTimeout and
spark.core.connection.ack.wait.timeout significantly (~700 secs), the
problems seem to almost disappear. Don't want to celebrate yet, still long
... but it's looking better...
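For reference, a minimal sketch (in Scala, against the Spark 1.x APIs discussed in this thread) of what the tuning above could look like; the app name, paths, and partition counts are made-up values for illustration only:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits for pre-1.3 builds

// Raise the Akka ask timeout and the connection ack wait timeout well above
// their defaults; for these 1.x settings the values are in seconds.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")                       // hypothetical app name
  .set("spark.akka.askTimeout", "700")
  .set("spark.core.connection.ack.wait.timeout", "700")

val sc = new SparkContext(conf)

// Keep many partitions through the shuffle-heavy stages and only coalesce
// as late as possible, right before writing the output.
val counts = sc.textFile("hdfs:///input/path")           // hypothetical path
  .map(line => (line.split('\t')(0), 1L))
  .reduceByKey(_ + _, 2000)                              // plenty of shuffle partitions

counts
  .coalesce(200)                                         // coalesce late, not early
  .saveAsTextFile("hdfs:///output/path")                 // hypothetical path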
On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet wrote:
I'm looking @ my yarn container logs for some of the executors which appear
to be failing (with the missing shuffle files). I see exceptions that say
"client.TransportClientFactor: Found inactive connection to host/ip:port,
closing it."
Right after that I see "shuffle.
No, unfortunately we're not making use of dynamic allocation or the
external shuffle service. Hoping that we could reconfigure our cluster to
make use of it, but since it requires changes to the cluster itself (and
not just the Spark app), it could take some time. Unsure if task 450 was
acting as ...
Do you guys have dynamic allocation turned on for YARN?
Anders, was Task 450 in your job acting like a Reducer and fetching the Map
spill output data from a different node?
If a Reducer task can't read the remote data it needs, that could cause the
stage to fail. Sometimes this forces the previous ...
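For reference, a rough sketch of what "dynamic allocation turned on for YARN" involves in Spark 1.2-era configuration; the executor bounds are made-up values, and note that it depends on the external shuffle service suggested in the next message:

import org.apache.spark.SparkConf

// Illustrative settings only. Dynamic allocation lets YARN grow and shrink the
// executor set, and it requires the external shuffle service so that shuffle
// output outlives any executor that gets removed.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "10")    // made-up bound
  .set("spark.dynamicAllocation.maxExecutors", "200")   // made-up bound
  .set("spark.shuffle.service.enabled", "true")         // required companion setting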
Could you try to turn on the external shuffle service?
spark.shuffle.service.enabled = true
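Enabling it is a two-part change, which is why it touches the cluster and not just the app; a sketch of the usual Spark-on-YARN setup follows (standard property names from the Spark 1.2+ docs, cluster details assumed):

import org.apache.spark.SparkConf

// Application side: have executors serve shuffle files via the external
// shuffle service instead of from the executor process itself.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")

// Cluster side (the part that needs a cluster change, not just an app change):
// each YARN NodeManager must run Spark's shuffle service as an aux service,
// roughly via yarn-site.xml:
//   yarn.nodemanager.aux-services                      = mapreduce_shuffle,spark_shuffle
//   yarn.nodemanager.aux-services.spark_shuffle.class  = org.apache.spark.network.yarn.YarnShuffleService
// with the spark-<version>-yarn-shuffle jar on the NodeManager classpath,
// followed by a NodeManager restart.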
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. Thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory
allocated for the application. I was thinking perhaps it was possible that
a s...
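For readers unfamiliar with the storage level named above, a minimal sketch of using MEMORY_AND_DISK_SER (the input path and transformation are hypothetical): partitions are cached as serialized bytes in memory and spilled to local disk when they don't fit, rather than being dropped and recomputed.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Hypothetical pipeline; the point is only the storage level.
def cacheIntermediate(sc: SparkContext): Unit = {
  val intermediate = sc.textFile("hdfs:///some/large/input")   // made-up path
    .map(_.toLowerCase)

  // Serialized in memory, spilling to disk when memory runs out.
  intermediate.persist(StorageLevel.MEMORY_AND_DISK_SER)

  println(intermediate.count())   // first action materializes the cached blocks
}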
For large jobs, the following error message is shown, which seems to
indicate that shuffle files are missing for some reason. It's a rather
large job with many partitions. If the data size is reduced, the problem
disappears. I'm running a build from Spark master post 1.2 (built on
2015-01-16) and run...