Also, sorry, to clarify the job context:
- This is a job running across 5 nodes on AWS Linux.
- It is under load with a large number of partitions: approximately 700-800
topic-partition assignments in total for the entire job. The topics involved
have a large number of partitions (128 each).
- 32 stream threads.
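For context, a minimal sketch of a Streams configuration matching the setup
described above (the application id and bootstrap servers are hypothetical;
only the thread count and standby setting come from this thread):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-job");   // hypothetical
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // hypothetical
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 32);             // 32 stream threads
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);            // standby tasks are in play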
That's great news, thanks!
On Thu, Jul 6, 2017 at 6:18 AM, Damian Guy wrote:
> Hi Greg,
> I've been able to reproduce it by running multiple instances with standby
> tasks and many threads. If I force some rebalances, then I see the failure.
> Now to see if I can repro it in a test.
> I think it is probably the same issue as:
> https://issues.apache.org/jira/browse/KAFKA-5070
Hi Greg,
I've been able to reproduce it by running multiple instances with standby
tasks and many threads. If I force some rebalances, then I see the failure.
Now to see if I can repro it in a test.
I think it is probably the same issue as:
https://issues.apache.org/jira/browse/KAFKA-5070
On Thu, 6 Jul 2017 […]
Greg, what OS are you running on?
Are you able to reproduce this in a test at all?
For instance, based on what you described, it would seem that I should be
able to start a streams app, wait for it to be up and running, run the
state dir cleanup, and see it fail. However, I can't reproduce it.
On Wed, […]
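One way to attempt the repro Damian describes is to shrink
state.cleanup.delay.ms so the cleaner fires shortly after startup, then
watch for flush failures. A minimal sketch, using the current DSL for
brevity (the topic name and ids are hypothetical; the table materializes a
state store so there is something to clean up):

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cleanup-repro");      // hypothetical
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // hypothetical
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);
    props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, 30_000L);      // fire the cleaner early

    StreamsBuilder builder = new StreamsBuilder();
    builder.table("input-topic");                 // hypothetical topic; backed by a state store
    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();                              // wait for RUNNING, then wait out the delay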
Thanks Greg. I'll look into it more tomorrow. Just finding it difficult to
reproduce in a test.
Thanks for providing the sequence; it gives me something to try and repro.
Appreciated.
Thanks,
Damian
On Wed, 5 Jul 2017 at 19:57, Greg Fodor wrote:
> Also, the sequence of events is:
>
> - Job starts, rebalance happens, things run along smoothly. […]
Also, the sequence of events is:
- Job starts, rebalance happens, things run along smoothly.
- After about 10 minutes (retrospectively), the cleanup task kicks in and
removes some directories.
- Tasks immediately start failing when trying to flush their state stores
On Wed, Jul 5, 2017 at 11:55 AM, Greg Fodor wrote: […]
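Worth noting: the 10-minute mark lines up with the default
state.cleanup.delay.ms of 600000 ms, so this is consistent with the
background cleaner making its first pass:

    // Default value: the cleaner considers a state dir removable 10 minutes
    // after it last appears to be in use.
    props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, 600_000L);  // 10 minutes (the default)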
The issue I am hitting is not the directory locking issues we've seen in
the past. The issue seems to be, as you mentioned, that the state dir is
getting deleted by the store cleanup process, but there are still tasks
running that are trying to flush the state store. It seems more than a
little scary […]
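To illustrate the failure mode outside of Streams, a plain RocksDB sketch
(my own illustration, not Kafka's code): delete a store's directory out
from under an open instance and the next flush fails with a
RocksDBException, much like the stack traces in this thread:

    import java.nio.file.*;
    import java.util.Comparator;
    import java.util.stream.Stream;
    import org.rocksdb.*;

    public class FlushAfterDelete {
        public static void main(String[] args) throws Exception {
            RocksDB.loadLibrary();
            Path dir = Files.createTempDirectory("task-0_10");  // stand-in for a task state dir
            try (Options opts = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(opts, dir.toString())) {
                db.put("key".getBytes(), "value".getBytes());
                // Simulate the cleaner removing the "obsolete" directory while the task runs.
                try (Stream<Path> paths = Files.walk(dir)) {
                    paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
                }
                // The flush tries to write new files into the deleted directory and fails.
                try (FlushOptions fo = new FlushOptions().setWaitForFlush(true)) {
                    db.flush(fo);  // throws RocksDBException
                }
            }
        }
    }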
BTW, I'm trying to reproduce it, but not having much luck so far...
On Wed, 5 Jul 2017 at 09:27 Damian Guy wrote:
> Thanks for the updates Greg. There were some minor changes around this in
> 0.11.0 to make it less likely to happen, but we've only ever seen the
> locking fail in the event of a rebalance. […]
Thanks for the updates Greg. There were some minor changes around this in
0.11.0 to make it less likely to happen, but we've only ever seen the
locking fail in the event of a rebalance. When everything is running, state
dirs shouldn't be deleted while they are in use, as the lock will fail.
On Wed, […]
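A simplified sketch of the lock-then-delete pattern being described here
(my own illustration, not the actual StateDirectory code): the cleaner only
removes a task directory after winning that directory's lock, so a
directory still held by a live task is skipped:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.channels.OverlappingFileLockException;
    import java.nio.file.*;
    import java.util.Comparator;
    import java.util.stream.Stream;

    // Delete a task's state directory only if its lock can be acquired first.
    static void cleanIfUnused(Path taskDir) throws IOException {
        try (FileChannel channel = FileChannel.open(taskDir.resolve(".lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock;
            try {
                lock = channel.tryLock();    // null if another process holds the lock
            } catch (OverlappingFileLockException e) {
                return;                      // a thread in this JVM holds it: in use
            }
            if (lock == null) return;        // in use by another process: skip deletion
            try (Stream<Path> paths = Files.walk(taskDir)) {
                // Children before parents, so the directory itself goes last.
                paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
            }
        }                                    // closing the channel releases the lock
    }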
I can report that setting state.cleanup.delay.ms to a very large value
(effectively disabling it) works around the issue. It seems that the state
store cleanup process can somehow get out ahead of another task that still
thinks it should be writing to and flushing the state store. In my test
runs, […]
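In code, that workaround is a one-liner (same props object as in the
sketches above):

    // Push the cleanup delay out far enough that the background cleaner never runs.
    props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, Long.MAX_VALUE);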
Upon another run, I see the same error occur during a rebalance, so either
my log was showing a rebalance or there is a shared underlying issue with
state stores.
On Tue, Jul 4, 2017 at 11:35 AM, Greg Fodor wrote:
> Also, I am on 0.10.2.1, so poll interval was already set to MAX_VALUE.
>
> On Tue, […]
Also, I am on 0.10.2.1, so poll interval was already set to MAX_VALUE.
On Tue, Jul 4, 2017 at 11:28 AM, Greg Fodor wrote:
> I've nuked the nodes this happened on, but the job had been running for
> about 5-10 minutes across 5 nodes before this happened. Does the log show a
> rebalance was happening? […]
I've nuked the nodes this happened on, but the job had been running for
about 5-10 minutes across 5 nodes before this happened. Does the log show a
rebalance was happening? It looks to me like the standby task was just
committing as part of normal operations.
On Tue, Jul 4, 2017 at 7:40 AM, Damian Guy wrote: […]
Hi Greg,
Obviously it's a bit difficult to read the RocksDBException, but my guess
is that it's because the state directory gets deleted right before the
flush happens:
2017-07-04 10:54:46,829 [myid:] - INFO [StreamThread-21:StateDirectory@213]
- Deleting obsolete state directory 0_10 for task 0_10
Yes, I […]