OK, I have two test servers, both running RH and pretty well specced. One of 
them has two problems and the other has none. The configuration is the same 
except for the seed and listen address, which are each other's opposites. 
Nothing fancy. RF=2.

All the info I could gather, plus some extras such as the conf (590 rows), is 
here:
http://pastie.org/1131106

Problem no. 1, and the most annoying one.
I begin by emptying the data folder and the commitlog folder, then start the 
servers.

I write data to both nodes, this time at CL.ONE, but it happens at CL.ALL as 
well. The troubled node does not flush its memtable to disk. As soon as it is 
time to flush, it starts to GC and keeps doing that for a long time; the flush 
is enqueued but never written, and the node is unresponsive during the GC 
storms. The other node works just as expected: it takes the memtable and writes 
it to disk in a matter of seconds. This is not a lot of data, and there are no 
reads.

Log from the troubled node:
------------------------------------------
 INFO 10:42:26,842 GC for ParNew: 808 ms, 106688440 reclaimed leaving 7273866048 used; max is 17388929024
 INFO 10:42:31,613 GC for ParNew: 882 ms, 120705376 reclaimed leaving 7292752352 used; max is 17388929024
 INFO 10:42:32,615 GC for ParNew: 621 ms, 108181664 reclaimed leaving 7324162368 used; max is 17388929024
 INFO 10:42:35,468 GC for ParNew: 732 ms, 107646952 reclaimed leaving 7407855104 used; max is 17388929024
 INFO 10:42:36,540 GC for ParNew: 556 ms, 106819200 reclaimed leaving 7440627584 used; max is 17388929024
 INFO 10:42:38,348 GC for ParNew: 676 ms, 111891904 reclaimed leaving 7490450648 used; max is 17388929024
 INFO 10:42:39,413 GC for ParNew: 768 ms, 110205856 reclaimed leaving 7519836472 used; max is 17388929024
 INFO 10:42:40,671 GC for ParNew: 755 ms, 112034384 reclaimed leaving 7547393768 used; max is 17388929024
 INFO 10:42:41,884 GC for ParNew: 834 ms, 108972528 reclaimed leaving 7578012920 used; max is 17388929024
 INFO 10:42:43,102 GC for ParNew: 971 ms, 110778800 reclaimed leaving 7606825800 used; max is 17388929024
 INFO 10:42:44,391 GC for ParNew: 1076 ms, 109996232 reclaimed leaving 7636421248 used; max is 17388929024
 ------------------------------------------
I had trouble copying and pasting all of the data, since I run the server 
remotely through PuTTY.
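
For anyone who wants to put numbers on the GC lines above, here is a rough 
Python sketch that totals the ParNew pauses and the growth in used heap. The 
function name and the returned keys are made up for illustration and are not 
part of Cassandra; it only assumes the log format shown above, and it treats 
the logged timestamps as an approximation of when each pause happened.

```python
import re

# One ParNew entry as it appears in the log above. The \s+ after "leaving"
# tolerates the line wrap added by the mail client.
GC_ENTRY = re.compile(
    r"INFO\s+(\d+):(\d+):(\d+),(\d+)\s+GC for ParNew:\s+(\d+) ms,\s+"
    r"(\d+) reclaimed leaving\s+(\d+) used; max is\s+(\d+)"
)


def summarize_gc(log_text):
    """Return rough totals for the ParNew entries found in log_text.

    Returns None when fewer than two entries are found, since a time
    window cannot be computed from a single timestamp.
    """
    rows = []
    for m in GC_ENTRY.finditer(log_text):
        h, mi, s, ms, pause, reclaimed, used, heap_max = map(int, m.groups())
        stamp_ms = ((h * 60 + mi) * 60 + s) * 1000 + ms
        rows.append((stamp_ms, pause, reclaimed, used, heap_max))
    if len(rows) < 2:
        return None

    window_ms = rows[-1][0] - rows[0][0]
    pause_ms = sum(r[1] for r in rows)
    return {
        "entries": len(rows),
        "window_ms": window_ms,
        "pause_ms": pause_ms,
        # Approximate share of the window spent paused in ParNew.
        "pause_share": pause_ms / window_ms if window_ms else None,
        # How much the used heap grew between the first and last entry.
        "used_growth_bytes": rows[-1][3] - rows[0][3],
        # Used heap as a fraction of the reported max at the last entry.
        "heap_used_share": rows[-1][3] / rows[-1][4],
    }
```

Feeding it the pasted log text as one string is enough; a high pause share 
together with steadily growing used heap would match what I see on the node.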

Ring
Address       Status     Load          Range                                      Ring
                                       142713423890871059377105093567732377974
x.x.x.211     Up         486 bytes     45911723912241754468195357739525604647     |<--|
x.x.x.209     Up         501.23 MB     142713423890871059377105093567732377974    |-->|

tpstats from the node that won't wake up from this state.

While it is doing the ParNew collections:

Pool Name                    Active   Pending      Completed
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0        1003801
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MISCELLANEOUS-POOL                0         0              0
GMFD                              0         0           1047
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE               32    183026        1035233
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             1         2              1
FLUSH-WRITER-POOL                 1         2              1
AE-SERVICE-STAGE                  0         0              0
HINTED-HANDOFF-POOL               0         0              2

When it is done with the ParNew collections:

Pool Name                    Active   Pending      Completed
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0        1003801
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MISCELLANEOUS-POOL                0         0              0
GMFD                              0         0          17617
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         0        1218212
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             1         2              2
FLUSH-WRITER-POOL                 1         2              2
AE-SERVICE-STAGE                  0         0              0
HINTED-HANDOFF-POOL               1         1              3
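
To compare the two tpstats snapshots above without eyeballing them, a small 
Python sketch like the following could be used. The function names are 
invented for illustration; the only assumption is the four-column layout 
shown above (pool name, Active, Pending, Completed).

```python
def parse_tpstats(text):
    """Parse pasted tpstats output into {pool: {active, pending, completed}}."""
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have a pool name plus three integers; the header row
        # ("Pool Name Active Pending Completed") does not and is skipped.
        if len(parts) == 4 and all(p.isdigit() for p in parts[1:]):
            active, pending, completed = (int(p) for p in parts[1:])
            pools[parts[0]] = {
                "active": active,
                "pending": pending,
                "completed": completed,
            }
    return pools


def compare_snapshots(before, after):
    """Report pending counts and completed-task deltas per pool."""
    report = {}
    for name, b in before.items():
        a = after.get(name)
        if a is None:
            continue
        report[name] = {
            "pending_before": b["pending"],
            "pending_after": a["pending"],
            "completed_delta": a["completed"] - b["completed"],
        }
    return report


def stuck_pools(report):
    """Pools that still have pending work in the later snapshot."""
    return sorted(n for n, r in report.items() if r["pending_after"] > 0)
```

Run against the two snapshots, this should single out the pools that still 
have pending work after the mutation backlog has drained.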

It is not that it is writing slowly; it is not writing at all, ever, or only 
extremely slowly. I think what it does write comes from gossip, not from 
connections to the node, and never in any real amount. It has nothing to do 
with swapping or with the 16 GB it is allowed to use. The data is much smaller 
than that, and the problem appears when the first memtable flush is supposed to 
happen. The other node starts flushing at the same moment, but it finishes and 
doesn't loop. If I restart the server, it writes the data from the commitlog to 
the data folder and then stops working as soon as it has to flush new data from 
a memtable.

The other problem with the same node is that if I use JNA, the kernel crashes 
after an out-of-memory error. It uses nearly all of the 60 GB of RAM even 
though I told the JVM to use at most 16 GB. It is unresponsive from the start, 
and the whole server locks up before we can collect much information, but we 
know it is a kernel crash caused by OOM.

If anyone has an idea about what is wrong, it would help a lot.
/Justus


AB SVENSKA SPEL
106 10 Stockholm
Sturegatan 11, Sundbyberg
Växel +46 8 757 77 00
http://svenskaspel.se
