Could it be some “fat columns” (cells with large blob or text values) rather 
than the cell-count per se? IOW, a “big row” rather than a “wide row”?

And, could it be a large partition rather than a large row (many rows in a 
single partition)? Are clustering columns being used in the primary key?

-- Jack Krupansky

From: DuyHai Doan 
Sent: Thursday, July 24, 2014 3:53 PM
To: user@cassandra.apache.org 
Subject: Re: Hot, large row

Your extract of cfhistograms shows that there are no particularly "wide rows". The 
widest has 61214 cells, which is big but not huge enough to be a real concern.


Turning on trace probability only gives you some "hints" about what kinds of 
queries are being executed; it does not give the exact partition key or other 
statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml 
so that it reduces the impact when this occurs." --> Don't do that, you'll only 
sweep the dust under the carpet. Find the real issue and fix it instead of 
changing a parameter to hide it.


One solution would be, on the client side, to activate some logging to show the 
CQL3 statements the application is issuing that may be overloading the server. I 
know that's easier said than done, but I don't have any other idea for the moment.


-------- Shameless self-promotion ------


To support this kind of live production debugging & investigation, I added a new 
dynamic query logging feature in Achilles: 
https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging


Once you hit the issue, this kind of feature may save your day...


On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kwri...@nanigans.com> wrote:

  I can see from cfhistograms that I do have some wide rows (see below).  I set 
trace probability as you suggested but the output doesn’t appear to tell me 
what row was actually read unless I missed something.  I just see executing 
prepared statement.   Any ideas how I can find the row in question?

  I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml 
so that it reduces the impact when this occurs.

  Any help in identifying my issue would be GREATLY appreciated

  Cell Count per Partition

        1 cells: 50449950
        2 cells: 14281828
        3 cells: 8093366
        4 cells: 5029200
        5 cells: 3103023
        6 cells: 3059903
        7 cells: 1903018
        8 cells: 1509297
       10 cells: 2420359
       12 cells: 1624895
       14 cells: 1171678
       17 cells: 1289391
       20 cells: 909777
       24 cells: 852081
       29 cells: 722925
       35 cells: 587067
       42 cells: 459473
       50 cells: 358744
       60 cells: 304146
       72 cells: 244682
       86 cells: 191045
      103 cells: 155337
      124 cells: 127061
      149 cells: 98913
      179 cells: 77454
      215 cells: 59849
      258 cells: 46117
      310 cells: 35321
      372 cells: 26319
      446 cells: 19379
      535 cells: 13783
      642 cells: 9993
      770 cells: 6973
      924 cells: 4713
     1109 cells: 3229
     1331 cells: 2062
     1597 cells: 1338
     1916 cells: 773
     2299 cells: 495
     2759 cells: 268
     3311 cells: 150
     3973 cells: 100
     4768 cells: 42
     5722 cells: 24
     6866 cells: 12
     8239 cells: 9
     9887 cells: 3
    11864 cells: 0
    14237 cells: 5
    17084 cells: 1
    20501 cells: 0
    24601 cells: 2
    29521 cells: 0
    35425 cells: 0
    42510 cells: 0
    51012 cells: 0
    61214 cells: 2


  From: DuyHai Doan <doanduy...@gmail.com>
  Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
  Date: Thursday, July 24, 2014 at 3:01 PM
  To: "user@cassandra.apache.org" <user@cassandra.apache.org>
  Subject: Re: Hot, large row


  "How can I detect wide rows?" -->


  nodetool cfhistograms <keyspace> <suspected column family>


  Look at the "Column count" column (the last one) and identify lines with a 
very high "Offset" value. In a well-designed application you should see a 
roughly gaussian distribution where ~80% of your rows have a similar number 
of columns.
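As an illustration, that output can also be scanned mechanically; this hypothetical helper (the 10,000-cell threshold is an arbitrary assumption, tune it to your data model) flags the histogram lines whose offsets are unusually high:

```python
import re

def find_wide_rows(cfhistograms_text, cell_threshold=10000):
    """Scan 'N cells: count' lines (as printed by nodetool cfhistograms)
    and return (cell_count, partition_count) pairs above the threshold."""
    wide = []
    for line in cfhistograms_text.splitlines():
        m = re.match(r"\s*(\d+) cells: (\d+)\s*$", line)
        if m:
            cells, partitions = int(m.group(1)), int(m.group(2))
            # only report buckets that actually contain partitions
            if cells >= cell_threshold and partitions > 0:
                wide.append((cells, partitions))
    return wide

sample = """
  9887 cells: 3
 11864 cells: 0
 61214 cells: 2
"""
print(find_wide_rows(sample))  # -> [(61214, 2)]
```

Run against the full histogram quoted below, it would single out the two 61214-cell partitions immediately.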


  "Anyone know what debug level I can set so that I can see what reads the hot 
node is handling?  " -->


  "nodetool settraceprobability <value>",  where value is a small number 
(e.g. 0.001), on the node where you encounter the issue. Activate the tracing 
for a while (5 mins) and then deactivate it (value = 0). Then look into the 
system_traces tables "events" & "sessions". It may or may not help, since at 
0.001 only about one request in a thousand is traced. 
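To size expectations for that sampling rate: roughly p × (requests during the window) sessions end up in system_traces. A quick back-of-the-envelope (the 5,000 reads/s figure is just an assumed example, substitute your own throughput):

```python
def expected_traced_sessions(requests_per_sec, window_secs, probability):
    """Rough count of sessions captured in system_traces for a sampling window."""
    return requests_per_sec * window_secs * probability

# Assumed example: 5,000 reads/s, 5-minute window, p = 0.001
print(expected_traced_sessions(5000, 5 * 60, 0.001))  # -> 1500.0
```

So even at a 1-in-1000 rate, a busy node yields plenty of traced sessions in a few minutes; the catch is that a specific hot partition may or may not land in the sample.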


  "Any way to get the server to blacklist these wide rows automatically?" --> No




  On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kwri...@nanigans.com> wrote:

    Hi all,

       We are seeing an issue where basically daily one of our nodes spikes in 
load and is churning in CMS heap pressure.  It appears that reads are backing 
up, and my guess is that our application is reading a large row repeatedly.  Our 
write structure can lend itself to wide rows very infrequently (<0.001%) and we 
do our best to detect and delete them, but obviously we’re missing a case.  
Hoping for assistance on the following questions:
      a. How can I detect wide rows? 
      b. Anyone know what debug level I can set so that I can see what reads 
the hot node is handling?  I’m hoping to see the “bad” row. 
      c. Any way to get the server to blacklist these wide rows automatically? 
    We’re using C* 2.0.6 with Vnodes.

    Thanks
