Greetings,

I am currently responsible for tuning Google's leveldb implementation for Riak. 
 I have read through most of the thread and have a couple of information 
requests.  Then I will try to address various questions and comments from the 
thread.  In general, you are filling leveldb faster than its background 
compaction (optimization) can keep up.  I am willing to work with you to figure 
out why and what can be done about it.

Questions / requests:

1.  Execute the following on one of the servers:

     sort /home/riak/leveldb/*/LOG* >log_jan.txt

     Tar/gzip the log_jan.txt and email it back.

2.  Execute the following on one of the servers:

    grep -i flags /proc/cpuinfo

    Include the output (actually just one line will do) in a reply.

3.  On a running server that is processing data, execute:

   grep -i swap /proc/meminfo

    Include the full output (3 lines) in a reply.

4.  Pick a server, then one directory in /home/riak/leveldb.  Select 3 of the 
largest *.sst files.  Tar/gzip those and email back.
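
    For example, something like this should grab the three largest *.sst files
    from one vnode directory (the directory name and sst_sample.tgz are just
    placeholders, pick whatever is convenient):

     cd /home/riak/leveldb/<one-vnode-directory>
     tar czf sst_sample.tgz $(ls -S *.sst | head -3)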


Notes about other messages on this thread:

a.  the gdb stack traces are nice!  They clearly indicate that leveldb has 
intentionally entered a "stall" state because compaction is not keeping up with 
the input stream.  Riak 1.2.1rc1 contains code that attempts to slow the write 
rate to allow the background compactions to catch up.  It is not working in 
your case.
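
    There is no single command that shows the stall directly, but if you want a 
    rough sense of how much each vnode has accumulated, counting the .sst files 
    per directory is a quick, crude check (adjust the path if your data root 
    differs):

     for d in /home/riak/leveldb/*/; do echo "$d $(ls "$d" | grep -c '\.sst$')"; done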

b.  there is a performance bug in the cache code, though it is not your main 
problem.  This is why Evan asked you to reduce the cache size from 377,487,360. 
Yes, I created the bug and will get it addressed soon.

c.  the compaction process is disk- and CPU-intensive.  The fact that your CPUs 
are not heavily loaded, yet the client/request code is stalled waiting for 
compaction to catch up, suggests the disk is thrashing / could use some help.  
Again, this is why Evan had you adjust some configuration settings there.
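
    If you want to confirm the disk is the bottleneck while load is running, 
    watching the device that holds /home/riak/leveldb with iostat (from the 
    sysstat package) is usually enough; high %util and long await times 
    alongside mostly idle CPUs point at the disk:

     iostat -x 5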

d.  your comment about using O_NOATIME is valid.  The issue is that the flag is 
relatively new.  We are supporting some really old compilers and Linux/Solaris 
versions.  It is easier to ask everyone to set noatime at the mount level than 
to have conditional code for some and mount-level tuning for others.  But your 
comment is still correct.
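
    For reference, the mount-level change is just the noatime option, either as 
    a live remount or in /etc/fstab.  The device, mount point, and filesystem 
    type below are placeholders for your layout, and the remount form assumes 
    the leveldb data sits on its own filesystem:

     mount -o remount,noatime /home/riak
     # example /etc/fstab line (placeholder values):
     /dev/sdb1  /home/riak  ext4  defaults,noatime  0  2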

e.  a non-zero-sized lost/BLOCKS.bad means data corruption.  It looks like you 
already figured that out.  Either the CRC code or the decompression code found 
an issue during compaction and moved the bad data to the side.
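
    A quick way to check every vnode at once (this assumes the lost/ directory 
    sits inside each vnode directory under /home/riak/leveldb):

     find /home/riak/leveldb -name BLOCKS.bad -size +0c -ls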

f.  max_open_files in 1.1 was a hard limit on the number of open files per 
vnode (per subdirectory in /home/riak/leveldb).  1.2 uses the number more as a 
suggestion about memory consumption per file.  A future release will drop the 
option and substitute something like "file_cache_size".  Memory is the critical 
resource, not file handles (at least for Riak … I am told Google uses this code 
in Android, so file handles might be critical there).
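
    If you want to compare the node's open-file usage against its limit, 
    something like this works (it assumes the Riak Erlang VM shows up as 
    beam.smp, which is typical):

     pid=$(pgrep -f beam.smp | head -1)
     grep 'open files' /proc/$pid/limits
     ls /proc/$pid/fd | wc -l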

What issues did I miss?

Matthew