Pretty sure your argument about indirect blocks making large files
inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces the
'indirect block' approach with extents
(http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4,
http://en.wikipedia.org/wiki/Ext4#Features).
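
For anyone curious, you can check whether a given file is extent-mapped;
lsattr from e2fsprogs reports an 'e' attribute for extent-mapped files.
A quick sketch in Python (the sstable path is only a placeholder):

    # Rough check: is this file extent-mapped (ext4) rather than
    # indirect-block mapped (ext2/3)? Assumes lsattr from e2fsprogs
    # is installed; the path below is just a placeholder.
    import subprocess

    def uses_extents(path):
        # lsattr prints an attribute string such as "-----------------e--"
        # followed by the path; the 'e' attribute means extent-mapped.
        attrs = subprocess.check_output(["lsattr", path]).decode().split()[0]
        return "e" in attrs

    print(uses_extents("/var/lib/cassandra/data/ks/cf-data.db"))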

I was not aware of this difference between the file systems and it seems
to be a compelling reason to choose ext4 (over ext3) for Cassandra - at
least when using size-tiered compaction.

Dan

-----Original Message-----
From: Radim Kolar [mailto:h...@sendmail.cz] 
Sent: November-19-11 19:42
To: user@cassandra.apache.org
Subject: Re: split large sstable

On 17.11.2011 17:42, Dan Hendry wrote:
> What do you mean by ' better file offset caching'? Presumably you mean
> 'better page cache hit rate'?
The fs metadata used to find blocks in smaller files is cached better.
Large files use indirect blocks, so you need more reads to find the
correct block during a seek syscall. For example, if a large file uses
3 indirect levels, you need 3 extra disk seeks to find the correct
block.
http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/
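
To put rough numbers on that (assuming the classic ext2/3 layout: 4 KiB
blocks, 4-byte block pointers, hence 1024 pointers per indirect block,
and 12 direct pointers per inode):

    # How many indirect levels does an ext2/3 file of a given size need?
    # Assumes 4 KiB blocks, 4-byte block pointers (1024 per indirect
    # block) and the classic 12 direct pointers.
    BLOCK = 4096
    PTRS = BLOCK // 4      # 1024 pointers per indirect block
    DIRECT = 12            # direct block pointers in the inode

    def indirect_levels(size_bytes):
        blocks = -(-size_bytes // BLOCK)   # ceiling division
        if blocks <= DIRECT:
            return 0
        blocks -= DIRECT
        for level, capacity in enumerate((PTRS, PTRS**2, PTRS**3), start=1):
            if blocks <= capacity:
                return level
            blocks -= capacity
        raise ValueError("too large for ext2/3 triple indirection")

    # A multi-GB sstable (> ~4 GiB here) ends up at 3 indirect levels.
    for size in (48 * 1024, 10 * 2**20, 2**30, 10 * 2**30):
        print("%12d bytes -> %d indirect levels" % (size, indirect_levels(size)))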
Metadata caching in the OS is far worse than file caching - a single
"find /" will effectively nullify the metadata cache.

If Cassandra could use raw storage, it would eliminate fs overhead and
could be over 100% faster on reads, because fragmentation would be the
exception - no need to design an fs like FAT or UFS, where the designers
expect files to be stored in non-contiguous areas on disk. Implementing
something log-based like http://logfs.sourceforge.net/ would be enough.
Cleaning would not be needed much, because compaction would clean it
naturally.
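
A very rough sketch of the log-structured idea (all names here are made
up for illustration, this is not Cassandra code): writes only ever
append, an in-memory index maps keys to offsets, and "compaction"
rewrites live records into a fresh log, which is why a separate cleaner
is mostly unnecessary.

    # Minimal append-only log store. Keys and values must not contain
    # tabs or newlines; purely illustrative.
    import os

    class LogStore:
        def __init__(self, path):
            self.path = path
            self.index = {}            # key -> offset of latest record
            self.f = open(path, "ab+")

        def put(self, key, value):
            self.f.seek(0, os.SEEK_END)
            self.index[key] = self.f.tell()     # remember where it lands
            self.f.write(("%s\t%s\n" % (key, value)).encode())
            self.f.flush()

        def get(self, key):
            self.f.seek(self.index[key])
            line = self.f.readline().decode().rstrip("\n")
            return line.split("\t", 1)[1]

        def compact(self):
            # Rewrite only the latest version of each key to a new log;
            # the old log can then be deleted.
            new = LogStore(self.path + ".compacted")
            for key in self.index:
                new.put(key, self.get(key))
            return new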

> Perhaps what you are actually seeing is row fragmentation across your
> SSTables? Easy to check with nodetool cfhistograms (SSTables column).
I have a 1.5% hit rate to 2 sstables and 3% to 3 sstables. That's
pretty low with the min. compaction threshold set to 5; I will probably
set it to 6.
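
For reference, those numbers work out to a pretty low read
amplification:

    # Back-of-envelope from the numbers above: 95.5% of reads touch
    # 1 sstable, 1.5% touch 2 and 3% touch 3 (as reported by
    # nodetool cfhistograms).
    hist = {1: 0.955, 2: 0.015, 3: 0.03}
    expected = sum(n * frac for n, frac in hist.items())
    print("expected sstables per read: %.3f" % expected)   # 1.075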

I would really like to see tests with user-defined sizes and file
counts for tiered compaction, because it works best if you do not leave
the largest file alone in a bucket. If your data in Cassandra is not
growing, it can be fine-tuned further. I haven't done experiments with
it, but maybe a max sstable size defined per CF would be enough. Let's
say I have 5 GB of data per CF - the ideal setting would be a max
sstable size slightly less than 1 GB. Cassandra would then not keep old
data stuck in one 4 GB compacted sstable, waiting for other 4 GB
sstables to be created before compaction removes the old data.
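
Here is a toy model of the bucketing (hugely simplified compared to
what Cassandra's size-tiered strategy actually does; sizes in GB, files
are bucketed when within 2x of each other and compacted once 4
accumulate) to show why the biggest file stalls:

    # Toy size-tiered compaction: group sorted sstable sizes into
    # buckets of similar size, merge a bucket only when it has at
    # least `threshold` members.
    def compact(sstables, threshold=4):
        sstables = sorted(sstables)
        buckets, bucket = [], [sstables[0]]
        for size in sstables[1:]:
            if size <= 2 * bucket[0]:     # similar enough in size
                bucket.append(size)
            else:
                buckets.append(bucket)
                bucket = [size]
        buckets.append(bucket)
        merged = []
        for b in buckets:
            if len(b) >= threshold:
                merged.append(sum(b))     # one big compacted sstable
            else:
                merged.extend(b)          # too few peers: left alone
        return merged

    # One old 4 GB sstable plus four 1 GB flushes: the 1 GB files
    # merge, but the 4 GB file (and the old data in it) sits untouched
    # until enough 4 GB peers exist.
    print(compact([4, 1, 1, 1, 1]))       # -> [4, 4]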

> To answer your question, I know of no tools to split SSTables. If you want
> to switch compaction strategies, levelled compaction (1.0.x) creates many
> smaller sstables instead of fewer, bigger ones.
I don't use levelled compaction; it compacts too often. It might get
better if you could tune how many and how large the files at each level
are. But I will try switching to levelled compaction and back again -
it might do what I want.