Re: split large sstable

Edward Capriolo Mon, 21 Nov 2011 07:34:33 -0800

On Mon, Nov 21, 2011 at 10:07 AM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:


> Pretty sure your argument about indirect blocks making large files
> inefficient only pertains to ext2/3 and not ext4. It seems ext4 replaces
> the 'indirect block' approach with extents
> (http://kernelnewbies.org/Ext4#head-7c5fd53118e8b888345b95cc11756346be4268f4,
> http://en.wikipedia.org/wiki/Ext4#Features).
>
> I was not aware of this difference in the file systems and it seems to be a
> compelling reason ext4 should be chosen (over ext3) for Cassandra - at
> least when using size tiered compaction.
>
> Dan
>
> -----Original Message-----
> From: Radim Kolar [mailto:h...@sendmail.cz]
> Sent: November-19-11 19:42
> To: user@cassandra.apache.org
> Subject: Re: split large sstable
>
> On 17.11.2011 17:42, Dan Hendry wrote:
> > What do you mean by ' better file offset caching'? Presumably you mean
> > 'better page cache hit rate'?
> The fs metadata used to find blocks is cached better for smaller files.
> Large files use indirect blocks, so more reads are needed to find the
> correct block during a seek syscall. For example, if a large file uses 3
> indirect levels, you need 3 disk seeks to find the correct block.
>
> http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/
> Metadata caching in the OS is far worse than file caching - one "find /"
> will effectively nullify the metadata cache.
>
> If Cassandra could use raw storage, it would eliminate fs overhead and
> could be over 100% faster on reads because fragmentation would be an
> exception - no need for an fs design like FAT or UFS, where the designers
> expect files to be stored in non-contiguous areas on disk. Implementing
> something log based, like http://logfs.sourceforge.net/, would be enough.
> Cleaning would not be needed much because compaction would clean it
> naturally.
>
> > Perhaps what you are actually seeing is row fragmentation across your
> > SSTables? Easy to check with nodetool cfhistograms (SSTables column).
> For me, 1.5% of reads hit 2 sstables and 3% hit 3 sstables. That is pretty
> low with min. compaction set to 5; I will probably set it to 6.
>
> I would really like to see tests with user-defined sizes and file counts
> used for tiered compaction, because it works best if you do not leave the
> largest file alone in a bucket. If your data in Cassandra is not growing,
> it can be fine-tuned better. I haven't done experiments with it, but maybe
> a max sstable size defined per CF would be enough. Let's say I have 5 GB of
> data per CF - the ideal setting would be a max sstable size slightly less
> than 1 GB. Cassandra would then not keep old data stuck in one 4 GB
> compacted sstable, waiting for other 4 GB sstables to be created before
> compaction removes the old data.
>
> > To answer your question, I know of no tools to split SSTables. If you want
> > to switch compaction strategies, levelled compaction (1.0.x) creates many
> > smaller sstables instead of fewer, bigger ones.
> I don't use levelled compaction; it compacts too often. It might get
> better if the number and size of files used at each level could be tuned.
> But I will try switching to levelled compaction and back again; it
> might do what I want.
>
>
IMHO there is only one good reason left to use ext3: a 100MB /boot
partition, since the boot loaders have an easier time with it.

EXT4 is better than EXT3 in every way. It is the default formatting for
RHEL. Do not fight the future.
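
To put rough numbers on the indirect-block point quoted above, here is a
minimal sketch of the ext2/ext3 block-mapping arithmetic. It assumes 4 KiB
blocks, 4-byte block pointers and 12 direct pointers per inode; those values
are common defaults and are not taken from this thread. The function name
indirect_levels is made up for illustration.

```python
# Sketch: how many indirect blocks ext2/ext3 must traverse to map a file offset.
# Assumed layout (not from the thread): 4 KiB blocks, 4-byte block pointers,
# 12 direct pointers in the inode, then single/double/triple indirect blocks.

BLOCK_SIZE = 4096
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # 1024 pointers fit in one indirect block
DIRECT_BLOCKS = 12                 # direct pointers stored in the inode
GIB = 1024 ** 3


def indirect_levels(offset_bytes):
    """Return 0-3: indirect blocks traversed to reach the data block at offset."""
    block = offset_bytes // BLOCK_SIZE
    if block < DIRECT_BLOCKS:
        return 0
    block -= DIRECT_BLOCKS
    for level in (1, 2, 3):
        span = PTRS_PER_BLOCK ** level   # data blocks reachable at this level
        if block < span:
            return level
        block -= span
    raise ValueError("offset beyond the triple-indirect limit")


if __name__ == "__main__":
    # Look at the last byte of files of a few sizes.
    for size in (40 * 1024, 1 * GIB, 5 * GIB):
        print(size, "bytes ->", indirect_levels(size - 1), "indirect level(s)")
```

Under these assumptions the tail of a 1 GiB file sits behind two indirect
blocks and the tail of a 5 GiB file behind three. ext4 extents avoid this
per-block pointer chain for contiguous ranges.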

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/a_great_reason_to_use
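
On the size tiered compaction point raised in the quoted message (one large
sstable left alone in its bucket), a simplified sketch of the bucketing idea
follows. This is not Cassandra's actual code: the function names, the
0.5x-1.5x similarity window and the threshold of 4 are assumptions chosen for
illustration.

```python
# Simplified sketch of size-tiered bucketing (illustrative, not Cassandra's code).
# Assumptions: an sstable joins a bucket when its size is within 0.5x-1.5x of the
# bucket's average size, and a bucket is compacted once it holds at least
# min_threshold sstables.

def bucket_sstables(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sstable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size; start a new one.
            buckets.append([size])
    return buckets


def compactable(buckets, min_threshold=4):
    """Return only the buckets that hold enough sstables to trigger compaction."""
    return [b for b in buckets if len(b) >= min_threshold]


if __name__ == "__main__":
    sizes_mb = [4096, 900, 950, 1000, 980]   # one 4 GB sstable plus four ~1 GB ones
    buckets = bucket_sstables(sizes_mb)
    print("buckets:    ", buckets)
    print("compactable:", compactable(buckets))
```

With these sample sizes the four ~1 GB sstables share a bucket and qualify
for compaction, while the 4 GB sstable sits alone until enough similarly
sized files appear.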
