index sampling

Radim Kolar Tue, 27 Dec 2011 08:35:33 -0800

> That is a good reason for both to be configurable IMO.

Index sampling is currently configurable only per node. It would be better to have it per keyspace, because we are running both OLTP-like and OLAP keyspaces in the same cluster. The OLAP keyspaces have about 1000x more rows.

But it is difficult to estimate index sampling memory until there is a way to monitor the memory used by index sampling: https://issues.apache.org/jira/browse/CASSANDRA-3662 . Java can use about 10x more memory than the raw data for an index sample entry, and from sstable/IndexSummary.java it seems that Cassandra is using one big ArrayList with <RowPosition,long> entries.

On a node with 300m rows (a small node), there will be 585937 index sample entries with sampling 512. Let's say 100 bytes per entry; this will be 585 MB, and bloom filters are 884 MB. With the default sampling of 128, sampled entries will use the majority of node memory. Index sampling should be reworked like bloom filters to avoid allocating one large array per sstable. Hadoop MapFile uses sampling 128 by default too, and it reads the entire MapFile index into memory.
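
To make the arithmetic above easy to re-run with other numbers, here is a rough sketch in Python. The function names and the per-entry sizes are assumptions for illustration only, not measurements from Cassandra. Note that 585937 entries at 100 bytes each comes to about 58.6 MB; the 585 MB figure corresponds to roughly 1000 bytes per entry, which would match 100 bytes of raw data times the ~10x Java overhead mentioned earlier. Both readings are shown.

```python
# Back-of-the-envelope estimate of index sample memory.
# All names and per-entry sizes here are illustrative assumptions.

def index_sample_entries(rows: int, sampling_interval: int) -> int:
    """One sampled index entry is kept for every `sampling_interval` rows."""
    return rows // sampling_interval


def index_sample_bytes(rows: int, sampling_interval: int, bytes_per_entry: int) -> int:
    """Estimated memory for all sampled entries, given an assumed entry size."""
    return index_sample_entries(rows, sampling_interval) * bytes_per_entry


if __name__ == "__main__":
    rows = 300_000_000  # the "small node" from the text
    for interval in (512, 128):
        entries = index_sample_entries(rows, interval)
        # 100 bytes: raw entry size guess; 1000 bytes: with ~10x Java overhead.
        for bytes_per_entry in (100, 1000):
            megabytes = index_sample_bytes(rows, interval, bytes_per_entry) / 1e6
            print(f"sampling={interval} entries={entries} "
                  f"bytes/entry={bytes_per_entry} -> {megabytes:.1f} MB")
```

Under the ~1000 bytes per entry reading, sampling 128 gives about four times the 512 figure, which is where the "majority of node memory" estimate comes from.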

It should be clearly documented in http://wiki.apache.org/cassandra/LargeDataSetConsiderations that bloom filters + index sampling will be responsible for most of the memory used by a node. Caching itself is of minimal use on a large data set used for OLAP.
