Found one more interesting fact. cfstats reports a compacted row maximum size of 386857368 bytes!
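
To put that number in perspective, here is a rough back-of-the-envelope sketch. It assumes (and this is only an assumption) that the largest compacted row is also the report with the most report rows, i.e. about 500000 super columns. The class and method names are made up for illustration. Under that assumption the row is roughly 369 MiB while each super column averages well under 1 KB, so it is the row that is big, not the individual super columns.

```java
public class CompactedRowSize {
    // "Compacted row maximum size" as reported by cfstats, in bytes.
    static final long MAX_ROW_BYTES = 386_857_368L;

    // Upper bound on report rows per report mentioned in this thread.
    // Assumption: the largest row is the report that hits this bound.
    static final long MAX_REPORT_ROWS = 500_000L;

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    public static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    /** Average serialized size of one super column, rounded down to whole bytes. */
    public static long avgBytesPerSuperColumn(long rowBytes, long superColumns) {
        return rowBytes / superColumns;
    }

    public static void main(String[] args) {
        System.out.println("largest row, MiB: " + toMiB(MAX_ROW_BYTES));
        System.out.println("avg bytes per super column: "
                + avgBytesPerSuperColumn(MAX_ROW_BYTES, MAX_REPORT_ROWS));
    }
}
```

If the real super column count of that row is much lower than 500000, the per-column average grows accordingly, so it would be worth checking the actual count for the biggest report.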

On Fri, Sep 21, 2012 at 12:50 PM, Denis Gabaydulin <gaba...@gmail.com> wrote:
> Reports is a SuperColumnFamily.
>
> Each report has a unique identifier (report_id). This is the key of the
> SuperColumnFamily, and each report is saved in a separate row.
>
> A report consists of report rows (the count may vary between 1 and 500000,
> but most reports are small).
>
> Each report row is saved in a separate super column. Hector-based code:
>
> superCfMutator.addInsertion(
>     report_id,
>     "Reports",
>     HFactory.createSuperColumn(
>         report_row_id,
>         mapper.convertObject(object),
>         columnDefinition.getTopSerializer(),
>         columnDefinition.getSubSerializer(),
>         inferringSerializer
>     )
> );
>
> We have two frequent operations:
>
> 1. count report rows by report_id (calculate the number of super columns
>    in the row).
> 2. get report rows by report_id and range predicate (get super columns
>    from the row with a range predicate).
>
> I can't see any big super columns here :-(
>
> On Fri, Sep 21, 2012 at 3:10 AM, Tyler Hobbs <ty...@datastax.com> wrote:
>> I'm not 100% sure that I understand your data model and read patterns
>> correctly, but it sounds like you have large supercolumns and are
>> requesting some of the subcolumns from individual super columns. If
>> that's the case, the issue is that Cassandra must deserialize the entire
>> supercolumn in memory whenever you read *any* of the subcolumns. This is
>> one of the reasons why composite columns are recommended over
>> supercolumns.
>>
>>
>> On Thu, Sep 20, 2012 at 6:45 AM, Denis Gabaydulin <gaba...@gmail.com> wrote:
>>>
>>> p.s. Cassandra 1.1.4
>>>
>>> On Thu, Sep 20, 2012 at 3:27 PM, Denis Gabaydulin <gaba...@gmail.com>
>>> wrote:
>>> > Hi, all!
>>> >
>>> > We have a cluster of 7 virtual nodes (disk storage is connected to
>>> > the nodes over iSCSI). The storage schema is:
>>> >
>>> > Reports:{
>>> >   1:{
>>> >     1:{"value1":"some val", "value2":"some val"},
>>> >     2:{"value1":"some val", "value2":"some val"}
>>> >     ...
>>> >   },
>>> >   2:{
>>> >     1:{"value1":"some val", "value2":"some val"},
>>> >     2:{"value1":"some val", "value2":"some val"}
>>> >     ...
>>> >   }
>>> >   ...
>>> > }
>>> >
>>> > create keyspace osmp_reports
>>> >   with placement_strategy = 'SimpleStrategy'
>>> >   and strategy_options = {replication_factor : 4}
>>> >   and durable_writes = true;
>>> >
>>> > use osmp_reports;
>>> >
>>> > create column family QueryReportResult
>>> >   with column_type = 'Super'
>>> >   and comparator = 'BytesType'
>>> >   and subcomparator = 'BytesType'
>>> >   and default_validation_class = 'BytesType'
>>> >   and key_validation_class = 'BytesType'
>>> >   and read_repair_chance = 1.0
>>> >   and dclocal_read_repair_chance = 0.0
>>> >   and gc_grace = 432000
>>> >   and min_compaction_threshold = 4
>>> >   and max_compaction_threshold = 32
>>> >   and replicate_on_write = true
>>> >   and compaction_strategy =
>>> > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>>> >   and caching = 'KEYS_ONLY';
>>> >
>>> > =============================================
>>> >
>>> > Read/Write CL: 2
>>> >
>>> > Most of the reports are small, but some of them can have half a
>>> > million rows (xml). Typical operations on this dataset are:
>>> >
>>> > count report rows by report_id (top level id of super column);
>>> > get columns (report_rows) by range predicate and limit for given
>>> > report_id.
>>> >
>>> > Data is written once and is never updated.
>>> >
>>> > From time to time a couple of nodes crash with an OOM exception. The
>>> > heap dump shows that we have a lot of super columns in memory.
>>> > For example, I see that one of the reports is in memory entirely. How
>>> > is that possible? We don't load the whole report ourselves; could
>>> > Cassandra do this for some internal reason?
>>> >
>>> > What should we do to avoid OOMs?
>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax
>>