Reports is a SuperColumnFamily. Each report has a unique identifier (report_id), which is the row key of the SuperColumnFamily, so each report is stored in its own row.
A report consists of report rows (anywhere from 1 to 500000, but most reports are small). Each report row is stored in a separate super column.

Hector-based code:

    superCfMutator.addInsertion(
        report_id,
        "Reports",
        HFactory.createSuperColumn(
            report_row_id,
            mapper.convertObject(object),
            columnDefinition.getTopSerializer(),
            columnDefinition.getSubSerializer(),
            inferringSerializer
        )
    );

We have two frequent operations:

1. Count report rows by report_id (count the super columns in the row).
2. Get report rows by report_id and a range predicate (get super columns from the row with a range predicate).

I can't see any big super columns here :-(

On Fri, Sep 21, 2012 at 3:10 AM, Tyler Hobbs <ty...@datastax.com> wrote:
> I'm not 100% sure that I understand your data model and read patterns correctly,
> but it sounds like you have large supercolumns and are requesting some of
> the subcolumns from individual super columns. If that's the case, the issue
> is that Cassandra must deserialize the entire supercolumn in memory whenever
> you read *any* of the subcolumns. This is one of the reasons why composite
> columns are recommended over supercolumns.
>
>
> On Thu, Sep 20, 2012 at 6:45 AM, Denis Gabaydulin <gaba...@gmail.com> wrote:
>>
>> p.s. Cassandra 1.1.4
>>
>> On Thu, Sep 20, 2012 at 3:27 PM, Denis Gabaydulin <gaba...@gmail.com>
>> wrote:
>> > Hi, all!
>> >
>> > We have a cluster with 7 virtual nodes (disk storage is connected to
>> > nodes with iSCSI). The storage schema is:
>> >
>> > Reports:{
>> >   1:{
>> >     1:{"value1":"some val", "value2":"some val"},
>> >     2:{"value1":"some val", "value2":"some val"}
>> >     ...
>> >   },
>> >   2:{
>> >     1:{"value1":"some val", "value2":"some val"},
>> >     2:{"value1":"some val", "value2":"some val"}
>> >     ...
>> >   }
>> >   ...
>> > }
>> >
>> > create keyspace osmp_reports
>> >   with placement_strategy = 'SimpleStrategy'
>> >   and strategy_options = {replication_factor : 4}
>> >   and durable_writes = true;
>> >
>> > use osmp_reports;
>> >
>> > create column family QueryReportResult
>> >   with column_type = 'Super'
>> >   and comparator = 'BytesType'
>> >   and subcomparator = 'BytesType'
>> >   and default_validation_class = 'BytesType'
>> >   and key_validation_class = 'BytesType'
>> >   and read_repair_chance = 1.0
>> >   and dclocal_read_repair_chance = 0.0
>> >   and gc_grace = 432000
>> >   and min_compaction_threshold = 4
>> >   and max_compaction_threshold = 32
>> >   and replicate_on_write = true
>> >   and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>> >   and caching = 'KEYS_ONLY';
>> >
>> > =============================================
>> >
>> > Read/Write CL: 2
>> >
>> > Most of the reports are small, but some of them could have half a
>> > million rows (xml). Typical operations on this dataset are:
>> >
>> > count report rows by report_id (top level id of super column);
>> > get columns (report_rows) by range predicate and limit for given
>> > report_id.
>> >
>> > Data is written once and is never updated.
>> >
>> > So, from time to time a couple of nodes crash with an OOM exception. The heap
>> > dump says that we have a lot of super columns in memory.
>> > For example, I see that one of the reports is in memory entirely. How could
>> > that be possible? If we don't load the whole report, could Cassandra
>> > still do this for some internal reasons?
>> >
>> > What should we do to avoid OOMs?
>
>
> --
> Tyler Hobbs
> DataStax
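
To make the two read operations concrete, here is a minimal sketch that runs without a cluster. It is not Hector code: a TreeMap is a hypothetical stand-in for one Reports row (report_row_id -> serialized report row), and the names PagedReportRead, slice and countPaged are made up for illustration. It only shows the access pattern, a range read with a limit and a count done page by page so that at most one page of report rows is held at a time. It says nothing about what the server keeps in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for one "Reports" row: report_row_id -> serialized report row. */
class PagedReportRead {

    /** Returns at most `limit` entries with id >= start (mimics a range predicate with a limit). */
    static List<Map.Entry<Long, String>> slice(NavigableMap<Long, String> row, long start, int limit) {
        List<Map.Entry<Long, String>> page = new ArrayList<>();
        for (Map.Entry<Long, String> e : row.tailMap(start, true).entrySet()) {
            if (page.size() == limit) break;
            page.add(e);
        }
        return page;
    }

    /** Counts report rows page by page, so only `pageSize` entries are held at a time. */
    static long countPaged(NavigableMap<Long, String> row, int pageSize) {
        long total = 0;
        long start = Long.MIN_VALUE;
        while (true) {
            List<Map.Entry<Long, String>> page = slice(row, start, pageSize);
            total += page.size();
            if (page.size() < pageSize) return total;   // short page: end of the row
            long last = page.get(page.size() - 1).getKey();
            if (last == Long.MAX_VALUE) return total;   // no id can follow
            start = last + 1;                           // resume just after the last id seen
        }
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> row = new TreeMap<>();
        for (long id = 1; id <= 2500; id++) row.put(id, "<row id='" + id + "'/>");
        System.out.println(countPaged(row, 1000));       // prints 2500
        System.out.println(slice(row, 2499, 10).size()); // prints 2
    }
}
```

With a real Thrift-based client the start of a slice range is inclusive, so the usual variant is to restart from the last column name seen and skip the duplicate, rather than computing last + 1 as this sketch does.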