Reports is a SuperColumnFamily.

Each report has a unique identifier (report_id), which is the key of
the SuperColumnFamily, so each report is saved in a separate row.

A report consists of report rows (the count may vary between 1 and
500000, but most reports are small).

Each report row is saved in a separate super column. Hector-based code:

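superCfMutator.addInsertion(
  report_id,                             // row key: one row per report
  "Reports",                             // column family name
  HFactory.createSuperColumn(
    report_row_id,                       // super column name: one per report row
    mapper.convertObject(object),        // subcolumns of this report row
    columnDefinition.getTopSerializer(), // serializer for the super column name
    columnDefinition.getSubSerializer(), // serializer for the subcolumn names
    inferringSerializer                  // serializer for the subcolumn values
  )
);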

We have two frequent operations:

1. count report rows by report_id (calculate the number of super columns
in the row).
2. get report rows by report_id and range predicate (get super columns
from the row with a range predicate).
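
To make the two operations concrete, here is a minimal in-memory sketch of
one report row. It is plain Java, not Hector or Cassandra code, and the class
and method names (ReportRowModel, countRows, slice) are made up for
illustration. It also assumes report_row_id sorts numerically, while the real
column family uses BytesType, so the actual ordering is bytewise.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for ONE row of the super column family (one report).
// Illustrative only: this does not talk to Cassandra.
class ReportRowModel {
    // report_row_id -> (subcolumn name -> value), kept sorted by report_row_id,
    // similar to how super columns are kept ordered inside a row.
    private final NavigableMap<Long, Map<String, String>> superColumns = new TreeMap<>();

    // Store one report row as one super column.
    void insert(long reportRowId, Map<String, String> subColumns) {
        superColumns.put(reportRowId, new LinkedHashMap<>(subColumns));
    }

    // Operation 1: number of report rows = number of super columns in the row.
    int countRows() {
        return superColumns.size();
    }

    // Operation 2: super columns with start <= id <= finish, at most `limit` of them.
    Map<Long, Map<String, String>> slice(long start, long finish, int limit) {
        Map<Long, Map<String, String>> result = new LinkedHashMap<>();
        if (start > finish) {
            return result; // empty range, avoids subMap throwing on inverted bounds
        }
        for (Map.Entry<Long, Map<String, String>> e
                : superColumns.subMap(start, true, finish, true).entrySet()) {
            if (result.size() >= limit) {
                break;
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

In this model every super column holds only the handful of fields of a single
report row; it is the row as a whole (the report) that can grow to 500000
entries.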

I can't see any big super columns here :-(

On Fri, Sep 21, 2012 at 3:10 AM, Tyler Hobbs <ty...@datastax.com> wrote:
> I'm not 100% sure that I understand your data model and read patterns correctly,
> but it sounds like you have large supercolumns and are requesting some of
> the subcolumns from individual super columns.  If that's the case, the issue
> is that Cassandra must deserialize the entire supercolumn in memory whenever
> you read *any* of the subcolumns.  This is one of the reasons why composite
> columns are recommended over supercolumns.
>
>
> On Thu, Sep 20, 2012 at 6:45 AM, Denis Gabaydulin <gaba...@gmail.com> wrote:
>>
>> p.s. Cassandra 1.1.4
>>
>> On Thu, Sep 20, 2012 at 3:27 PM, Denis Gabaydulin <gaba...@gmail.com>
>> wrote:
>> > Hi, all!
>> >
>> > We have a cluster with 7 virtual nodes (disk storage is connected to
>> > nodes with iSCSI). The storage schema is:
>> >
>> > Reports:{
>> >     1:{
>> >         1:{"value1":"some val", "value2":"some val"},
>> >         2:{"value1":"some val", "value2":"some val"}
>> >         ...
>> >     },
>> >     2:{
>> >         1:{"value1":"some val", "value2":"some val"},
>> >         2:{"value1":"some val", "value2":"some val"}
>> >         ...
>> >     }
>> >     ...
>> > }
>> >
>> > create keyspace osmp_reports
>> >   with placement_strategy = 'SimpleStrategy'
>> >   and strategy_options = {replication_factor : 4}
>> >   and durable_writes = true;
>> >
>> > use osmp_reports;
>> >
>> > create column family QueryReportResult
>> >   with column_type = 'Super'
>> >   and comparator = 'BytesType'
>> >   and subcomparator = 'BytesType'
>> >   and default_validation_class = 'BytesType'
>> >   and key_validation_class = 'BytesType'
>> >   and read_repair_chance = 1.0
>> >   and dclocal_read_repair_chance = 0.0
>> >   and gc_grace = 432000
>> >   and min_compaction_threshold = 4
>> >   and max_compaction_threshold = 32
>> >   and replicate_on_write = true
>> >   and compaction_strategy =
>> > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>> >   and caching = 'KEYS_ONLY';
>> >
>> > =============================================
>> >
>> > Read/Write CL: 2
>> >
>> > Most of the reports are small, but some of them could have half a
>> > million rows (xml). Typical operations on this dataset are:
>> >
>> > count report rows by report_id (top level id of super column);
>> > get columns (report_rows) by range predicate and limit for given
>> > report_id.
>> >
>> > Data is written once and is never updated.
>> >
>> > So, from time to time a couple of nodes crash with an OOM exception. The
>> > heap dump shows that we have a lot of super columns in memory.
>> > For example, I see that one of the reports is in memory entirely. How
>> > could this be possible? If we don't load the whole report, could
>> > Cassandra do this for some internal reason?
>> >
>> > What should we do to avoid OOMs?
>
>
>
>
> --
> Tyler Hobbs
> DataStax
>
