2010/4/28 Даниел Симеонов <dsimeo...@gmail.com>:
> Hi Sylvain,
> Thank you very much! I still have some further questions. I didn't find
> how the row cache is configured.
Provided you don't use trunk but something stable like 0.6.1 (which you
should), it is in storage-conf.xml. It is one of the options in the column
family definitions (it is documented in the file).

> Regarding the splitting of rows, I understand that it is not so
> necessary; still, I am curious whether it is implementable in client
> code.

Well, I'm not sure there is any simple way to do it (at least not
efficiently). Counting the number of columns in a row is expensive, plus
there is no easy way to implement counters in Cassandra (even though
https://issues.apache.org/jira/browse/CASSANDRA-580 will make that better
someday).

> Best regards, Daniel.
>
> 2010/4/28 Sylvain Lebresne <sylv...@yakaz.com>:
>>
>> 2010/4/28 Даниел Симеонов <dsimeo...@gmail.com>:
>> > Hi,
>> > I have a question: if a row in a column family has only columns, are
>> > all of the columns deserialized into memory when you need any one of
>> > them? As I understood it, that is the case.
>>
>> No, it's not. Only the columns you request are deserialized into
>> memory. The only thing is that, as of now, the entire row is
>> deserialized at once during compaction, so it still has to fit in
>> memory. But depending on the typical size of your columns, you can
>> easily have millions of columns in a row without it being a problem at
>> all.
>>
>> > And if the column family is a super column family, is only the
>> > (entire) super column brought into memory?
>>
>> Yes, that part is true. That is the problem with the current
>> implementation of super columns. While you can have lots of columns in
>> one row, you probably don't want lots of columns in one super column
>> (but it is no problem to have lots of super columns in one row).
>>
>> > What about the row cache, is it different from the memtable?
>>
>> Be careful with the row cache. If the row cache is enabled, then yes,
>> any read in a row will read the entire row.
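For reference, in 0.6 the row cache is turned on per column family in
storage-conf.xml. A sketch of what such a definition might look like (the
attribute names and values here are from memory of the 0.6-era format, so
verify them against the comments in your own storage-conf.xml):

```xml
<!-- Illustrative 0.6-style column family definition.
     RowsCached / KeysCached accept absolute counts or percentages;
     a row cache of 0 (the default) disables row caching entirely. -->
<ColumnFamily Name="Events"
              CompareWith="LongType"
              RowsCached="10000"
              KeysCached="100000"/>
```

As Sylvain notes below, a cached row is read whole, so this only pays off
for column families whose rows are small or always read in full.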
>> So you typically don't want to use the row cache on a column family
>> whose rows have lots of columns (unless you always read all the
>> columns in the row each time, of course).
>>
>> > I have another question. Let's say there is only data to be
>> > inserted, and the solution is to add columns to rows in a column
>> > family. Is it possible in Cassandra to split a row once a certain
>> > threshold is reached, say 100 columns per row? And what about
>> > concurrent inserts?
>>
>> No, Cassandra can't do that for you. But you should be okay with what
>> you describe below. That is, if a given row corresponds to an hour of
>> data, that will limit its size. And again, the number of columns in a
>> row is not really limited as long as the overall size of the row fits
>> easily in memory.
>>
>> > The original data model and use case is to insert timestamped data
>> > and to make range queries. The original keys of the CF rows were of
>> > the form <id>.<timestamp>, each with a single column of data, and
>> > OPP was used. This is not an optimal solution, since some nodes get
>> > hotter than others. I am thinking of changing the model to have keys
>> > like <id>.<year/month/day> and then a list of columns with
>> > timestamps within that range, using the RandomPartitioner, or
>> > keeping OPP but preprocessing part of the key with MD5, i.e. the key
>> > is MD5(<id>.<year/month/day>) + "hour of the day". The only problem
>> > is how to deal with the large number of columns being inserted into
>> > a particular row.
>> > Thank you very much!
>> > Best regards, Daniel.
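The MD5-prefixed key scheme Daniel describes can be built entirely
client-side. A minimal sketch of just the key layout, assuming UTC day
buckets with an hour suffix (the function name and the `.` separator are
my own illustrative choices, not anything from Cassandra):

```python
import hashlib
import time


def bucketed_row_key(entity_id: str, ts: float) -> str:
    """Build a row key of the form MD5(<id>.<year/month/day>) + hour.

    The day bucket caps how many columns accumulate in any one row,
    while hashing the <id>.<day> prefix spreads keys evenly across the
    ring even under an order-preserving partitioner. All samples for a
    given id and hour land in one row, so the columns (keyed by
    timestamp) can still be range-sliced within that hour.
    """
    t = time.gmtime(ts)
    day = time.strftime("%Y/%m/%d", t)
    hour = time.strftime("%H", t)
    digest = hashlib.md5(f"{entity_id}.{day}".encode()).hexdigest()
    return f"{digest}.{hour}"
```

Writes for the same id and hour are idempotent on the key, so concurrent
inserters need no coordination: they all derive the same row key and
simply add columns to it.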