On Thu, May 10, 2012 at 9:05 PM, aaron morton <aa...@thelastpickle.com> wrote:

> Kewl.
>
> I'd be interested to know what you come up with.
>


Hi,

it's taken some thought, however we now have a data model that we like, and
it does indeed make our major concerns non-existent ;-) So I thought I'd
explain it in case someone is doing something sufficiently close to us that
the model is useful (and in case we are doing something silly - I hope not).

The data set consists of 30 years of daily data for several million
entities; the data for each entity is a small number of different record
types (< 10), where <entity,date,record_type> is unique. Each record_type
can have a couple of hundred key/value pairs.

The query that we need to do is

  Set_of_Values = Get(<set_of_entities>, <date_range>, <set_of_keys>)

Where set_of_keys is likely to include most of the keys that are valid for
the entities.
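
To make that concrete, a call and its result might look something like the
sketch below - the function name and the values are made up purely for
illustration, they are not our actual API:

    # hypothetical query shape (Python, illustrative only)
    values = get(entities={"entity_17", "entity_42"},
                 date_range=("2010-01-01", "2010-12-31"),   # inclusive
                 keys={"open", "close", "volume"})
    # 'values' maps (entity, date, key) -> value, e.g.
    # values[("entity_42", "2010-06-30", "close")] == 12.34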

One slight complication (the one that sparked my initial question) is that
there are also corrections that completely replace the data for an
<entity,date,record_type>. Multiple versions of these corrections can be
transmitted, but there is only one correction per <entity,date,record_type>.

The data model that we have designed has a single Column Family keyed by the
entity, with a composite column name of <date,version,record_type> and the
value being a protobuf packing of the key/value pairs from the record. The
version is the 'receipt date of the data' minus the 'date the data is for'.
The properties of this that we like (there is a rough sketch of the write
path after the list) are:-

* Record insertion is idempotent, allowing for multiple active/active,
order-independent loaders; this is a really big win for us (1).
* The random partitioner gives us good scalability across the entity
dimension, which is the largest dimension.
* The column ordering makes it easy to find the most recent 'correct' value
for an entity on a day.
* The column ordering gives us reasonably efficient date range queries.
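
In case it makes the layout clearer, here is a rough sketch of the write path
in Python, with a plain dict standing in for the Column Family and the date
handling simplified - an illustration rather than our actual loader code:

    from datetime import date

    def column_name(data_date, receipt_date, record_type):
        # version = 'receipt date of the data' - 'date the data is for', in days
        version = (receipt_date - data_date).days
        return (data_date.isoformat(), version, record_type)

    def insert(cf, entity, data_date, receipt_date, record_type, blob):
        # Row key is the entity, the composite column name is
        # <date,version,record_type>, and the value is the protobuf blob.
        # Replaying the same record writes the same cell with the same value,
        # so any number of loaders can process the same files in any order.
        cf.setdefault(entity, {})[column_name(data_date, receipt_date, record_type)] = blob

    cf = {}
    insert(cf, "entity_42", date(2012, 5, 1), date(2012, 5, 2), "type_a", b"<protobuf>")
    insert(cf, "entity_42", date(2012, 5, 1), date(2012, 5, 2), "type_a", b"<protobuf>")   # replay, no change
    insert(cf, "entity_42", date(2012, 5, 1), date(2012, 5, 9), "type_a", b"<correction>") # higher version

A correction simply lands as a new column with a higher version, so nothing
has to be read or locked before it is written.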

There are a couple of implications of this data model:-

* We store more data than we strictly have to in an ideal world.
* We push the work of decoding/extracting information from the protobuf onto
the clients, along with some of the version management (sketched below).
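
A rough sketch of that client-side read path, again with a plain dict
standing in for a sliced row and made-up values:

    from datetime import date

    # toy slice of one entity's row: composite column name -> protobuf blob
    columns = {
        ("2012-05-01", 1, "type_a"): b"<original protobuf>",
        ("2012-05-01", 8, "type_a"): b"<corrected protobuf>",
        ("2012-05-02", 1, "type_a"): b"<original protobuf>",
    }

    def latest_for_day(columns, data_date, record_type):
        # Columns sort by <date,version,record_type>, so the highest version
        # for the requested day/record_type is the current 'correct' value;
        # the client takes the last match and ignores the earlier versions.
        matches = sorted(name for name in columns
                         if name[0] == data_date.isoformat() and name[2] == record_type)
        return columns[matches[-1]] if matches else None

    blob = latest_for_day(columns, date(2012, 5, 1), "type_a")
    # blob == b"<corrected protobuf>"; the client then decodes the protobuf
    # and extracts only the keys it actually needs.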

My view is that this is a reasonable trade-off for systems that can have
large numbers of clients that are independent of each other, as scaling
client machines is not hard.

Feedback welcome

cheers

(1) It's important as it allows us to use a large number of loading
processes to insert the historical data, which is pretty large, in a short
period of time.


>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10/05/2012, at 3:03 PM, Franc Carter wrote:
>
>
>
> On Tue, May 8, 2012 at 8:21 PM, Franc Carter <franc.car...@sirca.org.au> wrote:
>
>> On Tue, May 8, 2012 at 8:09 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> Can you store the corrections in a separate CF?
>>>
>>
> We sat down and thought about this harder - it looks like a good solution
> for us that may make other hard problems go away - thanks.
>
> cheers
>
>
>> Yes, I thought of that, but that turns one read into two ;-(
>>
>>
>>>
>>> When the client reads the key, read from both the original and the corrections CF
>>> at the same time. Apply the correction only on the client side.
>>>
>>> When you have confirmed the ingest has completed, run a background job
>>> to apply the corrections, store the updated values and delete the
>>> correction data.
>>>
>>
>> I was thinking down this path, but I ended up chasing the rabbit down a
>> deep hole of race conditions . . .
>>
>> cheers
>>
>>
>>>
>>> Cheers
>>>
>>>   -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 8/05/2012, at 9:35 PM, Franc Carter wrote:
>>>
>>>
>>> Hi,
>>>
>>> I'm wondering if there is a common 'pattern' to address a scenario we
>>> will have to deal with.
>>>
>>> We will be storing a set of Column/Value pairs per Key where the
>>> Column/Values are read from a set of files that we download regularly. We
>>> need the loading to be resilient and we can receive corrections for some of
>>> the Column/Values that can only be loaded after the initial data has been
>>> inserted.
>>>
>>> The challenge we have is that we have a strong preference for
>>> active/active loading of data and can't see how to achieve this without
>>> some form of serialisation (which Cassandra doesn't support - correct ?)
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
>


-- 

*Franc Carter* | Systems architect | Sirca Ltd
franc.car...@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215
