Thanks, Carlos.
The key point is how to balance spreading the data around the cluster against
the number of partitions a query has to hit.
It is hard to determine which is best. Anyway, thanks for your suggestions; they
helped me a lot.

stone

On Thu, Sep 1, 2016 at 4:54 PM, Carlos Alonso <i...@mrcalonso.com> wrote:

> I guess there's no easy solution for this. The bucketing technique you
> were applying, with the publish_pre extra field making a composite partition
> key, is probably your best bet, but you're right to be concerned that all
> your workload will hit the same node during an hour.
>
> I'd then suggest adding a higher-cardinality extra field, for example the
> seconds value of the timestamp. That will spread the load across 60 partitions per
> datacenter and, of course, when querying, you'll have to try all the
> partitions. It won't scale forever, as the partitions will grow big at some
> point, but if your data is small enough it may work.
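>
> For illustration only, a minimal sketch of what that could look like (the
> bucket column name and the sample values are assumptions, not part of the
> original schema):
>
> CREATE TABLE datacenter (
>   datacentername varchar,
>   bucket int,            -- e.g. the seconds value (0-59) of the publish timestamp
>   publish timestamp,
>   value varchar,
>   PRIMARY KEY ((datacentername, bucket), publish)
> );
>
> -- Reads then fan out over all 60 buckets for the same time range:
> SELECT * FROM datacenter WHERE datacentername = 'dc1' AND bucket = 0
>   AND publish > '2016-08-31 09:00:00' AND publish < '2016-08-31 10:00:00';
> -- ... repeated for bucket = 1 .. 59 (or with an IN clause on bucket).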
>
> Another suggestion would be to use the number of seconds since epoch as
> the extra partitioning key. Instead of querying a range, you'll have to
> issue single-partition queries for each second within your range. This
> solution may mean heavier read workloads but will definitely scale, as
> the size of the partitions shouldn't be an issue.
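>
> Again just as a sketch (the publish_sec column name and the literal values
> are illustrative):
>
> CREATE TABLE datacenter (
>   datacentername varchar,
>   publish_sec bigint,    -- seconds since epoch, i.e. publish truncated to the second
>   publish timestamp,
>   value varchar,
>   PRIMARY KEY ((datacentername, publish_sec), publish)
> );
>
> -- A time-range read becomes one single-partition query per second in the range:
> SELECT * FROM datacenter WHERE datacentername = 'dc1' AND publish_sec = 1472688000;
> SELECT * FROM datacenter WHERE datacentername = 'dc1' AND publish_sec = 1472688001;
> -- ... and so on, up to the end of the requested range.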
>
> I don't think these suggestions will match your requirements off the shelf,
> but hopefully they will give you an idea of how to trade off between partition
> size, number of partitions and read strategy to find the sweet spot for
> your use case.
>
> Regards
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
> On 1 September 2016 at 02:58, Stone Fang <cnstonef...@gmail.com> wrote:
>
>> The access pattern is:
>>
>> select * from datacenter where datacentername = '' and publish > $start_time
>> and publish < $end_time
>>
>> On Wed, Aug 31, 2016 at 8:37 PM, Carlos Alonso <i...@mrcalonso.com>
>> wrote:
>>
>>> Maybe a good question could be:
>>>
>>> What is your access pattern for this data?
>>>
>>> Carlos Alonso | Software Engineer | @calonso
>>> <https://twitter.com/calonso>
>>>
>>> On 31 August 2016 at 11:47, Stone Fang <cnstonef...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>> I have some questions about how to define the clustering key.
>>>>
>>>> I have a table like this:
>>>>
>>>> CREATE TABLE datacenter (
>>>>   datacentername varchar,
>>>>   publish timestamp,
>>>>   value varchar,
>>>>   PRIMARY KEY (datacentername, publish)
>>>> );
>>>>
>>>>
>>>> *Issues:*
>>>> There are only two datacenters, so the data would have only two
>>>> partitions and be stored on two nodes. I want to spread the data evenly
>>>> around the cluster.
>>>>
>>>> Taking this post for reference:
>>>> http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
>>>>
>>>> CREATE TABLE datacenter (
>>>>   datacentername varchar,
>>>>   publish_pre text,
>>>>   publish timestamp,
>>>>   value varchar,
>>>>   PRIMARY KEY ((datacentername, publish_pre), publish)
>>>> );
>>>>
>>>> publish_pre ranges from 1 to 12 (an hour bucket). *But the workload is
>>>> high, and I don't want all the writes within an hour to go to one node.*
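>>>>
>>>> With that table, a time-range read also has to query each publish_pre
>>>> bucket separately. For illustration only (assuming publish_pre simply
>>>> holds the hour as text, and 'dc1' is a made-up datacenter name):
>>>>
>>>> SELECT * FROM datacenter WHERE datacentername = 'dc1' AND publish_pre = '9'
>>>>   AND publish > '2016-08-31 09:00:00' AND publish < '2016-08-31 11:00:00';
>>>> -- repeated with publish_pre = '10' to cover the rest of the range.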
>>>>
>>>> I have no idea how to define the partition key so that the data is spread
>>>> evenly around the cluster without the partitions being split purely by
>>>> time, i.e. so that the data written in a given time window does not all go
>>>> to one node.
>>>>
>>>> thanks
>>>> stone
>>>>
>>>
>>>
>>
>
