Re: RandomPartitioner is providing a very skewed distribution of keys across a 5-node Solandra cluster

Safdar Kureishy Sun, 24 Jun 2012 12:12:53 -0700

Thanks.
Oh, I forgot to mention that I'm using Cassandra 1.1.0-beta2, in case that
question comes up.
Hoping someone can offer some more feedback on the likelihood of this
behavior ...
Thanks again,
Safdar
On Jun 24, 2012 9:22 PM, "Dave Brosius" <dbros...@mebigfatguy.com> wrote:


>  Well, it sounds like this doesn't apply to you.
>
> If you had set up your column family in CQL as .... PRIMARY KEY
> (domain_name, path).... or something like that and were looking at lots
> and lots of URL pages (domain_name + path), but from a very small number
> of domain_names, then the partition key being just the domain_name could
> account for an uneven distribution.
>
> But it sounds like your key is just a URL so that should (in theory) be
> fine.
>
>
>
> On 06/24/2012 01:53 PM, Safdar Kureishy wrote:
>
> Hi Dave,
>
>  Would you mind elaborating a bit more on that, preferably with an
> example? AFAIK, Solandra uses the unique id of the Solr document as the
> input for calculating the md5 hash for shard/node assignment. In this case
> the ids are just millions of varied web URLs that do *not* adhere to any
> regular expression. I'm not sure if that answers your question below?
>
>  Thanks,
> Safdar
>
> On Sun, Jun 24, 2012 at 8:38 PM, Dave Brosius <dbros...@mebigfatguy.com> wrote:
>
>>  If I read what you are saying, you are _not_ using composite keys?
>> That's one thing that could do it, if the first part of the composite key
>> had a very very low cardinality.
>>
>>
>> On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
>>
>>  Hi,
>>
>>  I've searched online but was unable to find any leads for the problem
>> below. This mailing list seemed the most appropriate place. Apologies in
>> advance if that isn't the case.
>>
>>  I'm running a 5-node Solandra cluster (Solr + Cassandra). I've set up
>> the nodes with tokens *evenly distributed across the token space*, for a
>> 5-node cluster (as evidenced below under the "effective-ownership" column
>> of the "nodetool ring" output). My data is a set of a few million crawled
>> web pages, crawled using Nutch, and also indexed using the "solrindex"
>> command available through Nutch. AFAIK, the key for each document generated
>> from the crawled data is the URL.
>>
>>  Based on the "load" values for the nodes below, despite adding about 3
>> million web pages to this index via the HTTP REST API (e.g.:
>> http://9.9.9.x:8983/solandra/index/update....), some nodes are still
>> "empty". Specifically, nodes 9.9.9.1 and 9.9.9.3 have just a few kilobytes
>> (shown in *bold* below) of the index, while the remaining 3 nodes are
>> consistently getting hammered by all the data. If the RandomPartitioner
>> (which is what I'm using for this cluster) is supposed to achieve an even
>> distribution of keys across the token space, why is it that the data below
>> is skewed in this fashion? Literally, no key has yet been hashed to the
>> nodes 9.9.9.1 and 9.9.9.3 below. Could someone possibly shed some light on
>> this absurdity?
>>
>>  [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
>> Address         DC          Rack        Status State   Load        Effective-Owership  Token
>>                                                                                        136112946768375385385349842972707284580
>> 9.9.9.0         datacenter1 rack1       Up     Normal  7.57 GB     20.00%              0
>> 9.9.9.1         datacenter1 rack1       Up     Normal  *21.44 KB*  20.00%              34028236692093846346337460743176821145
>> 9.9.9.2         datacenter1 rack1       Up     Normal  14.99 GB    20.00%              68056473384187692692674921486353642290
>> 9.9.9.3         datacenter1 rack1       Up     Normal  *50.79 KB*  20.00%              102084710076281539039012382229530463435
>> 9.9.9.4         datacenter1 rack1       Up     Normal  15.22 GB    20.00%              136112946768375385385349842972707284580
>>
>>  Thanks in advance.
>>
>>  Regards,
>>  Safdar
>>
>>
>>
>
>
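
For reference, the expectation discussed in this thread can be checked offline. The sketch below is not Solandra or Cassandra code. It assumes that RandomPartitioner derives a token as the absolute value of the key's MD5 digest read as a signed big-endian integer, and that each node owns the range ending at its own token (with wrap-around to the lowest token). The token list is copied from the ring output quoted above; the sample keys and the helper names (random_partitioner_token, owner_index, distribution) are hypothetical. It also assumes the row key is the raw URL, which the thread does not confirm.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# Node tokens copied from the "nodetool ring" output in the thread.
TOKENS = [
    0,
    34028236692093846346337460743176821145,
    68056473384187692692674921486353642290,
    102084710076281539039012382229530463435,
    136112946768375385385349842972707284580,
]


def random_partitioner_token(key: bytes) -> int:
    """Assumed token function: abs() of the MD5 digest as a signed integer."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner_index(token: int) -> int:
    """Index of the first node whose token is >= the key token.

    Tokens above the highest node token wrap around to the first node.
    """
    return bisect_left(TOKENS, token) % len(TOKENS)


def distribution(keys):
    """Count how many keys land on each node (primary replica only)."""
    counts = Counter(owner_index(random_partitioner_token(k)) for k in keys)
    return [counts.get(i, 0) for i in range(len(TOKENS))]


# Hypothetical keys: many distinct URLs versus only three distinct values.
varied_keys = [f"http://example.com/page/{n}".encode() for n in range(100_000)]
low_cardinality_keys = [f"domain{n % 3}.example".encode() for n in range(100_000)]

print("varied keys:         ", distribution(varied_keys))
print("low-cardinality keys:", distribution(low_cardinality_keys))
```

Under these assumptions, many distinct keys should land on all five nodes in roughly equal numbers, while a key with only three distinct values can reach at most three nodes. If the real load looks like the second case, it may be worth checking which row keys are actually being written, rather than suspecting the partitioner itself.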
