Well it sounds like this doesn't apply to you.
If you had set up your column family in CQL as ... PRIMARY KEY
(domain_name, path) ... or something like that, and were looking
at lots and lots of URL pages (domain_name + path) but from a very
small number of domain_names, then the partition key being just the
domain_name could account for an uneven distribution.
But it sounds like your key is just a URL so that should (in theory) be
fine.
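
For what it's worth, here is a rough Python sketch of that effect.
The token() helper only approximates what RandomPartitioner does
(MD5 of the key taken as a big integer), and the example.com /
example.org URLs are made up, but it shows how hashing only a
low-cardinality domain_name piles every row onto a couple of
tokens, whereas hashing the full URL spreads rows across the ring:

    import hashlib

    def token(key):
        # Rough stand-in for RandomPartitioner: MD5 the key and take
        # the absolute value of the digest as a big integer token.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return abs(int.from_bytes(digest, "big", signed=True))

    # Made-up crawl: lots of pages, but only two domain_names.
    urls = ["http://example.com/page%d" % i for i in range(1000)] + \
           ["http://example.org/page%d" % i for i in range(1000)]

    # Partition key = domain_name only: every row hashes to one of
    # just two tokens, so at most two nodes ever receive data.
    print(len({token(u.split("/")[2]) for u in urls}))   # 2

    # Partition key = the full URL: tokens spread across the ring.
    print(len({token(u) for u in urls}))                 # 2000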
On 06/24/2012 01:53 PM, Safdar Kureishy wrote:
Hi Dave,
Would you mind elaborating a bit more on that, preferably with an
example? AFAIK, Solandra uses the unique id of the Solr document as
the input for calculating the md5 hash for shard/node assignment. In
this case the ids are just millions of varied web URLs that do /not/
adhere to any regular expression. I'm not sure if that answers your
question below?
Thanks,
Safdar
On Sun, Jun 24, 2012 at 8:38 PM, Dave Brosius
<dbros...@mebigfatguy.com> wrote:
If I read what you are saying correctly, you are _not_ using
composite keys? That's one thing that could do it, if the first
part of the composite key had a very, very low cardinality.
On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
Hi,
I've searched online but was unable to find any leads for the
problem below. This mailing list seemed the most appropriate
place. Apologies in advance if that isn't the case.
I'm running a 5-node Solandra cluster (Solr + Cassandra). I've
set up the nodes with tokens /evenly distributed across the token
space/ (as evidenced below under the "Effective-Ownership" column
of the "nodetool ring" output). My data is a set of a few million
web pages crawled with Nutch and indexed via Nutch's "solrindex"
command. AFAIK, the key for each document generated from the
crawled data is the URL.
Based on the "load" values for the nodes below, despite adding
about 3 million web pages to this index via the HTTP REST API
(e.g.: http://9.9.9.x:8983/solandra/index/update....), some nodes
are still "empty". Specifically, nodes 9.9.9.1 and 9.9.9.3 hold
just a few kilobytes (shown in *bold* below) of the index, while
the remaining 3 nodes are consistently getting hammered by all
the data. If the RandomPartitioner (which is what I'm using for
this cluster) is supposed to achieve an even distribution of keys
across the token space, why is the data below skewed in this
fashion? Literally, no key has yet been hashed to nodes 9.9.9.1
and 9.9.9.3. Could someone possibly shed some light on this?
[me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
Address  DC           Rack   Status  State   Load        Effective-Ownership  Token
                                                                              136112946768375385385349842972707284580
9.9.9.0  datacenter1  rack1  Up      Normal  7.57 GB     20.00%               0
9.9.9.1  datacenter1  rack1  Up      Normal  *21.44 KB*  20.00%               34028236692093846346337460743176821145
9.9.9.2  datacenter1  rack1  Up      Normal  14.99 GB    20.00%               68056473384187692692674921486353642290
9.9.9.3  datacenter1  rack1  Up      Normal  *50.79 KB*  20.00%               102084710076281539039012382229530463435
9.9.9.4  datacenter1  rack1  Up      Normal  15.22 GB    20.00%               136112946768375385385349842972707284580
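
For reference, here is a rough Python sketch of how I understand
placement under RandomPartitioner (an approximation, not Cassandra's
or Solandra's actual code): it MD5-hashes some made-up URLs and maps
each token onto the five ranges defined by the node tokens above.
With varied URL keys I would expect the primary replicas to land on
all five nodes in roughly equal counts, which is why the empty nodes
surprise me:

    import hashlib
    from collections import Counter

    # Node tokens taken from the nodetool ring output above
    # (RandomPartitioner token space is roughly 0 .. 2**127).
    NODE_TOKENS = [
        ("9.9.9.0", 0),
        ("9.9.9.1", 34028236692093846346337460743176821145),
        ("9.9.9.2", 68056473384187692692674921486353642290),
        ("9.9.9.3", 102084710076281539039012382229530463435),
        ("9.9.9.4", 136112946768375385385349842972707284580),
    ]

    def token(key):
        # Rough stand-in for RandomPartitioner: MD5 the key and take
        # the absolute value of the digest as a big integer token.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return abs(int.from_bytes(digest, "big", signed=True))

    def primary_owner(key):
        # A node owns the range (previous node's token, its own token];
        # tokens past the highest token wrap around to the first node.
        t = token(key)
        for addr, node_token in NODE_TOKENS:
            if t <= node_token:
                return addr
        return NODE_TOKENS[0][0]

    # Made-up URLs standing in for crawled pages; varied keys should
    # spread primary replicas across all five nodes roughly evenly.
    urls = ["http://example.com/page%d" % i for i in range(10000)]
    print(Counter(primary_owner(u) for u in urls))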
Thanks in advance.
Regards,
Safdar