Thanks.

Oh, I forgot to mention that I'm using Cassandra 1.1.0-beta2, in case that question comes up.

I'm hoping someone can offer some more feedback on the likelihood of this behavior ...

Thanks again,
Safdar

On Jun 24, 2012 9:22 PM, "Dave Brosius" <dbros...@mebigfatguy.com> wrote:
> Well, it sounds like this doesn't apply to you.
>
> If you had set up your column family in CQL as .... PRIMARY KEY
> (domain_name, path).... or something like that, and were looking at lots
> and lots of URL pages (domain_name + path) but from a very small number of
> domain_names, then the partition key being just the domain_name could
> account for an uneven distribution.
>
> But it sounds like your key is just a URL, so that should (in theory) be
> fine.
>
> On 06/24/2012 01:53 PM, Safdar Kureishy wrote:
>
> Hi Dave,
>
> Would you mind elaborating a bit more on that, preferably with an
> example? AFAIK, Solandra uses the unique id of the Solr document as the
> input for calculating the md5 hash for shard/node assignment. In this case
> the ids are just millions of varied web URLs that do *not* adhere to any
> regular expression. I'm not sure if that answers your question below?
>
> Thanks,
> Safdar
>
> On Sun, Jun 24, 2012 at 8:38 PM, Dave Brosius <dbros...@mebigfatguy.com> wrote:
>
>> If I read what you are saying, you are _not_ using composite keys?
>> That's one thing that could do it, if the first part of the composite key
>> had a very very low cardinality.
>>
>> On 06/24/2012 11:00 AM, Safdar Kureishy wrote:
>>
>> Hi,
>>
>> I've searched online but was unable to find any leads for the problem
>> below. This mailing list seemed the most appropriate place. Apologies in
>> advance if that isn't the case.
>>
>> I'm running a 5-node Solandra cluster (Solr + Cassandra). I've set up
>> the nodes with tokens *evenly distributed across the token space*, for a
>> 5-node cluster (as evidenced below under the "effective-ownership" column
>> of the "nodetool ring" output). My data is a set of a few million crawled
>> web pages, crawled using Nutch, and also indexed using the "solrindex"
>> command available through Nutch. AFAIK, the key for each document generated
>> from the crawled data is the URL.
>>
>> Based on the "load" values for the nodes below, despite adding about 3
>> million web pages to this index via the HTTP REST API (e.g.:
>> http://9.9.9.x:8983/solandra/index/update....), some nodes are still
>> "empty". Specifically, nodes 9.9.9.1 and 9.9.9.3 have just a few kilobytes
>> (shown in *bold* below) of the index, while the remaining 3 nodes are
>> consistently getting hammered by all the data. If the RandomPartitioner
>> (which is what I'm using for this cluster) is supposed to achieve an even
>> distribution of keys across the token space, why is the data below
>> skewed in this fashion? Literally, no key has yet been hashed to
>> nodes 9.9.9.1 and 9.9.9.3 below. Could someone possibly shed some light on
>> this absurdity?
>>
>> [me@hm1 solandra-app]$ bin/nodetool -h hm1 ring
>> Address  DC           Rack   Status  State   Load        Effective-Owership  Token
>>                                                                              136112946768375385385349842972707284580
>> 9.9.9.0  datacenter1  rack1  Up      Normal  7.57 GB     20.00%              0
>> 9.9.9.1  datacenter1  rack1  Up      Normal  *21.44 KB*  20.00%              34028236692093846346337460743176821145
>> 9.9.9.2  datacenter1  rack1  Up      Normal  14.99 GB    20.00%              68056473384187692692674921486353642290
>> 9.9.9.3  datacenter1  rack1  Up      Normal  *50.79 KB*  20.00%              102084710076281539039012382229530463435
>> 9.9.9.4  datacenter1  rack1  Up      Normal  15.22 GB    20.00%              136112946768375385385349842972707284580
>>
>> Thanks in advance.
>>
>> Regards,
>> Safdar
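
For anyone who wants to try out the point about key cardinality from the thread above, here is a minimal sketch in Python. It assumes a simplified model of the RandomPartitioner: the token is the MD5 digest of the key read as a signed 128-bit integer with the absolute value taken, and each node owns the range that ends at its own token. The tokens are copied from the "nodetool ring" output; the URLs and domain names are made up for illustration, and the helper names (token_for, owner, distribution) are hypothetical. Solandra's own shard/key scheme is not modeled, so this does not reproduce or explain the skew reported above. It only shows what the two key shapes would be expected to do under that assumption.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

# (token, node) pairs copied from the "nodetool ring" output in the thread.
RING = [
    (0, "9.9.9.0"),
    (34028236692093846346337460743176821145, "9.9.9.1"),
    (68056473384187692692674921486353642290, "9.9.9.2"),
    (102084710076281539039012382229530463435, "9.9.9.3"),
    (136112946768375385385349842972707284580, "9.9.9.4"),
]
TOKENS = [token for token, _ in RING]


def token_for(key: str) -> int:
    """Assumed token rule: abs() of the MD5 digest as a signed 128-bit int."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner(key: str) -> str:
    """Return the node whose token is the first one >= the key's token.

    Keys hashing past the largest token wrap around to the first node.
    """
    idx = bisect_left(TOKENS, token_for(key))
    return RING[idx % len(RING)][1]


def distribution(keys) -> Counter:
    """Count how many of the given keys land on each node."""
    return Counter(owner(key) for key in keys)


if __name__ == "__main__":
    # Hypothetical high-cardinality keys: one distinct URL per document.
    urls = [f"http://site{i % 500}.example/page/{i}" for i in range(50000)]
    print("URL keys:   ", sorted(distribution(urls).items()))

    # Hypothetical low-cardinality keys: only the domain name is hashed.
    domains = ["a.example", "b.example", "c.example"] * 1000
    print("domain keys:", sorted(distribution(domains).items()))
```

With distinct URLs as keys, every node in this model should receive a share of the keys close to its ownership percentage. With only three distinct keys, at most three nodes can receive anything, no matter how many rows share those keys.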