Wouldn't that be ignoring the fact that the hash is just a "prefix" and that the unique key still follows that prefix ;)? So yes, the data may be just as clumpy as with OPP, but only within a node, which I don't really see as a big deal at that point. Or am I missing something? Maybe the default implementation could use a 3-byte prefix so everyone would be happy.

My main point is that I think Cassandra could use OPP underneath, like HBase, and then expose an RP or OPP selection at column family creation time. That would be nice, so that I don't have to write the code myself (and so no one else has to write it themselves).
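To make the prefix idea concrete, here is a minimal sketch in Python of the kind of layer I mean. The names `to_stored_key`, `from_stored_key` and `PREFIX_BYTES` are made up for illustration and are not part of any Cassandra API, and md5 is only one possible choice of hash.

```python
import hashlib

# Hypothetical default prefix width; the thread discusses 1, 2 or 3 bytes.
PREFIX_BYTES = 3


def to_stored_key(key: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    """Prepend the first prefix_bytes of md5(key) to the original key.

    Under an order-preserving partitioner, the prefix decides roughly where
    the row lands, and the untouched original key keeps the row unique.
    """
    digest = hashlib.md5(key).digest()
    return digest[:prefix_bytes] + key


def from_stored_key(stored: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    """Strip the hash prefix off again when reading the key back."""
    return stored[prefix_bytes:]


if __name__ == "__main__":
    for i in range(3):
        original = f"user{i:04d}".encode()
        stored = to_stored_key(original)
        # Adjacent original keys get unrelated prefixes, so they spread out.
        print(original, "->", stored.hex())
        assert from_stored_key(stored) == original
```

With a 3-byte prefix each row key grows by 3 bytes, compared with 16 bytes for the full md5+originalkey scheme. Rows that share a prefix still sort by their original key within that prefix.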
Any info on #1 and #2?

thanks,
Dean

On Fri, Sep 9, 2011 at 10:08 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> On Fri, Sep 9, 2011 at 10:34 AM, Dean Hiller <d...@alvazan.com> wrote:
>
>> I saw this quote in the pdf.....
>>
>> "For large indexes with common terms this too much data! Queries with
>> > 100k hits"
>>
>> 1. What would be considered large? In most of my experience, we have the
>> typical size of an RDBMS index but just have many, many more indexes, as
>> the size of the index just depends on our largest partition, based on
>> how we partition the data.
>>
>> 2. Does solandra have a lucene api underlying implementation? Our
>> preference is to use lucene's api, and the underlying implementation
>> could be lucene, lucandra or solandra.
>>
>> 3. Why not just use an 8-bit or 16-bit key as the prefix instead of a
>> sha, with the rest of the key being unique, as the user would have to
>> choose a unique key to begin with? After all, the hash only has to be
>> bigger than the max number of nodes, and 2^16 is quite large.
>>
>> thanks,
>> Dean
>>
>> On Thu, Sep 8, 2011 at 4:10 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>>> On Thu, Sep 8, 2011 at 5:12 PM, Dean Hiller <d...@alvazan.com> wrote:
>>>
>>>> I was wondering something. Since I can take OPP and create a layer
>>>> that, for certain column families, hashes the key so that some column
>>>> families are just like RP but on top of OPP, while my other column
>>>> families are on OPP directly so I could use lucandra, why not make RP
>>>> deprecated and instead allow users to choose OPP or RP per column
>>>> family, where RP == doing the hash of the key on my behalf, prefixing
>>>> my key with that hashcode, and stripping it back off when I read it
>>>> in again.
>>>>
>>>> ie. why have RP when you could do RP per column family with the above
>>>> reasoning on top of OPP and have the best of both worlds?
>>>>
>>>> ie. I think of having some column families random and then some column
>>>> families ordered so I could range query or use lucandra on top of
>>>> those ones.
>>>>
>>>> thoughts? I was just curious.
>>>> thanks,
>>>> Dean
>>>
>>> You can use ByteOrderedPartitioner and hash data yourself. However, that
>>> makes every row key 128 bits larger, as the key has to be:
>>>
>>> md5+originalkey
>>>
>>> http://www.datastax.com/wp-content/uploads/2011/07/Scaling_Solr_with_Cassandra-CassandraSF2011.pdf
>>>
>>> Solandra now uses a 'modified' RandomPartitioner.
>
> I am not quite sure that using 8 bits is good enough. It will shard your
> data across a small number of nodes effectively, however I can imagine the
> SSTables will be "clumpy" because you reduce your sorting. It seems like a
> http://en.wikipedia.org/wiki/Birthday_problem to me. (I could be wrong)