Well you could use a DOUBLE value to encode relative positions... first item in list gets key 1.0 insert before first item -> key[first]/2.0; append after last item -> key[last]*2.0; insert after non-final item -> (key[n]+key[n+1])/2.0
Using double precision should give you quite a space to fit items... you should be able to cleanly do 2^10 appends, or 2^10 insert first before hitting the significands... and even then you have 2^51 of those... If you start to hit issues like heavy segments, you can re-normalize the row. -Stephen On 17 October 2011 10:43, Matthias Pfau <p...@l3s.de> wrote: > Thanks for that hint! However, it seems like soundex is a very language > specific algorithm (US English). We have to get into this topic further... > > Kind regards > Matthias > > On 10/13/2011 10:43 PM, Stephen Connolly wrote: >> >> Then just use a soundex function on the first word in the text... that >> will shrink it sufficiently and give nice buckets in near sequential >> order (http://en.wikipedia.org/wiki/Soundex) >> >> On 13 October 2011 21:21, Matthias Pfau<p...@l3s.de> wrote: >>> >>> Hi Stephen, >>> we are hashing the first 8 byte (8 US-ASCII characters) of text that has >>> been written by humans. Wouldn't it be easy for the attacker to do a >>> dictionary attack on this text, especially if he knows the language of >>> the >>> text? >>> >>> Kind regards >>> Matthias >>> >>> On 10/13/2011 08:20 PM, Stephen Connolly wrote: >>>> >>>> in theory, however they have less than 32 bits of entropy from which >>>> they can do that, leaving them with at least 32 more bits of >>>> combinations to try... that's 2 billion or so... must be a big >>>> dictionary >>>> >>>> - Stephen >>>> >>>> --- >>>> Sent from my Android phone, so random spelling mistakes, random nonsense >>>> words and other nonsense are a direct result of using swype to type on >>>> the screen >>>> >>>> On 13 Oct 2011 17:57, "Matthias Pfau"<p...@l3s.de<mailto:p...@l3s.de>> >>>> wrote: >>>> >>>> Hi Stephen, >>>> this sounds very reasonable. But wouldn't this enable an attacker to >>>> execute dictionary attacks in order to "decrypt" the first 8 bytes >>>> of the plain text? >>>> >>>> Kind regards >>>> Matthias >>>> >>>> On 10/13/2011 05:03 PM, Stephen Connolly wrote: >>>> >>>> It wouldn't be unencrypted... which is the point >>>> >>>> you use a one way linear hash function to take the first, say 8 >>>> bytes, >>>> of unencrypted data and turn it into 4 bytes of a sort prefix. >>>> >>>> You've used lost half the data in the process, so effectively >>>> each bit >>>> is an OR of two bits and you can only infer from 0 values... so >>>> data >>>> is still encrypted, but you have an approximate sorting. >>>> >>>> For example, if your data is US-ASCII text with no numbers, you >>>> could >>>> use Soundex to get the pre-key, so that worst case you have a >>>> bucket >>>> of values in the range. >>>> >>>> Using this technique, a random get will have to get the values >>>> at the >>>> desired prefix +/- a small amount rather than the whole row... >>>> on the >>>> client side you can then decrypt the data and sort that small >>>> bucket >>>> to get the correct index position. >>>> >>>> You could do a 1 byte prefix, but that only gives you at best 256 >>>> buckets and assumes that the first 2 bytes are uniformly >>>> distributed... you've said your data is not uniformly >>>> distributed, so >>>> a linear hash function sounds like your best bet. >>>> >>>> your hash function should have the property that hash(A)>= >>>> hash(B) if >>>> and only if A>= B >>>> >>>> On 13 October 2011 08:47, Matthias Pfau<p...@l3s.de >>>> <mailto:p...@l3s.de>> wrote: >>>> >>>> Hi Stephen, >>>> this is a great idea but unfortunately doesn't work for us >>>> either as we can >>>> not store the data in an unencrypted form. >>>> >>>> Kind regards >>>> Matthias >>>> >>>> On 10/12/2011 07:42 PM, Stephen Connolly wrote: >>>> >>>> >>>> could you prefix the data with 3-4 bytes of a linear >>>> hash of the >>>> unencypted data? it wouldn't be a perfect sort, but >>>> you'd have less of a >>>> range to query to get the sorted values? >>>> >>>> - Stephen >>>> >>>> --- >>>> Sent from my Android phone, so random spelling mistakes, >>>> random nonsense >>>> words and other nonsense are a direct result of using >>>> swype to type on >>>> the screen >>>> >>>> On 12 Oct 2011 17:57, "Matthias Pfau"<p...@l3s.de >>>> <mailto:p...@l3s.de><mailto:pfau@__l3s.de >>>> <mailto:p...@l3s.de>>> >>>> wrote: >>>> >>>> Unfortunately, that is not an option as we have to >>>> store the data in >>>> an compressed and encrypted and therefore binary and >>>> non-sortable form. >>>> >>>> On 10/12/2011 06:39 PM, David McNelis wrote: >>>> >>>> Is it an option to not convert the data to >>>> binary prior to >>>> inserting >>>> into Cassandra? Also, how large are the strings >>>> you're sorting? >>>> If its >>>> viable to not convert to binary before writing >>>> to Cassandra, and >>>> you use >>>> one of the string based column ordering >>>> techniques (utf8, ascii, >>>> for >>>> example), then the data would be sorted without >>>> you needing to >>>> specifically worry about that. Of course, if >>>> the strings are >>>> lengthy >>>> you could run into additional issues. >>>> >>>> On Wed, Oct 12, 2011 at 11:34 AM, Matthias >>>> Pfau<p...@l3s.de<mailto:p...@l3s.de> >>>> <mailto:p...@l3s.de<mailto:p...@l3s.de>> >>>> <mailto:p...@l3s.de >>>> <mailto:p...@l3s.de><mailto:pfa...@l3s.de >>>> <mailto:p...@l3s.de>>>> wrote: >>>> >>>> Hi there, >>>> we are currently building a prototype based >>>> on cassandra and >>>> came >>>> into problems on implementing sorted lists >>>> containing >>>> millions of items. >>>> >>>> The special thing about the items of our >>>> lists is, that >>>> cassandra is >>>> not able to sort them as the data is stored >>>> in a binary >>>> format which >>>> is not sortable. However, we are able to >>>> sort the data >>>> before the >>>> plain data gets encoded (our application is >>>> responsible for >>>> the order). >>>> >>>> First Approach: Storing Lists in >>>> ColumnFamilies >>>> *** >>>> We first tried to map the list to a single >>>> row of a >>>> ColumnFamily in >>>> a way that the index of the list is mapped >>>> to the column >>>> names and >>>> the items of the list to the column values. >>>> The column names >>>> are >>>> increasing numbers which define the sort >>>> order. >>>> This has the major drawback that big parts >>>> of the list have >>>> to be >>>> rewritten on inserts (because the column >>>> names are numbered >>>> by their >>>> index), which are quite common. >>>> >>>> >>>> Second Approach: Storing the whole List as >>>> Binary Data: >>>> *** >>>> We tried to store the compressed list in a >>>> single column. >>>> However, >>>> this is only feasible for smaller lists. Our >>>> lists are far >>>> to big >>>> leading to multi megabyte reads and writes. >>>> As we need to >>>> read and >>>> update the lists quite often, this would put >>>> our Cassandra >>>> cluster >>>> under a lot of pressure. >>>> >>>> Ideal Solution: Native support for storing >>>> lists >>>> *** >>>> We would be very happy with a way to store a >>>> list of sorted >>>> values >>>> without making improper use of column names >>>> for the list >>>> index. This >>>> implies that we would need a possibility to >>>> insert values at >>>> defined >>>> positions. We know that this could lead to >>>> problems with >>>> concurrent >>>> inserts in a distributed environment, but >>>> this is handled by >>>> our >>>> application logic. >>>> >>>> >>>> What are your ideas on that? >>>> >>>> Thanks >>>> Matthias >>>> >>>> >>>> >>>> >>>> -- >>>> *David McNelis* >>>> Lead Software Engineer >>>> Agentis Energy >>>> www.agentisenergy.com >>>> >>>> <http://www.agentisenergy.com><http://__www.agentisenergy.com >>>> <http://www.agentisenergy.com>> >>>> <http://www.agentisenergy.com> >>>> c: 219.384.5143 >>>> <tel:219.384.5143><tel:219.384.5143<tel:219.384.5143>> >>>> >>>> /A Smart Grid technology company focused on >>>> helping consumers of >>>> energy >>>> control an often under-managed resource./ >>>> >>>> >>>> >>>> >>>> >>>> >>> >>> > >