There is a lot of overhead in the serialized data itself (just have a look at an sstable file).
It would be great to be able to compress at the byte-array level rather than at the string level.

Regards,
Terje

On 1 Feb 2011, at 03:15, "David G. Boney" <dbon...@semanticartifacts.com> wrote:

> In Cassandra, strings are stored as UTF-8. In arithmetic coding compression, the modeling is separate from the coding. A standard arrangement is to have an order-0 model (frequencies of individual bytes), an order-1 model (frequencies of two-byte sequences), and an order-2 model (frequencies of three-byte sequences). For multibyte Unicode encodings, which almost all fall into two- or three-byte encodings, the higher-order models should capture the statistics. Having said that, there is nothing to prevent someone from developing a UTF-8-specific model that captures the statistics better.
>
> An adaptive arithmetic coding model starts with the probabilities of all symbols being equal. This results in every symbol using 8 bits per encoding. The model adapts with each symbol it sees, but it takes seeing some data before compression kicks in. I do not know, nor have I seen reports in the literature, how many bytes it takes on average before compression starts. All the papers I have seen have studied large blocks of text.
>
> My assumption is that the majority of strings to compress in Cassandra will be short strings, less than 32 bytes, to medium-length strings, less than 256 bytes. An example of short strings might be column names. An example of medium-length strings would be URLs. My assumption is that most short strings would not get much compression from adaptive arithmetic coding, but medium-length to longer strings would. In an effort to get higher compression on the short strings, static encoding methods could be used. A static encoding method uses a hard-coded frequency table for the bytes. The start of the compressed string could carry a 2-bit code: 00 - no compression, 01 - static, default probability table, 10 - static, locale probability table, 11 - adaptive coding. The default static coding would be based on English. For the static locale probability table, the 2-bit code would need to be followed by a code indicating which probability table to use for a particular language. The locale probability tables would be stored in the compressor jar file as a separate class for each locale supported (this issue needs more analysis; what I don't think is effective is to load the probability tables from disk each time a new compressor is created). During the encoding phase, both adaptive and static coding of the string would occur. In general, compression with a static coding table is very fast; static coding is basically a lookup in a table with 256 entries. It is the adaptive coding that is more computationally intensive. Furthermore, you could place a limit on the use of static coding, say strings shorter than 256 bytes; this would need to be set empirically. The shorter output of the two methods is used as the compressed string. There is no additional burden on the decoding side, since you know the type of compression encoding.
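A minimal sketch of the encode-both-ways-and-tag scheme described in the quoted paragraph above. This is illustrative only and not from the thread: the tag occupies a whole leading byte rather than being packed into 2 bits, the Coder interface and all class and method names are hypothetical stand-ins, and the static and adaptive arithmetic coders themselves are not implemented here.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class CompressedStringFraming {

        // Method tags mirroring the 2-bit codes in the message above; here each
        // tag occupies a whole leading byte for simplicity.
        static final byte RAW            = 0x00; // 00 - no compression
        static final byte STATIC_DEFAULT = 0x01; // 01 - static, default table
        static final byte STATIC_LOCALE  = 0x02; // 10 - static, locale table (not handled below)
        static final byte ADAPTIVE       = 0x03; // 11 - adaptive coding

        /** Placeholder for a static-table or adaptive arithmetic coder. */
        interface Coder {
            byte[] encode(byte[] utf8);
            byte[] decode(byte[] compressed);
        }

        /** Encode with both coders, keep whichever output is shorter, and fall
         *  back to the raw UTF-8 bytes if neither actually shrinks the string. */
        static byte[] encode(String s, Coder staticCoder, Coder adaptiveCoder) {
            byte[] raw = s.getBytes(StandardCharsets.UTF_8);
            byte[] best = raw;
            byte tag = RAW;

            byte[] st = staticCoder.encode(raw);
            if (st.length < best.length) { best = st; tag = STATIC_DEFAULT; }

            byte[] ad = adaptiveCoder.encode(raw);
            if (ad.length < best.length) { best = ad; tag = ADAPTIVE; }

            byte[] out = new byte[best.length + 1];
            out[0] = tag;                              // record which method won
            System.arraycopy(best, 0, out, 1, best.length);
            return out;
        }

        /** Decode by dispatching on the leading tag byte, so the decoder never
         *  has to guess which method was used. */
        static String decode(byte[] framed, Coder staticCoder, Coder adaptiveCoder) {
            byte[] body = Arrays.copyOfRange(framed, 1, framed.length);
            switch (framed[0]) {
                case RAW:            return new String(body, StandardCharsets.UTF_8);
                case STATIC_DEFAULT: return new String(staticCoder.decode(body), StandardCharsets.UTF_8);
                case ADAPTIVE:       return new String(adaptiveCoder.decode(body), StandardCharsets.UTF_8);
                default: throw new IllegalArgumentException("unknown method tag " + framed[0]);
            }
        }
    }

Because the chosen method is recorded up front, the decoder never pays for the write-time experiment of trying both encodings, which is the "no additional burden on the decoding side" point made in the message.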
> It might be the case that block compressors can achieve greater compression ratios. From what I have read on the mailing list and JIRA, it appears that there is a lot of pain involved in implementing block-based compressors. This method, a compressed string data type, is presented as a method that is minimally invasive to the Cassandra architecture. Since everything is already stored as a byte array, nothing would need to be changed in the internal data types. Furthermore, no surgery of the internal tables is needed. The main addition is the development of comparators, for keys and column headings, that know how to decompress the byte arrays before comparing them. There is also the additional work of writing the codec, but there are a number of examples in C and Java from which to draw. Moffat's web site would be a good place to start.
>
> -------------
> Sincerely,
> David G. Boney
> dbon...@semanticartifacts.com
> http://www.semanticartifacts.com
>
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
>
>> On 31 January 2011 04:41, David G. Boney <dbon...@semanticartifacts.com> wrote:
>>> I propose a simple idea for compression using a compressed string datatype.
>>>
>>> The compressed string datatype could be implemented for column family keys by creating a compressed string ordered partitioner. The compressed string ordered partitioner works by decompressing the string and then applying an ordered partitioner for strings to the decompressed string. The hash-based partitioner would be used with the compressed string without any modification. The compressed string datatype could be implemented for column keys by creating a compressed string comparator. A compressed string comparator would work by decompressing the string and then applying a string comparator. The compressed string datatype could also be implemented for column values. The compressed string datatype would be an internal datatype for Cassandra. The compressed string would be converted to a string before returning the value to a client. I suppose you could also have an option of passing the compressed form back to the client if the client wanted it that way.
>>>
>>> I propose using an adaptive arithmetic coding compressor. This type of compression can be done a byte at a time. It will build a compression model only on the string that is presented, a byte at a time. See the papers below.
>>>
>>> Moffat, Alistair, Radford M. Neal, & Ian H. Witten (1998), "Arithmetic Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>>>
>>> Witten, Ian H., Radford M. Neal, & John G. Cleary (1987), "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
>>>
>>> It has been reported that arithmetic coding based compression applied to text can achieve compression ratios of up to 2.2 bits per character. Even assuming you only get 4 bits per character because of short strings, that would be a 50% compression of text data, including keys and column names. Many applications would benefit. It should speed up the overall operation of Cassandra because you would be moving significantly less data through the system.
>>
>> I have not read those papers, but how do they and their reported compression ratios apply to the Unicode characters that Java strings are stored as?
>>
>> -Stephen
>>
>>> This would provide a compression option that could be implemented without any redesign of the internal structure of Cassandra except for a new partitioner class, a new comparator class, a new datatype class, and the compression class.
>>> -------------
>>> Sincerely,
>>> David G. Boney
>>> dbon...@semanticartifacts.com
>>> http://www.semanticartifacts.com
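A minimal sketch of the decompress-then-compare idea from the quoted proposal. It is illustrative only and not from the thread: the class name is made up, the decompression step is passed in as a plain function standing in for the arithmetic decoder, and the wiring into Cassandra's actual partitioner and comparator plug-in points is not shown.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.Comparator;
    import java.util.function.Function;

    public class CompressedStringComparator implements Comparator<ByteBuffer> {

        // Stand-in for the arithmetic decoder: compressed bytes -> original UTF-8 bytes.
        private final Function<byte[], byte[]> decompress;

        public CompressedStringComparator(Function<byte[], byte[]> decompress) {
            this.decompress = decompress;
        }

        private String inflate(ByteBuffer buf) {
            byte[] raw = new byte[buf.remaining()];
            buf.duplicate().get(raw);          // copy without disturbing the caller's position
            return new String(decompress.apply(raw), StandardCharsets.UTF_8);
        }

        @Override
        public int compare(ByteBuffer a, ByteBuffer b) {
            // Ordering is defined on the decompressed strings, so range queries and
            // ordered partitioning behave exactly as they would on uncompressed text.
            return inflate(a).compareTo(inflate(b));
        }
    }

The design choice is that compression stays invisible above this layer: anything that sorts or range-scans sees the same order it would see for the plain strings, at the cost of decompressing on every comparison.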