Very good points indeed. I get the compression-in-Ignite question quite often, and the HANA reference is a typical lead-in.
My personal opinion is still that in Ignite *specifically* compression is best left to the end user. But we may need to provide a better facility to inject the user's logic here (rough sketches inline and at the end of this message)...

--
Nikita Ivanov

On Tue, Jul 26, 2016 at 9:53 PM, Andrey Kornev <andrewkor...@hotmail.com> wrote:

> Dictionary compression requires some knowledge about the data being
> compressed. For example, for numeric types the range of values must be
> known so that the dictionary can be generated. For strings, the number of
> unique values in the column is the key piece of input into the dictionary
> generation.
>
> SAP HANA is a column-based database system: it stores the fields of a
> data tuple individually, using the best compression for the given data
> type and the particular set of values. HANA was built specifically as a
> general-purpose database, rather than as an afterthought layer on top of
> an already existing distributed cache.
>
> Ignite, on the other hand, is a distributed cache implementation (a
> pretty good one!) that in general requires no schema and stores its data
> in a row-based fashion. Its current design doesn't lend itself readily to
> the kind of optimizations HANA provides out of the box.
>
> For the curious types among us, the implementation details of HANA are
> well documented in "In-Memory Data Management" by Hasso Plattner &
> Alexander Zeier.
>
> Cheers
> Andrey
>
> _____________________________
> From: Alexey Kuznetsov <akuznet...@gridgain.com>
> Sent: Tuesday, July 26, 2016 5:36 AM
> Subject: Re: Data compression in Ignite 2.0
> To: <dev@ignite.apache.org>
>
> Sergey Kozlov wrote:
> >> For approach 1: Putting a large object into a partitioned cache will
> >> force an update of the dictionary kept in a replicated cache. It may
> >> be a time-expensive operation.
> The dictionary will be built only once. And we could control what goes
> into the dictionary: for example, we could check minimum and maximum
> sizes and decide whether to put a value into the dictionary or not.
>
> >> Approaches 2-3 make sense only for rare cases, as Sergi commented.
> But it is better to at least have the possibility to plug in user code
> for compression than not to have it at all.
>
> >> Also, I see a danger of OOM if we've got a high compression ratio and
> >> try to restore the original value in memory.
> We could easily get an OOM with many other operations right now, without
> compression. I don't think it is an issue; we could add a NOTE to the
> documentation about this possibility.
>
> Andrey Kornev wrote:
> >> ... in general I think compression is a great idea. The cleanest way
> >> to achieve that would be to just make it possible to chain the
> >> marshallers...
> I think it is also a good idea. And it looks like it could be used for
> compression with some sort of ZIP algorithm, but how do we deal with
> compression by dictionary substitution? We need to build the dictionary
> first. Any ideas?
>
> Nikita Ivanov wrote:
> >> SAP HANA does the compression by 1) compressing SQL parameters before
> >> execution...
> Looks interesting, but my initial point was about compression of cache
> data, not SQL queries. My idea was to make compression transparent to the
> SQL engine when it looks up data.
>
> But the idea of compressing SQL query results looks very interesting,
> because it is a known fact that the SQL engine can consume quite a lot of
> heap for storing result sets. I think this should be discussed in a
> separate thread.
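To make the "inject user's logic" point above a bit more concrete, here is a rough, untested sketch of the marshaller-chaining idea Andrey and Alexey discuss. The SimpleMarshaller interface and the class names are made up for illustration — this is not an existing Ignite API, just one possible shape for a pluggable hook — and GZIP via java.util.zip merely stands in for "some sort of ZIP algorithm":

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Stand-in for whatever marshalling SPI ends up being chainable. */
interface SimpleMarshaller {
    byte[] marshal(Object obj) throws IOException;
    Object unmarshal(byte[] bytes) throws IOException, ClassNotFoundException;
}

/** Baseline delegate: plain JDK serialization. */
class JdkSerializationMarshaller implements SimpleMarshaller {
    @Override public byte[] marshal(Object obj) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
        }
        return buf.toByteArray();
    }

    @Override public Object unmarshal(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}

/** Decorator that transparently GZIPs whatever its delegate produces. */
class GzipMarshaller implements SimpleMarshaller {
    private final SimpleMarshaller delegate;

    GzipMarshaller(SimpleMarshaller delegate) {
        this.delegate = delegate;
    }

    @Override public byte[] marshal(Object obj) throws IOException {
        byte[] raw = delegate.marshal(obj);

        ByteArrayOutputStream buf = new ByteArrayOutputStream(raw.length);
        try (GZIPOutputStream gzip = new GZIPOutputStream(buf)) {
            gzip.write(raw);
        }
        return buf.toByteArray();
    }

    @Override public Object unmarshal(byte[] bytes) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];

        // Inflate first, then let the delegate deserialize the raw bytes.
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            for (int n; (n = gzip.read(chunk)) > 0; )
                raw.write(chunk, 0, n);
        }
        return delegate.unmarshal(raw.toByteArray());
    }
}

The user's chain would then simply be "new GzipMarshaller(new JdkSerializationMarshaller())", and a dictionary-based marshaller could be stacked the same way.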
> Just for your information: in my first message I mentioned that DB2 has
> dictionary-based compression, and according to them it is possible to
> compress typical data by 50-80%. I have some experience with DB2 and can
> confirm this.
>
> --
> Alexey Kuznetsov
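And regarding Alexey's question above about needing to build the dictionary first: below is a toy, untested sketch of the dictionary-substitution part. All names here are made up for illustration; the only assumptions are the ones already in this thread — the dictionary is built once from a sample of the data and could then be shared, e.g. via a replicated cache, per approach 1:

import java.util.*;

/** Toy dictionary codec for a single string column. */
class StringDictionary {
    /** Code returned for values that are not in the dictionary. */
    static final int MISS = -1;

    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    /** Builds the dictionary once from a representative sample of values. */
    static StringDictionary build(Collection<String> sample) {
        StringDictionary d = new StringDictionary();

        // Dedupe and fix a stable order, so every node derives the same codes.
        for (String s : new TreeSet<>(sample)) {
            d.codes.put(s, d.values.size());
            d.values.add(s);
        }
        return d;
    }

    /** Encodes a value as a small integer code, or MISS if it is unknown. */
    int encode(String s) {
        Integer c = codes.get(s);
        return c != null ? c : MISS; // on a miss, the caller stores the raw value
    }

    /** Decodes a previously assigned code back to the original value. */
    String decode(int code) {
        return values.get(code);
    }
}

Storing the raw value on a miss is also where the "decide whether to put a value into the dictionary or not" check from earlier in the thread would plug in.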