Hello Vyacheslav!

Unfortunately, I have not found any efficient algorithm that would let me
use an external dictionary as a pre-processed data structure. If plain gzip
is used without a dictionary, the compression ratio is around 0.7, as
opposed to the 0.4 that I get with the custom implementation. AFAIR, the
performance was also worse. I didn't really try gzip with a dictionary, but
I assume performance will be even worse, since it will have to scan the
dictionary before getting to the actual data.
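
For reference, here is a minimal sketch of what "gzip with a dictionary"
could look like using the JDK's built-in deflate, which accepts a preset
dictionary through Deflater.setDictionary and Inflater.setDictionary. The
class name, method names and sample strings are hypothetical and are not
taken from PR 4295. As far as I know, the dictionary here is fed to the
compressor again for every stream rather than kept as a pre-built structure,
which is the cost discussed above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper for illustration; not part of Ignite or of PR 4295. */
public class DictDeflateSketch {
    /** Deflates data; a non-null dict is used as a preset dictionary. */
    static byte[] compress(byte[] data, byte[] dict) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null)
                def.setDictionary(dict); // must happen before the first deflate()
            def.setInput(data);
            def.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!def.finished())
                out.write(buf, 0, def.deflate(buf));
            return out.toByteArray();
        } finally {
            def.end(); // release native zlib memory
        }
    }

    /** Inflates packed; dict must be the same bytes that were used to compress. */
    static byte[] decompress(byte[] packed, byte[] dict) {
        Inflater inf = new Inflater();
        try {
            inf.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n > 0) {
                    out.write(buf, 0, n);
                } else if (inf.needsDictionary()) {
                    // The stream header says a preset dictionary was used.
                    if (dict == null)
                        throw new IllegalArgumentException("stream requires a dictionary");
                    inf.setDictionary(dict);
                } else if (inf.needsInput()) {
                    throw new IllegalArgumentException("truncated stream");
                }
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt stream", e);
        } finally {
            inf.end();
        }
    }

    public static void main(String[] args) {
        // Made-up dictionary of field names that recur in the serialized values.
        byte[] dict = "\"firstName\":\"lastName\":\"city\":\"country\":"
            .getBytes(StandardCharsets.UTF_8);
        byte[] data = ("{\"firstName\":\"Ivan\",\"lastName\":\"Petrov\","
            + "\"city\":\"Saint Petersburg\",\"country\":\"Russia\"}")
            .getBytes(StandardCharsets.UTF_8);

        System.out.println("original:        " + data.length);
        System.out.println("deflate:         " + compress(data, null).length);
        System.out.println("deflate + dict:  " + compress(data, dict).length);
    }
}
```

Note that a stream written with a dictionary cannot be inflated without the
same dictionary bytes, so the dictionary has to be persisted along with the
data for this to survive a restart.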

We have such a huge array of tests that we can just run them all with
compression enabled and see if there are any new failures. In any case, the
impact of my commit is fairly low: it is only triggered when data is written
to a page (maybe to WAL also?), and we don't really do much frivolous stuff
to pages.

Still, I am very much interested in finding existing compression
implementations that support an external dictionary. I am also very much
interested in having different implementations of compression for Apache
Ignite (such as per-page compression) and comparing them by benchmark and
by code impact. I am also very interested in large standard datasets for
Apache Ignite (or generators thereof) so that we can run precise benchmarks
on various compression schemes. If you have any of the above, please
get back to me.

Regards,
-- 
Ilya Kasnacheev


Mon, Aug 27, 2018 at 11:35, Vyacheslav Daradur <daradu...@gmail.com>:

> Hi Igniters!
>
> Ilya, I'm glad to see one more person who is interested in the
> compression feature in Ignite.
>
> I looked through the pull request and want to share the following thoughts:
>
> It's very dangerous to use a custom algorithm in this way: you store
> serialized data separately from the dictionary, and there are a lot of
> points where we may lose data: rebalancing, serialization errors, node
> restarts and so on.
>
> I'd suggest the following ways to improve reliability:
> - use well-known algorithms, e.g. zstd, deflate, lzma or gzip, which
> allow us to decompress data in any situation
> - store the dictionary inside the page with the data
>
> Also, we have had a lot of discussions [1] [2] about compression at the
> BinaryObject and BinaryMarshaller level, and Vladimir Ozerov was
> strongly against compression at this level.
> If something has changed since then, you may look through [1] [2] [3].
> I've done a lot of research on comparing algorithms; it may be useful
> for you.
>
> [1]
> http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
> [2]
> http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
> [3] https://issues.apache.org/jira/browse/IGNITE-3592
> [4] https://issues.apache.org/jira/browse/IGNITE-5226
> [5] https://github.com/daradurvs/ignite-compression
>
> On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <dma...@apache.org> wrote:
> >
> > >
> > > Currently, the dictionary for decompression is only stored on heap.
> > > After restart there's compressed data in the PDS, but there's no
> > > dictionary :)
> >
> >
> > Basically, it means that I've lost my data, right? How about persisting
> > data to disk.
> >
> > Overall, we need Vladimir Ozerov to check the contribution. He was the
> > one who sponsored the IEP and knows the area best.
> >
> > --
> > Denis
> >
> > On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev
> > <ilya.kasnach...@gmail.com> wrote:
> >
> > > Hello!
> > >
> > > It is somewhat a part of IEP-20, since I have updated it with this
> > > particular direction.
> > >
> > > Regards,
> > >
> > > --
> > > Ilya Kasnacheev
> > >
> > > 2018-08-24 2:56 GMT+03:00 Denis Magda <dma...@apache.org>:
> > >
> > > > Hi Ilya,
> > > >
> > > > Sounds terrific! Is this part of the following Ignite enhancement
> > > > proposal?
> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev
> > > > <ilya.kasnach...@gmail.com> wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > > My plan was to add a compression section to cache configuration,
> > > > > where you can enable compression, enable key compression (which has
> > > > > heavier performance implications), adjust dictionary gathering
> > > > > settings, and in the future possibly choose between algorithms. In
> > > > > fact I'm not sure about the last one, since my assumption is that
> > > > > you can always just use the latest and greatest, but maybe we can
> > > > > have e.g. a very fast but weaker one vs. a slower but stronger one.
> > > > >
> > > > > I'm not sure yet if we should share dictionary between all caches
> > > > > vs. having separate dictionary for every cache.
> > > > >
> > > > > With regards to data format, of course there will be room for
> > > > > further extension.
> > > > >
> > > > > Regards,
> > > > >
> > > > > --
> > > > > Ilya Kasnacheev
> > > > >
> > > > > 2018-08-23 15:13 GMT+03:00 Sergey Kozlov <skoz...@gridgain.com>:
> > > > >
> > > > > > Hi Ilya
> > > > > >
> > > > > > Is there a plan to introduce it as an option of Ignite
> > > > > > configuration? In that case, instead of a boolean type I suggest
> > > > > > using an enum, to reserve the ability to add compression
> > > > > > algorithms in the future.
> > > > > >
> > > > > > On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev
> > > > > > <ilya.kasnach...@gmail.com> wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > I want to share with the developer community my compression
> > > > > > > prototype.
> > > > > > >
> > > > > > > Long story short, it compresses BinaryObject's byte[] as they
> > > > > > > are written to Durable Memory page, operating on a pre-built
> > > > > > > dictionary. Typical compression ratio is 0.4 (meaning 2.5x
> > > > > > > compression) using custom LZW+Huffman. Metadata, indexes and
> > > > > > > primitive values are unaffected entirely.
> > > > > > >
> > > > > > > This is akin to DB2's table-level compression[1] but
> > > > > > > independently invented.
> > > > > > >
> > > > > > > On Yardstick tests performance hit is -6% with PDS and up to
> > > > > > > -25% (in throughput) with In-Memory loads. It also means you
> > > > > > > can fit ~twice as much data into the same IM cluster, or have
> > > > > > > higher ram/disk ratio with PDS cluster, saving on hardware or
> > > > > > > decreasing latency.
> > > > > > >
> > > > > > > The code is available as PR 4295[2] (set
> > > > > > > IGNITE_ENABLE_COMPRESSION=true to activate). Note that it will
> > > > > > > not presently survive a PDS node restart. The impact is very
> > > > > > > small, the patch should be applicable to most 2.x releases.
> > > > > > >
> > > > > > > Sure there's a long way before this prototype can have hope of
> > > > > > > being included, but first I would like to hear input from
> > > > > > > fellow igniters.
> > > > > > >
> > > > > > > See also IEP-20[3].
> > > > > > >
> > > > > > > 1. https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
> > > > > > > 2. https://github.com/apache/ignite/pull/4295
> > > > > > > 3. https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sergey Kozlov
> > > > > > GridGain Systems
> > > > > > www.gridgain.com
> > > > > >
> > > > >
> > > >
> > >
>
>
>
> --
> Best Regards, Vyacheslav D.
>
