Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Pavel Tupitsyn Thu, 27 Jul 2017 07:55:48 -0700

> 1 byte for every field just for this
The GridBinaryMarshaller.STRING data type remains untouched.
We add GridBinaryMarshaller.STRING_ENCODED, which carries an additional byte
for the encoding type.


This means no overhead for existing code.
I think the most common use case is English, which uses 1 byte per char in
UTF-8.
This is already as fast and compact as possible, and we don't want to
introduce any lookup overhead here.

When users know that their data will be more compact in some specific
encoding, they use a BinaryWriter.writeString overload, which writes a
different type code.
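
To make the proposal concrete, here is a minimal sketch of the two wire
layouts, not the actual marshaller code. The type code values, the encoding
id, the (type, length, bytes) layout, and the method names are all
placeholders I made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StringEncodedSketch {
    // Placeholder type codes, chosen only for this illustration.
    static final byte STRING = 9;
    static final byte STRING_ENCODED = 40;

    // Hypothetical one-byte id that tells the reader which charset was used.
    static final byte ENC_WINDOWS_1251 = 1;

    // Existing path: type code, length, UTF-8 bytes. Nothing is added here.
    static byte[] writeString(String s) {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + 4 + data.length)
            .put(STRING)
            .putInt(data.length)
            .put(data)
            .array();
    }

    // Proposed overload: a different type code followed by one encoding byte.
    static byte[] writeString(String s, byte encodingId, Charset charset) {
        byte[] data = s.getBytes(charset);
        return ByteBuffer.allocate(1 + 1 + 4 + data.length)
            .put(STRING_ENCODED)
            .put(encodingId)
            .putInt(data.length)
            .put(data)
            .array();
    }

    public static void main(String[] args) {
        byte[] plain = writeString("abc");
        byte[] encoded = writeString("abc", ENC_WINDOWS_1251, Charset.forName("windows-1251"));
        System.out.println("STRING bytes: " + plain.length);
        System.out.println("STRING_ENCODED bytes: " + encoded.length);
    }
}
```

A reader would branch on the first byte: STRING keeps the current UTF-8 path,
and only STRING_ENCODED reads the extra encoding byte.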

Yes, it also writes an extra byte, but you save a byte per char of the
actual string (for example, when using Windows-1251 instead of UTF-8 for
Russian text), so the extra byte is negligible.
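
A quick way to check the arithmetic is the sketch below. It assumes the JVM
provides the windows-1251 charset; the class and method names are
hypothetical, and the "+ 1" stands for the proposed encoding byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizeSketch {
    // Payload size with the existing STRING type (always UTF-8).
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Payload size with the proposed STRING_ENCODED type:
    // one extra byte for the encoding id plus the encoded characters.
    static int encodedSize(String s, Charset charset) {
        return 1 + s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // "Privet" in Cyrillic, written with escapes so the source file encoding does not matter.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(utf8Size(russian));            // 12: two bytes per Cyrillic char
        System.out.println(encodedSize(russian, cp1251)); // 7: one byte per char plus the encoding byte
    }
}
```

For ASCII-only text the encoded form would be one byte larger than UTF-8,
which is why the plain STRING type stays the default.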

On Thu, Jul 27, 2017 at 5:35 PM, Dmitriy Setrakyan <[email protected]>
wrote:

> Pavel, what would be the size overhead? Are we adding 1 byte for every
> field just for this? If you would like to have this info in the binary
> object directly, can we in this case have some bitmap of field-to-encoding?
>
> D.
>
> On Thu, Jul 27, 2017 at 9:22 AM, Pavel Tupitsyn <[email protected]>
> wrote:
>
> > I'm not sure I understand how this "per field" configuration is supposed
> to
> > be implemented.
> > * Marshaller is not tied to a cache. It serializes all kinds of things,
> > like compute job parameters and results.
> > * Raw mode does not involve field names.
> >
> > Also it seems like a complicated and expensive solution - looking up
> string
> > format somewhere in the metadata will be slow.
> >
> > "encoded string" data type suggestion from Vladimir looks better to me
> from
> > performance and implementation standpoint.
> >
> > Thanks,
> > Pavel
> >
> >
> >
> > On Thu, Jul 27, 2017 at 5:10 PM, Dmitriy Setrakyan <
> [email protected]>
> > wrote:
> >
> > > On Thu, Jul 27, 2017 at 9:04 AM, Igor Sapego <[email protected]>
> wrote:
> > >
> > > > Just a note from the platforms guy:
> > > >
> > > > Solution with table-level configuration is going to be significantly
> > > > harder to implement for platforms and ODBC than the field-level one.
> > > >
> > >
> > > Igor, it seems like you are advocating the per-cell configuration, not
> > > per-field one. The per-field configuration can be defined at the
> > > table/cache level.
> > >
> > > I see your point about C++ and .NET integrations however. Can't we
> > provide
> > > this info at node-join time or table-creation time? This way all nodes
> > will
> > > receive it and you will be able to grab it on different platforms.
> > >
> > >
> > > >
> > > > Also, what about binary objects, which are not stored in cache,
> > > > but being marshalled?
> > > >
> > >
> > > I think the default system encoding should be used here. If we don't
> have
> > > configuration for default encoding, we should add it.
> > >
> > >
> > > >
> > > >
> > > > Best Regards,
> > > > Igor
> > > >
> > > > On Wed, Jul 26, 2017 at 7:22 PM, Dmitriy Setrakyan <
> > > [email protected]>
> > > > wrote:
> > > >
> > > > > On Wed, Jul 26, 2017 at 3:40 AM, Vyacheslav Daradur <
> > > [email protected]
> > > > >
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > > > Encoding must be set on a per-field basis. This will give us the
> > > > > > > > most flexible solution at the cost of a 1-byte overhead.
> > > > > >
> > > > > > > Vova, I agree that the encoding should be set on per-field
> basis,
> > > but
> > > > > at
> > > > > > > the table level, not at a cell level.
> > > > > >
> > > > > > Dmitriy, Vladimir,
> > > > > > Let's use both approaches :-)
> > > > > > > We can add a parameter to CacheConfiguration.
> > > > > > > If the parameter specifies cache-level encoding, then the
> > > > > > > marshaller will use the cache's encoding;
> > > > > > > otherwise the marshaller will use per-field encoding.
> > > > > > Of course only if it doesn't complicate the solution.
> > > > > >
> > > > > >
> > > > > I think that it will complicate the solution and will complicate
> the
> > > > > marshalling protocol. The advantage of specifying the encoding at
> > > > > table/cache level is that we don't need to add extra encoding bytes
> > to
> > > > the
> > > > > marshalling protocol.
> > > > >
> > > > > I think Vova was suggesting encoding at the cell level, not at the
> > > field
> > > > > level, which seems to be redundant to me.
> > > > >
> > > > > Vova, do you agree?
> > > > >
> > > >
> > >
> >
>
