Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Pavel Tupitsyn Fri, 28 Jul 2017 01:46:12 -0700

Val, of course other options should be available, such as BinaryTypeConfiguration,
and maybe field-level and class-level annotations.


On Thu, Jul 27, 2017 at 9:07 PM, Valentin Kulichenko <
[email protected]> wrote:

> Pavel,
>
> This forces the user to implement Binarylizable for the whole type in case they
> want to change the encoding for one or two fields, right? I really don't like it.
> Why not add a default encoding to BinaryTypeConfiguration?
>
> -Val
>
> On Thu, Jul 27, 2017 at 7:54 AM, Pavel Tupitsyn <[email protected]>
> wrote:
>
> > > 1 byte for every field just for this
> > The GridBinaryMarshaller.STRING data type remains untouched.
> > We add GridBinaryMarshaller.STRING_ENCODED, which has an additional byte for
> > the encoding type.
> >
> > This means no overhead for existing code.
> > I think the most common use case is English, which uses 1 byte per char in
> > UTF-8.
> > This is already as fast and compact as possible, and we don't want to
> > introduce any lookup overhead here.
> >
> > And when the user knows that their data will be more compact in some
> > specific encoding, they use some BinaryWriter.writeString overload, which
> > writes a different type code.
> >
> > Yes, it also writes an extra byte, but you save a byte per char of the
> > actual string (for example, when using Windows-1251 for Russian text), so
> > this does not matter.
> >
> > On Thu, Jul 27, 2017 at 5:35 PM, Dmitriy Setrakyan <[email protected]>
> > wrote:
> >
> > > Pavel, what would be the size overhead? Are we adding 1 byte for every
> > > field just for this? If you would like to have this info in the binary
> > > object directly, can we in this case have some bitmap of field-to-encoding?
> > >
> > > D.
> > >
> > > On Thu, Jul 27, 2017 at 9:22 AM, Pavel Tupitsyn <[email protected]>
> > > wrote:
> > >
> > > > I'm not sure I understand how this "per field" configuration is supposed to
> > > > be implemented.
> > > > * Marshaller is not tied to a cache. It serializes all kinds of things,
> > > > like compute job parameters and results.
> > > > * Raw mode does not involve field names.
> > > >
> > > > Also it seems like a complicated and expensive solution - looking up the
> > > > string format somewhere in the metadata will be slow.
> > > >
> > > > The "encoded string" data type suggestion from Vladimir looks better to me
> > > > from a performance and implementation standpoint.
> > > >
> > > > Thanks,
> > > > Pavel
> > > >
> > > >
> > > >
> > > > On Thu, Jul 27, 2017 at 5:10 PM, Dmitriy Setrakyan <[email protected]>
> > > > wrote:
> > > >
> > > > > On Thu, Jul 27, 2017 at 9:04 AM, Igor Sapego <[email protected]> wrote:
> > > > >
> > > > > > Just a note from the platforms guy:
> > > > > >
> > > > > > A solution with table-level configuration is going to be significantly
> > > > > > harder to implement for platforms and ODBC than a field-level one.
> > > > > >
> > > > >
> > > > > Igor, it seems like you are advocating the per-cell configuration, not
> > > > > the per-field one. The per-field configuration can be defined at the
> > > > > table/cache level.
> > > > >
> > > > > I see your point about C++ and .NET integrations, however. Can't we provide
> > > > > this info at node-join time or table-creation time? This way all nodes will
> > > > > receive it and you will be able to grab it on different platforms.
> > > > >
> > > > >
> > > > > >
> > > > > > Also, what about binary objects that are not stored in a cache
> > > > > > but are still being marshalled?
> > > > > >
> > > > >
> > > > > I think the default system encoding should be used here. If we don't have
> > > > > configuration for default encoding, we should add it.
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Best Regards,
> > > > > > Igor
> > > > > >
> > > > > > On Wed, Jul 26, 2017 at 7:22 PM, Dmitriy Setrakyan <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > On Wed, Jul 26, 2017 at 3:40 AM, Vyacheslav Daradur <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > > Encoding must be set on a per-field basis. This will give us the most
> > > > > > > > > flexible solution at the cost of 1-byte overhead.
> > > > > > > >
> > > > > > > > > Vova, I agree that the encoding should be set on a per-field basis, but
> > > > > > > > > at the table level, not at a cell level.
> > > > > > > >
> > > > > > > > Dmitriy, Vladimir,
> > > > > > > > Let's use both approaches :-)
> > > > > > > > We can add a parameter to CacheConfiguration.
> > > > > > > > If the parameter specifies cache-level encoding, then the marshaller will
> > > > > > > > use the encoding configured for the cache;
> > > > > > > > otherwise the marshaller will use per-field encoding.
> > > > > > > > Of course, only if it doesn't complicate the solution.
> > > > > > > >
> > > > > > > >
> > > > > > > I think that it will complicate the solution and will complicate the
> > > > > > > marshalling protocol. The advantage of specifying the encoding at
> > > > > > > table/cache level is that we don't need to add extra encoding bytes to the
> > > > > > > marshalling protocol.
> > > > > > >
> > > > > > > I think Vova was suggesting encoding at the cell level, not at the field
> > > > > > > level, which seems to be redundant to me.
> > > > > > >
> > > > > > > Vova, do you agree?
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
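
For readers of this thread, the STRING_ENCODED proposal quoted above (the existing STRING type code stays as it is, and a new type code is followed by one extra byte naming the encoding) could look roughly like the sketch below. This is only an illustration and not Ignite code: the class name, the numeric type codes, the encoding id, and the little-endian int length prefix are all assumptions made for the example. It also assumes the JVM ships the windows-1251 charset, which the Java specification does not guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedStringSketch {
    // Placeholder type codes for this sketch only; not taken from GridBinaryMarshaller.
    static final byte STRING = 1;
    static final byte STRING_ENCODED = 2;

    // Hypothetical one-byte encoding id.
    static final byte ENC_WINDOWS_1251 = 1;

    // Maps a hypothetical encoding id to a charset.
    static Charset charsetFor(byte encId) {
        if (encId == ENC_WINDOWS_1251)
            return Charset.forName("windows-1251"); // assumed to be available on this JVM
        throw new IllegalArgumentException("Unknown encoding id: " + encId);
    }

    // Plain string: [type code][int length][UTF-8 bytes]. No extra byte is added.
    static byte[] writeUtf8(String s) {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + data.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(STRING).putInt(data.length).put(data);
        return buf.array();
    }

    // Encoded string: [type code][encoding id][int length][bytes in that encoding].
    static byte[] writeEncoded(String s, byte encId) {
        byte[] data = s.getBytes(charsetFor(encId));
        ByteBuffer buf = ByteBuffer.allocate(1 + 1 + 4 + data.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(STRING_ENCODED).put(encId).putInt(data.length).put(data);
        return buf.array();
    }

    // The reader picks the charset from the type code, so no metadata lookup is needed.
    static String read(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        byte type = buf.get();
        Charset cs;
        if (type == STRING)
            cs = StandardCharsets.UTF_8;
        else if (type == STRING_ENCODED)
            cs = charsetFor(buf.get());
        else
            throw new IllegalArgumentException("Unexpected type code: " + type);
        byte[] data = new byte[buf.getInt()];
        buf.get(data);
        return new String(data, cs);
    }

    public static void main(String[] args) {
        String ru = "\u041f\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        String en = "hello";
        System.out.println("ru UTF-8:        " + writeUtf8(ru).length + " bytes");
        System.out.println("ru windows-1251: " + writeEncoded(ru, ENC_WINDOWS_1251).length + " bytes");
        System.out.println("en UTF-8:        " + writeUtf8(en).length + " bytes");
        System.out.println("en windows-1251: " + writeEncoded(en, ENC_WINDOWS_1251).length + " bytes");
    }
}
```

With this layout the six-letter Russian word takes 17 bytes as a plain UTF-8 string and 12 bytes as an encoded string, while the ASCII word grows from 10 to 11 bytes. That matches the trade-off described in the thread: the extra byte only pays off when the chosen encoding is more compact than UTF-8 for the data at hand.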
