As for option #1, it's not that bad. We have already implemented a global encoding switch, and this resembles how a DBMS works: if the server operates with a certain encoding, then all clients must be configured to use the same encoding for strings to be processed correctly.
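To make the risk concrete, here is a minimal, self-contained Java sketch (plain java.nio.charset, no Ignite APIs; the class name and sample string are made up for illustration). It shows both failure modes: a client decoding with a charset different from the one the data was encoded with, and silent character substitution when a string is reencoded into a charset that cannot represent it:

    import java.nio.charset.Charset;

    public class EncodingMismatchDemo {
        public static void main(String[] args) {
            Charset utf8 = Charset.forName("UTF-8");
            Charset cp1251 = Charset.forName("windows-1251"); // a.k.a. Cp1251

            String original = "Grüße, Привет";

            // Failure mode 1: data encoded with one charset, read back by a
            // misconfigured client with another -> mojibake.
            byte[] stored = original.getBytes(utf8);
            System.out.println(new String(stored, cp1251));

            // Failure mode 2: reencoding is silently lossy. 'ü' and 'ß' have
            // no Cp1251 mapping, so getBytes() substitutes '?' without error.
            byte[] reencoded = original.getBytes(cp1251);
            System.out.println(new String(reencoded, cp1251)); // "Gr??e, Привет"
        }
    }

This is exactly why, under option #1, every client node, driver, and thin client must agree on the cluster-wide encoding, and why option #2 needs a well-defined policy for characters that are unmappable in the target encoding.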
Option #2 raises a number of questions. What are the performance implications of such hidden binary reencoding? Who will check for possible data loss during transparent reencoding (when an object travels between caches/fields with distinct encodings)? How should we handle nested binary objects? On the one hand, they should be reencoded in the way Vladimir described. On the other hand, a BinaryObject is an independent entity that can be serialized/deserialized freely, moved between various data structures, etc. It will be frustrating for a user to find its binary state changed after storing it in a grid, with possible data corruption.

As far as I can see, we are trying to couple orthogonal APIs: BinaryMarshaller, IgniteCache, and SQL. BinaryMarshaller is Java-datatype-driven: it creates a 1-to-1 mapping between Java types and their binary representations, and now we are trying to map two binary types (STRING and ENCODED_STRING) onto the single String class. IgniteCache is a much more flexible API than SQL, but it lacks the encoded-string datatype that exists in the SQL dialects of some RDBMSs: `varchar(n) character set some_charset`. It may not be a popular idea, but many of these problems could be solved by adding such a type. IgniteCache API users who don't need it simply won't use it, and it could become a bridge between the SQL and BinaryMarshaller encoded-string types.

2017-09-06 10:32 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:

> What we tried to achieve is that several encodings could co-exist in a
> single cluster or even a single cache. This would be great from a UX
> perspective. However, from what Andrey wrote, I understand that this would
> be pretty hard to achieve, as we rely heavily on identical binary
> representations of the objects being compared. That said, while this could
> work for SQL with some adjustments, we will have severe problems with
> BinaryObject.equals().
>
> Let's think about how we can resolve this. I see two options:
>
> 1) Allow only a single encoding in the whole cluster. Easy to implement,
> but very bad from a usability perspective. This would especially affect
> clients: client nodes and, what is worse, drivers and thin clients! They
> would all have to bother about which encoding to use. But maybe we can
> share this information during the handshake (as every client has a
> handshake).
>
> 2) Add a custom encoding flag/ID to the object header if a non-standard
> encoding appears somewhere inside the object (even in nested objects).
> This way we will be able to re-create the object when the expected and
> actual encodings don't match. For example, consider two caches/tables
> with different encodings (not implemented in the current iteration, but
> we may decide to implement per-cache encodings in the future, as
> virtually any RDBMS supports them). If I then move object A from cache 1
> with UTF-8 encoding to cache 2 with Cp1251 encoding, the encoding
> mismatch will be detected through the object header (or footer) and the
> object will be rebuilt transparently for the user.
>
> The second option is preferable to me as a long-term solution, but it
> would require more effort.
>
> Thoughts?

--
Best regards,
Andrey Kuznetsov.