Using non-UTF-8 data in a `string` field should be understood as incorrect. Realistically, it will work today as long as your messages are only ever handled by C++ Protobuf on the current release of protobuf, and only ever in the binary wire format (not textproto, JSON encoding, etc.).
Today, enforcement against malformed UTF-8 exists to different degrees in the different languages (and even depends on the syntax of the .proto file), but it's not semantically intended that a `string` field be used for non-UTF-8 data in any language. It should be assumed that a serialized message with a map<string, ?> whose keys are non-UTF-8 may start to fail parsing in some future release of Protobuf.

Unfortunately, `bytes` as a map key isn't allowed due to obscure technical concerns related to some non-C++ languages and the JSON representation, and we don't have an immediate plan to relax that.

Realistically your options are:

- Keep doing what you're doing: only ever keep these messages in C++ and the binary wire encoding, ignore the warnings, and know that it might stop working in a future release of protobuf.
- Make your key data valid UTF-8 strings instead (e.g., use a base64 encoding of the digest instead of the raw digest bytes).
- Use a repeated field of a message with a key field and a value field instead of a map, and use your own struct as the in-memory representation when processing (move the data into/out of an STL map at the parse/serialization boundaries instead).

Sorry there's not a more trivial fix available for this use case!

On Thursday, September 5, 2024 at 5:03:03 PM UTC-4 [email protected] wrote:

> Hi,
>
> I've been using protobuf 3.5.1 in C++ and am using a message type with the
> following map type: `map<string, MyObject> txns = 1`
>
> It is my understanding that `string` and `bytes` are the same in proto
> C++; for maps, however, one can only use `string` as keys. I'm using the key
> field to send around transaction digests, which are byte strings consisting
> of cryptographic hashes. As far as I can tell, it makes no difference
> whether I use strings/bytes (the decoding works), yet I keep getting the
> error:
>
> `String field 'pequinstore.proto.MergedSnapshot.MergedTxnsEntry.key'
> contains invalid UTF-8 data when serializing a protocol buffer.
> Use the 'bytes' type if you intend to send raw bytes.`
>
> I understand the error is complaining about my digests possibly not being
> UTF-8, but I'm unsure if I actually need to be concerned about it; I have
> not noticed any problems with parsing. Is there a way to suppress this
> error?
>
> Or, if this is a serious error that could lead to non-deterministic
> behavior, do you have a suggested workaround? There is a lot of existing
> code that uses the map structure akin to an STL map, so I'd like to avoid
> refactoring the protobuf into a repeated field if possible.
>
> Thanks,
> Florian

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/protobuf/fb062f2d-fdde-40ad-9eb9-a7717df0d6afn%40googlegroups.com.
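
For illustration, the second and third options from the reply (text-encoded keys, and a repeated key/value message converted to an STL map at the parse/serialization boundaries) could look roughly like the sketch below. This is only a sketch under stated assumptions: it uses hex rather than base64 purely to avoid a library dependency, and `TxnEntry`, `MergedSnapshot`, `MyObject`, `HexEncode`, `HexDecode`, `ToMap`, and `FromMap` are hypothetical names. The structs are plain C++ stand-ins, not real protobuf-generated classes; with generated code the same loops would go through the generated accessors instead.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Option 2: encode the raw digest as lowercase hex so the map key is plain
// ASCII, and therefore valid UTF-8. (The reply suggests base64; hex is used
// here only because it needs no extra library.)
std::string HexEncode(const std::string& raw) {
  static const char kDigits[] = "0123456789abcdef";
  std::string out;
  out.reserve(raw.size() * 2);
  for (unsigned char c : raw) {
    out.push_back(kDigits[c >> 4]);
    out.push_back(kDigits[c & 0x0F]);
  }
  return out;
}

// Inverse of HexEncode. Returns false (leaving *raw untouched) on odd length
// or on any non-hex character.
bool HexDecode(const std::string& hex, std::string* raw) {
  if (hex.size() % 2 != 0) return false;
  auto nibble = [](char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
  };
  std::string out;
  out.reserve(hex.size() / 2);
  for (std::size_t i = 0; i < hex.size(); i += 2) {
    const int hi = nibble(hex[i]);
    const int lo = nibble(hex[i + 1]);
    if (hi < 0 || lo < 0) return false;
    out.push_back(static_cast<char>((hi << 4) | lo));
  }
  *raw = std::move(out);
  return true;
}

// Option 3: hypothetical stand-ins for classes generated from a schema like
//   message TxnEntry       { bytes digest = 1; MyObject value = 2; }
//   message MergedSnapshot { repeated TxnEntry txns = 1; }
struct MyObject { int payload = 0; };  // placeholder for the real value type
struct TxnEntry { std::string digest; MyObject value; };
struct MergedSnapshot { std::vector<TxnEntry> txns; };

// After parsing: move the repeated entries into an STL map keyed by the raw
// digest. If a digest appears twice, the later entry overwrites the earlier.
std::map<std::string, MyObject> ToMap(const MergedSnapshot& msg) {
  std::map<std::string, MyObject> out;
  for (const TxnEntry& entry : msg.txns) out[entry.digest] = entry.value;
  return out;
}

// Before serializing: flatten the STL map back into repeated entries.
MergedSnapshot FromMap(const std::map<std::string, MyObject>& m) {
  MergedSnapshot msg;
  msg.txns.reserve(m.size());
  for (const auto& kv : m) msg.txns.push_back(TxnEntry{kv.first, kv.second});
  return msg;
}
```

With the hex approach the existing `map<string, MyObject>` field can stay as it is, at the cost of keys that are twice as long and an encode/decode step wherever a raw digest is needed. The repeated-entry approach keeps raw bytes on the wire but confines the map-like interface to the application's own `std::map`.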
