Putting non-UTF-8 data in a `string` field should be understood as incorrect, 
but realistically it will work today as long as your messages are only ever 
handled by C++ Protobuf on the current release of protobuf, and only ever 
in the binary wire format (not textproto, JSON encoding, etc.).

Today, UTF-8 validation is enforced to different degrees in the different 
languages (and even depends on the syntax of the .proto file), but it's not 
semantically intended that a `string` field be used for non-UTF-8 data in 
any language. You should assume that a serialized message with a 
map<string, ?> whose keys are not valid UTF-8 may start to fail parsing in 
some future release of Protobuf.

Unfortunately bytes as a map key isn't allowed due to obscure technical 
concerns related to some non-C++ languages and the JSON representation, and 
we don't have an immediate plan to relax that.

Realistically your options are:
- Keep doing what you're doing: only ever keep these messages in C++ and 
binary wire encoding, ignore the warnings, and accept that it might stop 
working in a future release of protobuf.
- Make your key data valid UTF-8 strings instead (e.g., use a base64 
encoding of the digest instead of the raw digest bytes).
- Use a repeated field of a message with a key field and a value field 
instead of a map, and use your own struct as the in-memory representation 
when processing (move the data into and out of an STL map at the 
parse/serialization boundaries instead).
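
For what it's worth, here is a rough sketch of the last two options in plain 
C++. Nothing here uses the protobuf API: `MyObject` and `TxnEntry` are 
hypothetical stand-ins for whatever your generated classes look like 
(assuming a schema along the lines of `message TxnEntry { bytes digest = 1; 
MyObject value = 2; }` with `repeated TxnEntry txns = 1`), and 
`Base64Encode`, `ToMap` and `ToEntries` are names I made up for 
illustration. Treat it as a starting point rather than a drop-in solution.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Option 2: turn raw digest bytes into an ASCII (and therefore valid UTF-8)
// map key. Standard base64 alphabet with '=' padding.
std::string Base64Encode(const std::string& raw) {
  static const char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve((raw.size() + 2) / 3 * 4);
  std::size_t i = 0;
  // Full 3-byte groups become 4 output characters each.
  for (; i + 3 <= raw.size(); i += 3) {
    std::uint32_t n = (static_cast<unsigned char>(raw[i]) << 16) |
                      (static_cast<unsigned char>(raw[i + 1]) << 8) |
                      static_cast<unsigned char>(raw[i + 2]);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += kAlphabet[n & 63];
  }
  // A trailing 1 or 2 bytes are padded with '='.
  const std::size_t rem = raw.size() - i;
  if (rem == 1) {
    std::uint32_t n = static_cast<unsigned char>(raw[i]) << 16;
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += "==";
  } else if (rem == 2) {
    std::uint32_t n = (static_cast<unsigned char>(raw[i]) << 16) |
                      (static_cast<unsigned char>(raw[i + 1]) << 8);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += '=';
  }
  return out;
}

// Option 3: hypothetical stand-ins for the generated message classes.
struct MyObject {
  int payload;
};
struct TxnEntry {
  std::string digest;  // raw bytes; would be a `bytes` field in the .proto
  MyObject value;
};

// After parsing: build the STL map the rest of the code works with.
// If a digest appears more than once, the last entry wins.
std::map<std::string, MyObject> ToMap(const std::vector<TxnEntry>& entries) {
  std::map<std::string, MyObject> m;
  for (const TxnEntry& e : entries) m[e.digest] = e.value;
  return m;
}

// Before serializing: flatten the map back into repeated entries.
std::vector<TxnEntry> ToEntries(const std::map<std::string, MyObject>& m) {
  std::vector<TxnEntry> entries;
  entries.reserve(m.size());
  for (const auto& kv : m) entries.push_back(TxnEntry{kv.first, kv.second});
  return entries;
}
```

With the base64 approach the existing `map<string, MyObject>` field can stay 
as it is, at the cost of keys that are about a third longer. With the 
repeated-field approach the digests stay raw, but the .proto changes and the 
conversion has to happen wherever you parse or serialize.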

Sorry there's not a more trivial fix available for this use case!

On Thursday, September 5, 2024 at 5:03:03 PM UTC-4 [email protected] wrote:

> Hi,
>
> I've been using protobuf 3.5.1 in c++ and am using a message type with the 
> following map type: `map<string, MyObject> txns = 1`
>
> It is my understanding that `string` and `bytes` are the same in proto 
> c++; for maps however one can only use `string` as keys. I'm using the key 
> field to send around transaction digests which are byte strings consisting 
> of cryptographic hashes. As far as I can tell, it makes no difference 
> whether I use strings/bytes (the decoding works), yet I keep getting the 
> error:
>  
>  `String field 'pequinstore.proto.MergedSnapshot.MergedTxnsEntry.key' 
> contains invalid UTF-8 data when serializing a protocol buffer. Use the 
> 'bytes' type if you intend to send raw bytes.`
>
> I understand the error is complaining about my digests possibly not being 
> UTF-8, but I'm unsure if I actually need to be concerned about it; I have 
> not noticed any problems with parsing. Is there a way to suppress this 
> error?
>
> Or, if this is a serious error that could lead to non-deterministic 
> behavior, do you have a suggested workaround? There is a lot of existing 
> code that uses the map structure akin to an STL map, so I'd like to avoid 
> re-factoring the protobuf into a repeated field if possible. 
>
> Thanks,
> Florian
>
