Re: [E] Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

Hyeonho Kim Tue, 03 Mar 2026 07:32:13 -0800

Hi all!

Unless there are objections, I propose the following:


   1. Introduce an opt-in UTF-8 validating SerDe for std::string (validation
      OFF by default).
   2. For AoS string items, enable UTF-8 validation at update() by default,
      with an explicit opt-out.

If this direction looks reasonable, I will proceed accordingly in the AoS
PR and follow up with a separate PR for the SerDe option.
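
To make the two items concrete, here is a minimal sketch of how they could fit together. It is an illustration under stated assumptions, not the code in either PR: the names is_valid_utf8, string_serde_sketch, utf8_policy, and check_update_item are hypothetical and are not part of the datasketches-cpp API, and the length-prefixed layout is a placeholder rather than the library's real serialized format. The validator is a plain scalar check following RFC 3629 (it rejects overlong forms, surrogates, and code points above U+10FFFF); a production version might use a faster SIMD-based routine instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>    // std::ostream (and std::ostringstream for callers)
#include <stdexcept>
#include <string>

// Hypothetical helper: true if s is well-formed UTF-8 per RFC 3629.
inline bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b = p[i];
    if (b < 0x80) { ++i; continue; }  // ASCII fast path
    std::size_t len = 0;              // total bytes in this sequence
    std::uint32_t cp = 0;             // decoded code point
    std::uint32_t min = 0;            // smallest code point allowed for len
    if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
    else return false;                // stray continuation or invalid lead byte
    if (n - i < len) return false;    // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (cp < min) return false;                      // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += len;
  }
  return true;
}

// Item 1 (hypothetical): a SerDe whose validation is OFF unless requested.
// The 4-byte native-endian length prefix is a placeholder layout.
struct string_serde_sketch {
  bool validate_utf8 = false;
  void serialize(std::ostream& os, const std::string& item) const {
    if (validate_utf8 && !is_valid_utf8(item)) {
      throw std::invalid_argument("serialize: item is not valid UTF-8");
    }
    const std::uint32_t len = static_cast<std::uint32_t>(item.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    os.write(item.data(), len);
  }
};

// Item 2 (hypothetical): update()-time check that is ON by default,
// with an explicit opt-out for callers who do not need portability.
enum class utf8_policy { validate, skip };

inline void check_update_item(const std::string& item,
                              utf8_policy policy = utf8_policy::validate) {
  if (policy == utf8_policy::validate && !is_valid_utf8(item)) {
    throw std::invalid_argument("update: item is not valid UTF-8");
  }
}
```

With these defaults, existing std::string users see no change at serialization time, while an AoS update() that calls check_update_item(item) fails fast on malformed input unless the caller passes utf8_policy::skip.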


Thanks,

Hyeonho

On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]> wrote:

> Thanks all for the feedback.
>
>
> We can preserve backward compatibility for existing C++ users while also
> providing a clear path for cross-language portability.
>
> What do you think about the following approach?
>
> - SerDe with string: Add an option to validate whether the string contains
> valid UTF-8 sequences. The default would be validation OFF to preserve
> existing compatibility.
>
> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast).
> Validation would be enabled by default, with an explicit opt-out for users
> who want it.
>
>
> For DS-Go, we can follow the same policy as C++.
>
>
> Feedback is welcome.
>
> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]> wrote:
>
>> Gonna agree with Alexander here. I think we should provide a serde option
>> for c++, but that we should not reject non-UTF-8 strings.
>>
>> That wouldn’t just be an API-breaking change. It would break
>> compatibility of c++ with itself for anyone who doesn’t need language
>> portability.
>>
>> A separate utf8_serde option gets my vote.
>>
>>   jon
>>
>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <
>> [email protected]> wrote:
>>
>>> Regarding C++, I would think that the easiest approach is to instruct
>>> the user to use a UTF8-validating string substitute instead of std::string.
>>> I am not sure whether we should provide such a thing or let the user
>>> come up with their own implementation.
>>> Consider having a utf8_string that would validate the input in the
>>> constructor but is otherwise identical to std::string.
>>> So the user can instantiate, for example,
>>> frequent_items_sketch<utf8_string> instead of
>>> frequent_items_sketch<std::string> if validation is necessary.
>>>
>>>
>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Thanks for the feedback. I agree that for container sketches that
>>>> retain and serialize strings, we should validate that string payloads are
>>>> valid UTF-8 sequences to preserve cross-language portability.
>>>>
>>>> On *where* to validate in DS-CPP: validating at update() (ingest time)
>>>> is attractive because it is fail-fast, but it also adds additional cost on
>>>> the hot path. If the community is comfortable with that overhead for
>>>> string-based container sketches, I’m happy to pursue the update()-time
>>>> validation approach.
>>>>
>>>> If performance sensitivity is a concern, an alternative would be to
>>>> always validate at (de)serialization boundaries (to guarantee artifact
>>>> correctness), and optionally provide a “fail-fast” mode that enables
>>>> validation at update() as well.
>>>>
>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>> simpler in implementation because it provides UTF-8 validation in the
>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>> dependency for the validator.
>>>>
>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>>>>
>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow a
>>>>> user to update the sketch with a string and the sketch also retains within
>>>>> the sketch a sample of the input strings seen. When serialized, there is
>>>>> an implicit assumption that another user, possibly in a different
>>>>> language, can successfully deserialize those sketch images. These
>>>>> sketches include KLL, REQ, Classic Quantiles, Sampling, FrequentItems,
>>>>> and Tuple. We informally call these "container" sketches, because they
>>>>> contain actual samples from the input stream.  HLL, Theta, CPC,
>>>>> BloomFilter, etc., are not container sketches.
>>>>>
>>>>> In the DS-Java library, all container sketches that allow strings
>>>>> always use UTF_8. So the sketch images produced will contain proper UTF_8
>>>>> sequences.
>>>>>
>>>>> In the DS-CPP library, all the various data types are abstracted via
>>>>> templates. The serialization operation is declared similar to
>>>>>
>>>>>
>>>>> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>>>>>
>>>>> where *T* is the item type, *os* is the output stream, and *sd* is the
>>>>> SerDe that performs the conversion to bytes.
>>>>>
>>>>>
>>>>> If the user wants to use an item of type string, *T* would typically
>>>>> be of type *std::string*, which is just a blob of bytes with no
>>>>> requirement that it be UTF_8.
>>>>>
>>>>>
>>>>> So far, we have trusted users of the library to know that if they
>>>>> update one of these container classes with a type *T*, the downstream
>>>>> user can successfully decode it. But a failure here could be
>>>>> catastrophic:  A downstream user of a sketch image could be separated from
>>>>> the creation of the sketch image by years and be using a different
>>>>> language.
>>>>>
>>>>> One of the big advantages of our DataSketches project is that our
>>>>> serialization images should be language and platform independent, allowing
>>>>> cross-language and cross platform interchange of sketches.
>>>>>
>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch images
>>>>> that contain strings, those strings must be UTF_8.
>>>>>
>>>>> So how do we implement that?  My thoughts are as follows:
>>>>>
>>>>>    1. We should document now on the website and in appropriate places
>>>>>    in the library the potential danger of not using UTF_8 strings. (At
>>>>>    least until we have a more robust solution.)
>>>>>    2. I think implementing validation checks on UTF_8 strings at the
>>>>>    SerDe boundaries may be too late.  A user could have processed a large
>>>>>    stream of data only to discover a failure at serialization time, which
>>>>>    could be much later in time.  The other possibility would be to
>>>>>    validate the strings at the input into the sketch, typically in the
>>>>>    *update()* method.
>>>>>    3. For C++, there are 3rd party libraries that specialize in UTF_8
>>>>>    validation, including ICU <https://github.com/unicode-org/icu>,
>>>>>    UTF8-CPP <https://github.com/nemtrif/utfcpp> and simdjson
>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>    validation, if done correctly, can be done very fast, with only a
>>>>>    small section of code.
>>>>>    4. I am not sure what the solutions are for Rust or Go.
>>>>>
>>>>> I welcome your feedback.
>>>>>
>>>>>
>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>>>>
>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>> deserializes String values.
>>>>>>
>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>
>>>>>> If it's std::string::String, then it must be of UTF-8 encoding. And
>>>>>> we check the encoding on deserialization.
>>>>>>
>>>>>> However, the Rust ecosystem also supports "strings" that do not use
>>>>>> UTF-8, such as BStr.
>>>>>>
>>>>>> So, my opinions are:
>>>>>>
>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able to
>>>>>> choose a proper type to deserialize the bytes into a type that doesn't
>>>>>> require UTF-8 encoding.
>>>>>>
>>>>>> Best,
>>>>>> tison.
>>>>>>
>>>>>>
>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in C++
>>>>>>> (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>> a broader design question came up that may affect multiple sketches.
>>>>>>>
>>>>>>> Based on my current understanding:
>>>>>>>
>>>>>>> - In datasketches-java, string serialization already produces valid
>>>>>>> UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So Java-generated
>>>>>>> artifacts already assume valid UTF-8 string encoding.
>>>>>>> - Rust and Python string types represent Unicode text and can be
>>>>>>> encoded to UTF-8. Please correct me if I am mistaken. (I don't know Rust
>>>>>>> and Python well)
>>>>>>> - In Go, string is a byte sequence and may contain invalid UTF-8
>>>>>>> unless explicitly validated. So during serialization, it may produce
>>>>>>> invalid UTF-8 sequences.
>>>>>>> - In C++, std::string is also a byte container and does not enforce
>>>>>>> UTF-8 validity. So during serialization, it may produce invalid UTF-8
>>>>>>> sequences.
>>>>>>>
>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>> corrections.
>>>>>>>
>>>>>>> If we want to maintain cross-language portability for serialized
>>>>>>> artifacts, one possible approach would be to ensure that any serialized
>>>>>>> string data is valid UTF-8. This could potentially apply to any sketches
>>>>>>> that serialize or deserialize string data.
>>>>>>>
>>>>>>> There seem to be several possible approaches:
>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>> - Document that input strings must be valid UTF-8 and rely on caller
>>>>>>> discipline
>>>>>>>
>>>>>>> At this point I am not proposing a specific solution. I would like
>>>>>>> to hear opinions from the community on whether we want to require
>>>>>>> serialized string data to be valid UTF-8 for cross-language portability.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Hyeonho
>>>>>>>
>>>>>>
