On Thu Dec 31 15:35:38 EET 2020, Nicolas George <geo...@nsup.org> wrote:

…For each simple type, including enumerations like AVColorRange and flat
structures like AVReplayGain, have a set of standardized functions for
common operations, including probably:

- printing;
- serializing to string;
- parsing from string;
…
These functions will have a standardized name and prototype. They will be
grouped in structures that describe a type entirely.

Note: this project requires a good unified string API.

This relates to one of FFmpeg's imperfections: it writes human-readable text to stdout and stderr in an unpredictable and inconsistent encoding. The output should be encoded 100% consistently. I suggest Unicode in the UTF-8 encoding form.

One place where FFmpeg's inconsistent encoding caused me a problem was when I was operating on a QuickTime video. FFmpeg (or perhaps ffprobe) printed a 4-byte QuickTime tag literally to stdout. The tag's byte sequence was not valid UTF-8, and it messed up the output. That tag, being arbitrary binary data, should have been escaped, printed in hex, or otherwise represented in valid UTF-8.
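To make the escaping idea concrete, here is a minimal sketch in C of how a 4-byte tag could be rendered as valid text. The function name escape_tag and its signature are hypothetical, not an existing FFmpeg API, and the \xNN escape syntax is only one possible convention. Printable ASCII is copied through; every other byte, and the backslash itself, is written as a hex escape, so the result is always pure ASCII and therefore valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (an assumption for illustration, not FFmpeg code).
 * Renders `len` bytes from `tag` into `out` as printable ASCII.
 * Printable ASCII bytes other than backslash are copied unchanged;
 * every other byte becomes a 4-character escape of the form \xNN.
 * Returns the number of characters written, not counting the
 * terminating NUL, or -1 if `out` is too small. */
static int escape_tag(const uint8_t *tag, size_t len,
                      char *out, size_t out_size)
{
    size_t pos = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t b = tag[i];

        if (b >= 0x20 && b <= 0x7E && b != '\\') {
            /* Need room for this character plus the final NUL. */
            if (pos + 1 >= out_size)
                return -1;
            out[pos++] = (char)b;
        } else {
            /* Need room for four escape characters plus the final NUL. */
            if (pos + 4 >= out_size)
                return -1;
            snprintf(out + pos, 5, "\\x%02X", (unsigned)b);
            pos += 4;
        }
    }

    if (pos >= out_size)
        return -1;
    out[pos] = '\0';
    return (int)pos;
}
```

With this sketch, an illustrative tag made of the bytes 0xA9 'n' 'a' 'm' comes out as the seven ASCII characters \xA9nam, which a terminal or log parser can consume safely whatever its encoding settings.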

I suggest that the type descriptor[1] and Unified string / stream API[2] proposals offer a good opportunity to define two separate data types: string of text, and stream of bytes. Define encode functions to transform text into bytes, and decode functions to transform bytes into text. The Python language's str, bytes, and codecs architecture[3] is a pretty good model.
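As a rough sketch of what the two-type split could look like in C, the code below defines separate types for bytes and text and a decode step that only succeeds on well-formed UTF-8. All names here (ByteSpan, TextSpan, utf8_is_valid, text_decode, text_encode) are hypothetical and exist only for illustration; they are not part of FFmpeg or of the proposals referenced above. The sketch assumes text is always UTF-8, so decoding is validation and encoding is a trivial reinterpretation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types: arbitrary binary data vs. validated UTF-8 text. */
typedef struct ByteSpan { const uint8_t *data; size_t len; } ByteSpan;
typedef struct TextSpan { const char *utf8; size_t len; } TextSpan;

/* Returns true if s[0..len) is well-formed UTF-8: no stray or missing
 * continuation bytes, no overlong forms, no surrogates, and no code
 * points above U+10FFFF. */
static bool utf8_is_valid(const uint8_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = s[i];
        size_t n;          /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded code point, smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return false; /* stray continuation byte or invalid lead byte */

        if (len - i <= n)
            return false;  /* sequence is truncated */
        for (size_t k = 1; k <= n; k++) {
            uint8_t c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += n + 1;
    }
    return true;
}

/* Decode: bytes -> text. Fails unless the bytes are valid UTF-8. */
static bool text_decode(ByteSpan in, TextSpan *out)
{
    if (!utf8_is_valid(in.data, in.len))
        return false;
    out->utf8 = (const char *)in.data;
    out->len = in.len;
    return true;
}

/* Encode: text -> bytes. Cannot fail, because text is already UTF-8. */
static ByteSpan text_encode(TextSpan in)
{
    ByteSpan out = { (const uint8_t *)in.utf8, in.len };
    return out;
}
```

The point of the design is that a function which prints or logs would accept only TextSpan, so the compiler forces callers holding a ByteSpan to go through text_decode (or an escaping routine) first. Supporting other encodings, as Python's codecs do, would mean adding an encoding parameter to both functions.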

I suggest that FFmpeg define that strings of text are always stored as UTF-8 code units. An argument could be made for allowing strings of text in any encoding, as long as every single string instance is clearly labelled with its text encoding. (Specifying that all text is in UTF-8 achieves clear labelling with no code.) I also suggest requiring that only validly encoded data be permitted in text strings.

FFmpeg code often operates on byte-granularity binary data. Such data should have data types distinct from "string", because it is not text.

FFmpeg generates human-readable output to stdout, to stderr, and to logs. I suggest that all this output be required to be text strings, preferably always in UTF-8. Any arbitrary binary data written to human-readable output must be encoded or escaped somehow, so that it is represented as valid text.

[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2020-December/274170.html
[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2020-December/274169.html
[3] https://docs.python.org/3/howto/unicode.html

This is an ambitious project. Good luck with it!
       --Jim DeLaHunt, Vancouver, Canada

