On duodi 2 Vendémiaire, year CCXXIV, James Darnley wrote:
> It is not supposed to replace any invalid bytes with a "random"
> character. That sounds like it will only make the problem worse with
> that lossy removal of data. This is trying to fix incorrect
> interpretation of bytes.
>
> This feature is to transform bytes into other bytes which when
> interpreted and displayed the correct text appears on screen.
>
> I will detail my exact use case at the bottom.

Indeed, but a feature like that must be designed with all reasonable use
cases in mind, and replacing isolated invalid octet values in the input
with a fixed replacement character is a common practice, totally acceptable
when dealing with large amounts of data. I am not saying that it must be
enabled by default, but it needs to be an option.

> What do you mean? You need at least two encodings to be given to perform
> a conversion. Two things become a list.

Yes, but the code to handle more than two is much more complicated than it
would have been with just two elements, i.e. only one conversion.

> It might not be very good but it is (void*) and NULL if you don't use
> the feature.

Yes, I understood that, but this is fragile code.

> It shouldn't. This function receives buf2_len as equal to BUF_LEN - 1
> which means that iconv can only advance buf2 to buf2 + BUF_LEN - 1 which
> will let us write 0 into the last byte.

In that case, it should be written very explicitly in the function
documentation, otherwise someone may change the code and break the
assumption.

Also, I notice another flaw in that code: it uses '\0' as the string
terminator. Text in non-ASCII encodings can contain '\0'. The end of the
text must be handled differently, with an explicit size argument or maybe a
pointer to the end.

> I won't send another patch for a little while. I will see how your API
> proposal plays out.
>
> And now for my tale.
>
> I wanted ffmpeg to turn the string at [1] into the string at [3]. [1],
> with the exact hex bytes at [2], is artist tag out of a Flac file. Flac
> files have Vorbis Comment metadata tags. They are UTF-8 only. If a
> program puts incorrect data in there how will any other program know how
> to display it? What's worse is when the data gets converted twice.

Indeed. All modern formats should specify UTF-8 for all strings; there is
absolutely no valid reason to do otherwise nowadays.

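Coming back to the '\0' terminator issue raised above, here is a rough
sketch of what an interface with explicit sizes could look like. It is only
an illustration: convert_buf() and its parameters are hypothetical names,
not FFmpeg API and not the code from the patch under review. It assumes a
POSIX iconv() and treats any conversion error, including a too-small output
buffer, as a plain failure.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of FFmpeg or of the patch under review):
 * convert in[0 .. in_len) from encoding `from` to encoding `to`.
 * At most out_cap bytes are written to out, and *out_len receives the
 * number of bytes actually written. No NUL terminator is read or written.
 * Returns 0 on success, -1 on failure. */
int convert_buf(const char *from, const char *to,
                const char *in, size_t in_len,
                char *out, size_t out_cap, size_t *out_len)
{
    assert(in && out && out_len);

    iconv_t cd = iconv_open(to, from);   /* note the order: to, from */
    if (cd == (iconv_t)-1)
        return -1;

    /* iconv() wants char **, but it only reads from the input buffer. */
    char  *inp      = (char *)in;
    char  *outp     = out;
    size_t in_left  = in_len;
    size_t out_left = out_cap;
    int    ret      = -1;

    /* The first call converts; the second call (NULL input) flushes any
     * pending shift state for stateful encodings. */
    if (iconv(cd, &inp, &in_left, &outp, &out_left) != (size_t)-1 &&
        iconv(cd, NULL, NULL, &outp, &out_left)     != (size_t)-1) {
        *out_len = out_cap - out_left;
        ret = 0;
    }

    iconv_close(cd);
    return ret;
}
```

For instance, the four bytes 41 00 E9 00 ("Aé" in UCS-2LE) passed with
in_len = 4 should come out as the three UTF-8 bytes 41 C3 A9, whereas
strlen() on that input would stop after the first byte.
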
> This specific case was to convert CP1252 to UTF-8 to GBK -- that is to
> interpret the input bytes as the CP1252 encoding and convert them to
> UTF-8, then take those bytes and convert them to GBK. I added the code
> needed to take an argument in the form
>
> "CP1252,UTF-8,GBK"
> parse it into separate encodings, open two iconv contexts, and finally
> perform the conversion.

I cannot reproduce your conversion, but there is something that bugs me in
your reasoning.

With any sane iconv implementation, converting from A to B then from B to C
will either fail if B is too poor to encode the input, or succeed and then
give the exact same result as converting directly from A to C.

As for chaining a conversion from A to B then from C to D with B != C, this
is braindead, and if FFmpeg were to support it at all, it should only be
with a very explicit request from the user to perform a dangerously lossy
conversion.

I do not understand what you were trying to achieve with your
"CP1252,UTF-8,GBK" conversion. First, why finish with GBK instead of sane
UTF-8? And second, if the output is really wanted in GBK, then what is the
purpose of the intermediate UTF-8 step?

IMHO, the only sane way of dealing with text is to handle everything in
UTF-8 internally: demuxers and anything that deals with input should
convert as soon as the encoding is known, muxers and anything that deals
with output should convert as late as possible. (We do not do that for
audio and video because conversions are expensive, but text conversions are
cheap and negligible compared to video and audio processing.)

Regards,

--
  Nicolas George
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel