Jeff Clites <[EMAIL PROTECTED]> wrote:
> On Apr 12, 2004, at 9:54 AM, Leopold Toetsch (via RT) wrote:
> It looks very similar to what I had come up with. The only important
> differences are:

> 1) My version handles the case of code points > 0xFFFF as well. (The
> string_append_chr function encapsulates the logic of dealing with the
> "anything above 0xFF" case, but needs to be rewritten to improve
> efficiency.)

Yep. Using the string_append_chr() function for setting chars/UChars in
an existing buffer is overkill. It allocates a new string for each char.
We have a maximum length for the 1/2/4 byte encodings. Unescaping
doesn't create longer strings, so we can always safely fill an existing
buffer (given that it's upscaled beforehand if needed).

Anyway. We'll need 2 versions of unescape: one with ICU/Unicode and one
without. The latter will only deal with chars <= 0xff.

BTW we'll need a "not a STRING" encoding too. We need some means for
transparently handling e.g. frozen bytecode. We must ensure that such a
frozen image goes in and out unaltered.

> 2) When I was implementing the previous version of
> string_unescape_cstring, I'm pretty sure I had a reason for doing that
> string_constant_copy at the end, rather than creating a constant string
> at the beginning. I'm not recalling 100% why, but I believe that there
> were problems in the case where the string has to expand its storage
> because there are characters > 0xFF, if had been created as a constant.

No problem with growing here. "constant" here just means that the string
is allocated in the constant string header pool. The only difference is
that this pool isn't scanned for dead strings during the collect phase
of DOD.

The reason might be that currently the only usage of string_unescape is
from inside imcc/pbc, where constant strings are generated for the
constant table. This usage of the function is a bit special. So we might
pass in 2 more parameters to string_unescape:

  flags   ... PObj_constant_FLAG yes/no
  "uconv" ... e.g. "iso-8859-15" or what not

I currently have a modified version of string_unescape that can deal (or
should finally, if all bugs are gone ;) with input strings like:

  "¤"    # currency sign, but when seen as a latin9 character
         # it's the euro sign

PASM/PIR syntax could be something like:

  :iso-8859-15:"a string ¤"

> Just a tiny note:

> instead of this:

>     result->bufused = d * (had_int16 ? 2 : 1);

> you can do this:

>     result->bufused = string_max_bytes(interpreter, result,
>                                        result->strlen);

Yep. Thanks.

> I'm attaching a patch which contains the version I had written, and
> also includes my changes from [perl #28473], which I didn't see make it
> to the list. Take a look, and you can probably take the best parts of
> both--I'm sure there are a few places where your version is more
> efficient. (Also, I have the couple of bits which call directly into
> the ICU API factored out into string_primitives.c)

I'll merge the relevant bits.

> BTW, I have some benchmarks that I will clean up and send in to go with
> your tests.

Good. Thanks.

> JEff

leo