(Resending; I got the list address wrong.)

OK, here is how I think about these things.
I would try to keep the intent of these functions in mind, not just their literal definition, when extending them from the 1 byte/character model to the Unicode UTF-8 model of m bytes/n characters/grapheme.

The literal approach is to say that str_pad works with whole graphemes (a base character plus any following combining characters) and tries to fit the string into the requested length (which we could take to be in either character or byte units, as you prefer). Looking at some of the str_pad examples, I think we should make this a character count. People are trying to use this for drawing and positioning, not to fit into an exact number of bytes. (I could be wrong.) So str_pad would prepend and append the pad string, then trim characters, checking for combining characters, and then, if the resulting length is not what the user requested, do some other work to make it the exact length.

But let's be realistic. 99.99% of the time, users will be padding with spaces, or dashes, or something very simple. Checking and trimming combining characters and making the length exact would take development and testing effort, it would slow the function down, and almost no one would take advantage of it.

Also, in a Unicode GUI world, the general case for this function is of limited use. If the user is trying to create a string of a certain length on the display, they can't count characters and expect the number of columns to be proportional. Some characters don't display, some characters are twice as wide, some characters combine with a glyph both in front of AND behind the base character...

If you are trying to match a particular byte length by padding the string out, then there isn't much reason to use a value other than a space or some other single-byte character. If the user is padding both left and right to center the string, it also doesn't make sense to use more than simple characters like space (or small groups of simple characters).
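
To make the difference between these units concrete, here is a small sketch in Python (used only because its strings are sequences of code points; it is not a proposal for the PHP implementation). The helper name describe is hypothetical, and the column estimate is a rough assumption based on combining class and East Asian width, not a real rendering measurement.

```python
import unicodedata

def describe(s):
    """Return (UTF-8 byte count, code point count, rough column estimate)."""
    utf8_bytes = len(s.encode("utf-8"))
    code_points = len(s)
    columns = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks normally take no column of their own
        # East Asian Wide/Fullwidth characters usually occupy two columns
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return utf8_bytes, code_points, columns

print(describe("abc"))           # (3, 3, 3): plain ASCII, all units agree
print(describe("e\u0301"))       # (3, 2, 1): 'e' + COMBINING ACUTE ACCENT
print(describe("\u65e5\u672c"))  # (6, 2, 4): two CJK ideographs
```

The three counts agree only for the ASCII string, which is why a character count cannot be relied on for column alignment.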
Recognize also that if str_pad doesn't do what the user wants, they have the option to use any funky characters they want and write some simple code to prepend and append strings and truncate however they like.

So for reasons of performance, simplicity, and practicality, I would say str_pad should work as follows:

1) Of course, surrogates must not be broken up.
2) The pad string can have combining characters.
3) The length the user specifies should be a character count.
4) The padded string can be truncated to the user's requested character length. It will be trimmed from the right one Unicode UTF-8 character (not grapheme, not byte) at a time until the length limit is met. (So a combining character is one character for this purpose.)
5) The algorithm will be documented clearly, and the user will be warned that if there isn't enough room, diacritics or other combining characters in the pad string might be trimmed, leaving just a base character that looks different.
6) If the string to be padded is already longer than the limit, no padding is done, and the original string is NOT truncated.
7) For 99.99% of users there will be no problem with this and little overhead. The one person who wants something more complex can easily implement whatever he or she needs. Knowing the pad string, his or her algorithm is likely to be more efficient than our general-purpose one would have been.

Sound ok?

Tex Texin
Internationalization Architect, Yahoo! Inc.

> -----Original Message-----
> From: Rolland Santimano [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 29, 2005 1:37 AM
> To: Tex Texin
> Subject: str_pad clarification - Re: [PHP-DEV] PHP Unicode strings impl proposal
>
> --- Andrei Zmievski <[EMAIL PROTECTED]> wrote:
> > > [5] string str_pad(string text, int length[, string pad[, int pad_type]])
> > > Returns input string padded on the left and/or right (determined by pad_type) to specified length with pad string.
> > > Impl:
> > > The impl builds the output string by copying appropriate pad characters to the left and/or right of the input string.
> > >
> > > Q: With STR_PAD_BOTH, lets say 'length' == input 'text' length + 2 (lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then the 'pad' text can't be added at either end. More generally, the 'pad' text can't be split in the middle of non-BMP codepts or base+combining sequences. If such a condn occurs, an error should be returned. Any other thoughts ?
> >
> > We should not split padding strings in the middle of surrogate pairs. As for combining sequences, I would defer to Tex and see what he has to say. The input length parameter should indicate the number of codepoints to pad to, not the number of UChars.
>
> The non-Unicode impl simply truncates the pad string when the requested length can't hold input + pad strings. In such a truncation, if we decide to leave out the codept / base+combining sequence being split, then we would have to either pad the rest with whitespace or return a string which is smaller than the requested length.
>
> Any other thoughts ?

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
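
For illustration, points 1 to 6 in my reply at the top of this message can be sketched as follows. This is a minimal sketch in Python rather than PHP, chosen only because Python strings are indexed by code point, so a surrogate pair can never be split (point 1). The name pad_code_points and its parameters are hypothetical, and the left/right split used when padding both sides is my assumption, not something settled in this thread.

```python
def pad_code_points(text, length, pad=" ", side="right"):
    """Pad text to `length` code points; side is 'left', 'right' or 'both'."""
    if pad == "":
        raise ValueError("pad string must not be empty")
    missing = length - len(text)  # point 3: lengths are code point counts
    if missing <= 0:
        return text  # point 6: already long enough, never truncate the input

    def fill(n):
        # Repeat the pad string and cut it after n code points (point 4).
        # A trailing base character may lose its combining marks (point 5).
        return (pad * (n // len(pad) + 1))[:n]

    if side == "left":
        return fill(missing) + text
    if side == "both":
        left = missing // 2  # assumption: any odd extra goes on the right
        return fill(left) + text + fill(missing - left)
    if side == "right":
        return text + fill(missing)
    raise ValueError("side must be 'left', 'right' or 'both'")

print(pad_code_points("abc", 6, "-"))          # abc---
print(pad_code_points("ab", 5, "*", "both"))   # *ab**
print(pad_code_points("abcdef", 3))            # abcdef (not truncated)
# The pad 'e' + combining acute is cut mid-sequence: the final 'e' is bare.
print(pad_code_points("abc", 6, "e\u0301") == "abce\u0301e")  # True
```

The last call shows the case the documentation would need to warn about: the result has exactly the requested number of code points, but the final pad character has lost its accent.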