[PHP-DEV] RE: str_pad clarification - Re: [PHP-DEV] PHP Unicode strings impl proposal

Tex Texin Mon, 29 Aug 2005 05:52:36 -0700

(resending- I got the list address wrong.)
OK, here is how I think about these things.

I would try to keep the intent of each function in mind, not just its literal
definition, when extending it from the 1 byte per character model to the
Unicode UTF-8 model of m bytes per n characters per grapheme.

The literal approach is to say that str_pad works with whole graphemes (a base
character plus any following combining characters) and tries to fit the
string into the requested length (which we could take to be in either
character or byte units, as you prefer). Looking at some of the str_pad
examples, I think we should make this a character count. People are trying
to use this for drawing and positioning, not to fit into an exact number of
bytes. (I could be wrong.)

So str_pad would prepend and append the pad string, then trim characters
while checking for combining characters, and then, if the resulting length is
not what the user requested, do some other work to make it the exact
length.
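
To make the cost of the literal approach concrete, here is a minimal sketch
of the trimming step. It is written in Python only because Python strings are
indexed by code point; trim_to_graphemes is a hypothetical helper, and the
combining-class test is a rough stand-in for real grapheme segmentation (some
combining marks have class 0 and would be missed).

```python
import unicodedata

def trim_to_graphemes(text: str, limit: int) -> str:
    """Cut text to at most `limit` code points without separating a base
    character from the combining marks that follow it (approximation)."""
    if len(text) <= limit:
        return text
    cut = limit
    # If the first dropped code point is a combining mark, the cut falls
    # inside a sequence: back up until the cut sits before the base character.
    while cut > 0 and unicodedata.combining(text[cut]) != 0:
        cut -= 1
    return text[:cut]
```

Trimming "ae\u0301x" to 2 code points returns just "a": the whole "e" plus
accent sequence is dropped, so the result is shorter than requested. That
shortfall is the "other work" needed to reach the exact length.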

But let's be realistic. 99.99% of the time, users will be padding with
spaces, or dashes, or something very simple. Checking and trimming combining
characters and making the length exact would take development and testing
effort and would slow the function down, and almost no one would take
advantage of it.

Also, in a Unicode GUI world, the general case for this function is of
limited use. If the user is trying to create a string of a certain length on
the display, they can't count characters and expect the number of columns
to be proportional. Some characters don't display, some characters are twice
as wide, some characters combine with a glyph both in front of AND behind
the base character... If you are trying to match a particular byte length by
padding the string out, then there isn't much reason to use a value other
than a space or some other single-byte character. If the user is padding
both left and right to center the string, it also doesn't make sense to use
more than a simple character like a space (or small groups of simple
characters).
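
A small illustration of why character counts and display columns diverge,
again sketched in Python. display_columns is a hypothetical helper that
estimates columns for a fixed-width terminal only; it is an assumption-laden
approximation and says nothing about proportional GUI fonts.

```python
import unicodedata

def display_columns(text: str) -> int:
    """Rough estimate of fixed-width display columns for text."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            continue  # a combining mark is drawn on its base character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2  # wide and fullwidth characters take two cells
        else:
            columns += 1
    return columns
```

Here "e\u0301" is 2 characters but 1 column, while "日本" is 2 characters
but 4 columns, so padding to a character count does not align either of them.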

Recognize also that if str_pad doesn't do what the user wants, they have
the option to use any funky characters they want and write some simple code
to prepend and append strings and truncate however they like.

So for reasons of performance, simplicity, and practicality, I would say
str_pad should work as follows:
1) Of course, surrogate pairs must not be broken up.
2) The pad string can have combining characters.
3) The length the user specifies should be a character count.
4) The padded string can be truncated to the user's requested character
length. It will be trimmed from the right one Unicode character (code point;
not grapheme, not byte) at a time until the length limit is met. (So a
combining character is one character for this purpose.)
5) The algorithm will be documented clearly and the user will be warned that
if there isn't enough room, diacritics or other combining characters in the
pad string might be trimmed, leaving just a base character that looks
different.
6) If the string to be padded is already longer than the limit, no padding
is done, and the original string is NOT truncated.
7) For 99.99% of users there will be no problem with this and little
overhead. The one person who wants something more complex can easily
implement whatever he/she needs on his/her own. Knowing the pad string,
his/her algorithm is likely to be more efficient than our general-purpose one
would have been.
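
The rules above can be sketched roughly as follows. This is an illustration
in Python, not the PHP implementation: str_pad_sketch and its string-valued
pad_type are hypothetical names, and Python strings are already sequences of
code points, so rule 1 (surrogates) holds automatically here.

```python
def str_pad_sketch(text: str, length: int, pad: str = " ",
                   pad_type: str = "right") -> str:
    """Pad text to `length` code points, following rules 1-6 above."""
    if pad == "":
        raise ValueError("pad string must not be empty")
    if pad_type not in ("left", "right", "both"):
        raise ValueError("pad_type must be 'left', 'right' or 'both'")

    missing = length - len(text)      # rule 3: lengths are character counts
    if missing <= 0:
        return text                   # rule 6: never truncate the input

    if pad_type == "left":
        left = missing
    elif pad_type == "both":
        left = missing // 2           # any odd extra character goes right
    else:
        left = 0
    right = missing - left

    def fill(count: int) -> str:
        # Repeat the pad string, then drop whole code points from the
        # right until it fits (rule 4). Combining marks count as one
        # character each, so they may be cut off (rule 5).
        repeats = -(-count // len(pad))   # ceiling division
        return (pad * repeats)[:count]

    return fill(left) + text + fill(right)
```

For example, str_pad_sketch("ab", 5, "e\u0301") yields "abe\u0301e": the
result is exactly 5 characters, but the final "e" has lost its accent, which
is the case rule 5 warns about.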

Sound ok?

Tex Texin
Internationalization Architect,   Yahoo! Inc.

> -----Original Message-----
> From: Rolland Santimano [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 29, 2005 1:37 AM
> To: Tex Texin
> Subject: str_pad clarification - Re: [PHP-DEV] PHP Unicode 
> strings impl proposal
> 
> 
> --- Andrei Zmievski <[EMAIL PROTECTED]> wrote:
> > > [5] string str_pad(string text, int length[, string pad[, int
> > > pad_type]])
> > > Returns input string padded on the left and/or right (determined by
> > > pad_type) to specified length with pad string.
> > > Impl:
> > > The impl builds the output string by copying appropriate pad
> > > characters to the left and/or right of the input string.
> > >
> > > Q: With STR_PAD_BOTH, lets say 'length' == input 'text' length + 2
> > > (lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then
> > > the 'pad' text can't be added at either end. More generally, the
> > > 'pad' text can't be split in the middle of non-BMP codepts or
> > > base+combining sequences. If such a condn occurs, an error should be
> > > returned. Any other thoughts ?
> > 
> > We should not split padding strings in the middle of surrogate pairs.
> > As for combining sequences, I would defer to Tex and see what he has
> > to say. The input length parameter should indicate the number of
> > codepoints to pad to, not the number of UChars.
> 
> The non-Unicode impl simply truncates the pad string when the
> requested length can't hold input + pad strings. In such a 
> truncation, if we decide to leave out the codept / 
> base+combining sequence being split, then we would have to
> either pad the rest with whitespace or return a string which
> is smaller than the requested length.
> 
> Any other thoughts ?
> 
> 
> 

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
