On Fri, 01 Jun 2012 11:31:13 -0700, Stas Malyshev wrote:
BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() and createTitleInstance() -- for
title casing breaks. These factories currently return
One thing I notice here is that with this API it is not possible to
programmatically choose what the iteration unit is - you'd have to do a
switch for that. Do you think it may be a good idea to have a generic
function that allows choosing the unit programmatically?
You can create a RuleBasedBreakIterator with any rules you choose. The
rules are basically a set of regular expressions; ICU has two matching
modes -- by default it tries the longest match, but it can also chain
together rules. There are rules to advance, to go back and to go to a
safe position from an arbitrary position in the two directions. The ICU
user guide to which I linked in the first e-mail has more details.
What is the notion of characters - is it grapheme characters? Is there
an option to iterate over code points too - not sure if it's useful, just
curious, as we used to have it in PHP 6 IIRC.
Yes, they are grapheme clusters. ICU has a special rule for Thai, but
from what I see in the tracker, it's obsolete with recent versions of Unicode
(possibly the root rule is now generic enough).
To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005
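To make the difference between the two units concrete, here is a minimal
sketch that does not use the proposed BreakIterator at all. It swaps in
PCRE instead: splitting on '//u' yields code points, while \X matches one
extended grapheme cluster. The sample string is an assumption chosen for
illustration, and PCRE's cluster rules are not guaranteed to agree with
ICU's in every case.

```php
<?php
// "e" followed by U+0301 (combining acute accent), then "a".
$text = "e\u{0301}a";

// Code points: split at every position between UTF-8 characters.
$codePoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// Grapheme clusters: in PCRE, \X matches one extended grapheme cluster.
preg_match_all('/\X/u', $text, $matches);
$graphemes = $matches[0];

// Prints: 3 code points, 2 grapheme clusters
echo count($codePoints), " code points, ", count($graphemes), " grapheme clusters\n";
```

The combining accent is a code point of its own, but it belongs to the
same user-perceived character as the preceding "e".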
About getAvailableLocales() - what does this actually do? Does it list
all available locales in the system, ones that have BreakIterator rules,
or something else? If it's not related to BI, I'm not sure we need to
have it in BI. What is the intended usage of it? Maybe it should be part
of the Locale class?
Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.
Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for
Doesn't it have a notion of current position? If so, key should be the
current position.
Will this BreakIterator be usable in foreach? I'm not sure I understand
it from this description - understanding this without any usage
examples, RFCs or code snippets for intended usage is really hard and I
think we should really start with doing that. I would expect this class
to work like this:
foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
    echo "Word number $i is $word\n";
}
or at least like this:
foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
    echo "Next word at position $i is: $word\n";
}
Is it the model? If not, I think we need to wrap the C API to make this
possible, because this is what people expect in PHP from the iterator.
My options here were: the BreakIterator mirrors the ICU homonym -- it
iterates over breaks, i.e., boundaries in the text. Hence, the iterator
returns the *positions* of the several boundaries. Therefore, the position
cannot also be used as the key.
Acknowledging that getting the text between the boundaries was going to
be a common scenario, I added a method, getPartsIterator(), that yields
the text between each boundary. Hence, there is one less element in this
iterator than in the BreakIterator.
Neither of the iterators implements getKey(), so when traversing them the
keys will be 0, 1, 2... It would probably be a good idea to change the
parts iterator to give the left boundary as the key. That way one could
do:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
foreach ($bi->getPartsIterator() as $k => $v) {
    echo "$v is at position $k\n";
}
instead of
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
$pos = $bi->first();
foreach ($bi->getPartsIterator() as $v) {
    echo "$v is at position $pos\n";
    $pos = $bi->current();
}
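The relationship between boundaries, parts and the suggested keys can be
sketched in plain PHP, without the extension. This is only an illustration
of the semantics, not the implementation: partsKeyedByLeftBoundary() is a
hypothetical helper, the boundary list is hand-written rather than
produced by a BreakIterator, and offsets are assumed to be byte offsets
into the string.

```php
<?php
/**
 * Hypothetical helper: given a text and an ordered list of boundary
 * offsets, yield each part keyed by its left boundary.
 * N boundaries produce N - 1 parts.
 */
function partsKeyedByLeftBoundary(string $text, array $boundaries): Generator
{
    $count = count($boundaries);
    for ($i = 0; $i + 1 < $count; $i++) {
        $left  = $boundaries[$i];
        $right = $boundaries[$i + 1];
        // The part is the text between two consecutive boundaries.
        yield $left => substr($text, $left, $right - $left);
    }
}

// Assumed boundaries for illustration only.
$text = "foo bar";
$boundaries = [0, 3, 4, 7];

foreach (partsKeyedByLeftBoundary($text, $boundaries) as $pos => $part) {
    echo "'$part' is at position $pos\n";
}
```

With four boundaries the loop runs three times, and each key is the
offset where the corresponding part starts.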
Another possibility would be to have the break iterator itself behave
as the parts iterator for iteration purposes. I don't think that is a
good idea. Even though BreakIterator does not implement Iterator, people
would expect next() and current() to return the next and current iterator
value, while they would be returning the iteration key.
By the way, you can look at the test cases in the tree on github for
examples:
https://github.com/cataphract/php-src/commit/d289c3977ed4ba8d9ba127e5af9f709b19b8e1ba
Thanks for the comments!
--
Gustavo Lopes