On Fri, 01 Jun 2012 11:31:13 -0700, Stas Malyshev wrote:

BreakIterator also exposes other native methods:
getAvailableLocales(), getLocale() and factory methods to build
several predefined types of BreakIterators: createWordInstance()
for word boundaries, createCharacterInstance() for locale
dependent notions of "characters", createSentenceInstance() for
sentences, createLineInstance() and createTitleInstance() -- for
title casing breaks. These factories currently return

One thing I notice here is that with this API it is not possible to
programmatically choose what is the iteration unit - you'd have to do a
switch for that. Do you think it may be a good idea to have a generic
function that allows choosing the unit programmatically?

You can create a RuleBasedBreakIterator with any rules you choose. The rules are essentially a set of regular expressions; ICU has two matching modes -- by default it takes the longest match, but it can also chain rules together. There are rules to advance, to go back, and to reach a safe position from an arbitrary position in either direction. The ICU user guide that I linked in the first e-mail has more details.

What is the notion of characters - is it grapheme characters? Is there
an option to iterate over code points too - not sure if it's useful, just
curious, as we used to have it in PHP 6 IIRC.

Yes, they are grapheme clusters. ICU has a special rule for Thai, but from what I see in the tracker, it is obsolete with recent versions of Unicode (possibly the root rule is now generic enough).

To iterate over code points, you can build a very simple RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this example here: https://gist.github.com/2843005
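To make the difference between grapheme clusters and code points concrete, here is a minimal sketch. It assumes the class names under which this API later shipped in ext/intl (IntlBreakIterator, IntlRuleBasedBreakIterator) rather than the BreakIterator names used in this thread, and a PHP version that supports the "\u{...}" escape. countParts() is a hypothetical helper, not part of the extension.

```php
<?php
// Assumption: ext/intl is loaded and exposes the Intl-prefixed class names.

// Hypothetical helper: counts the pieces of $text delimited by $bi's boundaries.
function countParts(IntlBreakIterator $bi, string $text): int
{
    $bi->setText($text);
    // getPartsIterator() yields the text between consecutive boundaries.
    return count(iterator_to_array($bi->getPartsIterator(), false));
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
$text = "e\u{0301}x";

// Locale-dependent "characters", i.e. grapheme clusters.
$graphemes = countParts(IntlBreakIterator::createCharacterInstance('en'), $text);

// The single rule '.;' puts a boundary after every code point.
$codePoints = countParts(new IntlRuleBasedBreakIterator('.;'), $text);

echo "grapheme clusters: $graphemes, code points: $codePoints\n";
```

The base letter and its combining accent form one grapheme cluster but two code points, which is why the two counts differ.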


About getAvailableLocales() - what does this actually do? Does it list all available locales in the system, the ones that have BreakIterator rules, or something else? If it's not related to BI, I'm not sure we need to have it in BI. What is the intended usage of it? Maybe it should be part of the
Locale class?

Right now, the ICU implementation just calls Locale::getAvailableLocales(), but its description is "Gets all the available locales that has localized text boundary data." so I suppose it could return a different set in the future.

Note that BreakIterator is an iterator only in the sense of the
first 'Iterator' in 'IteratorIterator', i.e., it does not
implement the Iterator interface. The reason is that there is
no sensible implementation for Iterator::key(). Using it for

Doesn't it have a notion of current position? If so, key should be the
current position.

Will this BreakIterator be usable in foreach? I'm not sure I understand
it from this description - understanding this without any usage
examples, RFCs or code snippets for intended usage is really hard and I think we should really start with doing that. I would expect this class
to work like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
   echo "Word number $i is $word\n";
}

or at least like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
   echo "Next word at position $i is: $word\n";
}

Is this the model? If not, I think we need to wrap the C API to make this possible, because this is what people expect from an iterator in PHP.

My choice here was this: the BreakIterator mirrors its ICU namesake -- it iterates over breaks, i.e., boundaries in the text. Hence, the iterator returns the *positions* of the successive boundaries. Since the position is already the value, it cannot also be used as the key.

Acknowledging that getting the text between the boundaries was going to be a common scenario, I added a method, getPartsIterator(), that yields the text between each boundary. Hence, there is one less element in this iterator than in the BreakIterator.
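The relationship between the two iterators can be sketched as follows. This is only an illustration, and it assumes the class name under which the API later shipped in ext/intl (IntlBreakIterator) instead of the BreakIterator name used in this thread.

```php
<?php
// Assumption: ext/intl is loaded and exposes IntlBreakIterator.
$text = 'foo bar';

$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText($text);

// Traversing the break iterator itself yields the boundary positions.
$boundaries = iterator_to_array($bi, false);

// Traversing the parts iterator yields the text between those boundaries.
$parts = iterator_to_array($bi->getPartsIterator(), false);

echo 'boundaries: ', implode(', ', $boundaries), "\n";
echo 'parts: ', json_encode($parts), "\n";
echo count($boundaries), ' boundaries delimit ', count($parts), " parts\n";
```

N boundaries delimit N - 1 parts, which is why the parts iterator has one element fewer.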

Neither of the iterators implements getKey(), so when traversing them the keys will be 0, 1, 2... It would probably be a good idea to change the parts iterator to give the left boundary as the key. That way one could do:

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
foreach ($bi->getPartsIterator() as $k => $v) {
    echo "$v is at position $k\n";
}

instead of

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
$pos = $bi->first();
foreach ($bi->getPartsIterator() as $v) {
    echo "$v is at position $pos\n";
    $pos = $bi->current();
}
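The key behaviour suggested here can be sketched against the intl extension as it later shipped, where getPartsIterator() accepts IntlPartsIterator::KEY_LEFT to use the left boundary as the key. The Intl-prefixed names and the KEY_LEFT constant are assumptions relative to this thread; they are not part of the API as described in this message.

```php
<?php
// Assumption: ext/intl is loaded and exposes IntlBreakIterator
// and IntlPartsIterator::KEY_LEFT.
$bi = IntlBreakIterator::createWordInstance('en');
$bi->setText('foo bar');

$positions = [];
foreach ($bi->getPartsIterator(IntlPartsIterator::KEY_LEFT) as $k => $v) {
    // $k is the offset of the boundary to the left of the part $v.
    $positions[$k] = $v;
    echo "'$v' is at position $k\n";
}
```

With the left boundary as the key, the manual bookkeeping of $pos in the second snippet above is no longer needed.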

Another possibility would be to have the break iterator itself behave as the parts iterator for iteration purposes. I don't think that is a good idea. Even though BreakIterator does not implement Iterator, people would expect next() and current() to return the next and current iterator value, while they would actually be returning the iteration key.

By the way, you can look at the test cases in the tree on github for examples: https://github.com/cataphract/php-src/commit/d289c3977ed4ba8d9ba127e5af9f709b19b8e1ba

Thanks for the comments!

--
Gustavo Lopes
