Hi!

> You can create a RuleBasedBreakIterator with any rules you choose. The 

I understand that, but I have no idea how to write proper rules for word
boundaries. I just want to tell it "give me word boundaries", not by
calling createWordBoundaries() but by calling createIterator($type) where
$type == WORD_BOUNDARIES.
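
To make it concrete, here is a rough sketch of what I have in mind. It
assumes the class is exposed by ext/intl as IntlBreakIterator (in this
thread it is just called BreakIterator); createIterator() and the
*_BOUNDARIES constants are names I made up, not part of the proposal.

```php
<?php
// Hypothetical type constants; these names are not part of the proposal.
const WORD_BOUNDARIES      = 1;
const SENTENCE_BOUNDARIES  = 2;
const LINE_BOUNDARIES      = 3;
const CHARACTER_BOUNDARIES = 4;

// Hypothetical helper: pick the iterator by type, so the caller never
// has to write ICU break rules by hand.
function createIterator($type, $locale = null)
{
    switch ($type) {
        case WORD_BOUNDARIES:
            return IntlBreakIterator::createWordInstance($locale);
        case SENTENCE_BOUNDARIES:
            return IntlBreakIterator::createSentenceInstance($locale);
        case LINE_BOUNDARIES:
            return IntlBreakIterator::createLineInstance($locale);
        case CHARACTER_BOUNDARIES:
            return IntlBreakIterator::createCharacterInstance($locale);
        default:
            throw new InvalidArgumentException("Unknown boundary type: $type");
    }
}
```

The point is only that the type becomes data, so it can come from
configuration or a function argument instead of being baked into the
method name.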

> To iterate over code points, you can build a very simple 
> RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this 
> example here: https://gist.github.com/2843005

Is there any reason not to provide this as a service for PHP users? I
understand that a specialist in ICU knows this already, but
most PHP users don't know this magic.

> Right now, the ICU implementation just calls 
> Locale::getAvailableLocales(), but its description is "Gets all the 
> available locales that has localized text boundary data." so I suppose 
> it could return a different set in the future.

My only concern is that no other classes have getAvailableLocales() and
it doesn't seem to do anything useful now, so maybe we should omit it
for now?

> Acknowledging that getting the text between the boundaries was going to 
> be a common scenario, I added a method, getPartsIterator(), that yields 
> the text between each boundary. Hence, there is one less element in this 
> iterator than in the BreakIterator.
> 
> Neither of the iterators implements getKey(), so when traversing one the keys 
> will be 0, 1, 2... It would probably be a good idea to change the 
> parts iterator to give the left boundary as the key. That way one could 
> do:
> 
> $bi = BreakIterator::createWordInstance(NULL);
> $bi->setText($foo);
> foreach ($bi->getPartsIterator() as $k => $v) {
>      echo "$v is at position $k\n";
> }

Another thing I notice here: why not make:
$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);

into:
$bi = BreakIterator::createWordInstance(NULL, $foo);

This means less boilerplate code, since if you are creating an
iterator, chances are you already have some string to iterate over.
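
In userland the shortcut would look roughly like the sketch below. It
assumes the ext/intl class name IntlBreakIterator, and
createWordInstanceFor() is a made-up helper name; ideally the factory
itself would accept the text.

```php
<?php
// Hypothetical helper: create a word iterator and attach the text in
// one call, instead of create + setText() at every call site.
function createWordInstanceFor($text, $locale = null)
{
    $bi = IntlBreakIterator::createWordInstance($locale);
    $bi->setText($text);
    return $bi;
}
```

With that, the quoted loop becomes
foreach (createWordInstanceFor($foo)->getPartsIterator() as $v) { ... }.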

> Another possibility would be to have the break iterator itself behave 
> as the parts iterator for iteration purposes. I don't think that is a 
> good idea. Even though BreakIterator does not implement Iterator, people 
> would expect next() and current() return the next and current iterator 
> value, while they would be returning the iteration key.

OK, if you have to call getPartsIterator(), that's fine as long as you can
easily do foreach on it, since that's what one expects from an iterator.
I'd also add a flag that controls whether whitespace is skipped, if this
is possible. For "foo bar", sometimes you want ['foo', ' ', 'bar']
and sometimes you want just ['foo', 'bar']. Does ICU support this somehow?
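
Just to illustrate the behaviour I mean, the "skip" variant can be
emulated in userland with a small filter. This is only a sketch:
skipWhitespaceParts() is a made-up name, it needs generators, and it only
treats ASCII whitespace as whitespace.

```php
<?php
// Hypothetical helper: drop the parts that consist only of whitespace.
// $parts may be any array or Traversable of strings, for example the
// result of getPartsIterator().
function skipWhitespaceParts($parts)
{
    foreach ($parts as $key => $part) {
        // trim() strips ASCII whitespace only; Unicode spaces such as
        // U+00A0 are not handled by this sketch.
        if (trim($part) !== '') {
            yield $key => $part;
        }
    }
}

// The parts from the "foo bar" example, with the separator dropped.
foreach (skipWhitespaceParts(array('foo', ' ', 'bar')) as $part) {
    echo $part, "\n";
}
```

Keys are passed through unchanged, so if the parts iterator ever gives
the left boundary as the key, the positions survive the filtering.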

Again, having a full description of the proposed API would be nice.
For example, what does hashCode() do?
-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
