Coming from a "pleb", my only concern is the name if the class is in the global scope. A "BreakIterator" to me sounds like something related to breaking out of a looping structure, and not something used for iterating over various language structure boundaries. If it's in an ICU namespace, then it's not a problem, as it's clearly related to Unicode.

Cheers,
David

On 31/05/12 21:21, Gustavo Lopes wrote:
> Hi
>
> I've wrapped ICU's BreakIterator and RuleBasedBreakIterator. I stopped
> short of adding a procedural interface. I think there's a larger
> expectation of having an OOP interface when working with iterators.
> What do you think? If there's no procedural interface, I'll change the
> instances of zend_parse_methods to zpp for performance.
>
> Now I'll copy the commit message here if someone wants to comment on a
> specific point inline:
>
> ----
> BreakIterator and RuleBasedBreakIterator added
> This commit adds wrappers for the classes BreakIterator and
> RuleBasedBreakIterator. The C++ ICU classes are described here:
> <http://icu-project.org/apiref/icu4c/classBreakIterator.html>
> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html>
>
> Additionally, a tutorial is available at:
> <http://userguide.icu-project.org/boundaryanalysis>
>
> This implementation wraps UTF-8 text in a UText. The text is
> iterated without any copying or conversion to UTF-16. There is
> also no validation that the input is actually UTF-8; where there
> are malformed sequences, the UText will simply yield U+FFFD.
>
> The class BreakIterator cannot be instantiated directly (has a
> private constructor). It provides the interface exposed by the ICU
> abstract class with the same name. The PHP class is not abstract
> because we may use it to wrap native subclasses of BreakIterator
> that we don't know how to wrap. This class includes methods to
> move the iterator position to the beginning (first()), to the
> end (last()), forward (next()), backwards (previous()), to the
> boundary preceding a certain position (preceding()) and following
> a certain position (following()) and to obtain the current position
> (current()). next() can also be used to advance or recede an
> arbitrary number of positions.
>
> BreakIterator also exposes other native methods:
> getAvailableLocales(), getLocale() and factory methods to build
> several predefined types of BreakIterators: createWordInstance()
> for word boundaries, createCharacterInstance() for locale
> dependent notions of "characters", createSentenceInstance() for
> sentences, createLineInstance() and createTitleInstance() -- for
> title casing breaks. These factories currently return
> RuleBasedBreakIterators where the names of the rule sets are found
> in the ICU data, observing the passed locale (although the locale
> is taken into consideration, there are very few exceptions to the
> root rules).
>
> The clone and compare_object PHP object handlers are also
> implemented, though the comparison does not yield meaningful results
> when used with >, <, >= and <=.
>
> Note that BreakIterator is an iterator only in the sense of the
> first 'Iterator' in 'IteratorIterator', i.e., it does not
> implement the Iterator interface. The reason is that there is
> no sensible implementation for Iterator::key(). Using it for
> an ordinal of the current boundary is not feasible because
> we are allowed to move to any boundary at any time. If we were
> to determine the current ordinal when last() is called we'd
> have to traverse the whole input text to find out how many
> breaks there were before. Therefore, BreakIterator implements
> only Traversable. It can be wrapped in an IteratorIterator,
> but the usual warnings apply.
>
> Finally, I added a convenience method to BreakIterator:
> getPartsIterator(). This provides an IntlIterator, backed
> by the BreakIterator PHP object (i.e. moving the pointer or
> changing the text in BreakIterator affects the iterator
> and also moving the iterator affects the backing BreakIterator),
> which allows traversing the text between each boundary.
> This iterator uses the original text to retrieve the text
> between two positions, not the code points returned by the
> wrapping UText.
> Therefore, if the text includes invalid code
> unit sequences, these invalid sequences will be in the output
> of this iterator, not U+FFFD code points.
>
> The class RuleBasedBreakIterator exposes a constructor that allows
> building an iterator from arbitrary compiled or non-compiled
> rules. The form of these rules is described in the tutorial linked
> above. The rest of the methods allow retrieving the rules --
> getRules() and getCompiledRules() --, a hash code of the rule set
> (hashCode()) and the rules statuses (getRuleStatus() and
> getRuleStatusVec()).
>
> Because the RuleBasedBreakIterator constructor may return parse
> errors, I reuse the UParseError to text function that was in the
> transliterator files. Therefore, I moved that function to
> intl_error.c.
>
> common_enum.cpp was also changed, mainly to expose previously
> static functions. This avoided code duplication when implementing
> the BreakIterator iterator and the IntlIterator returned by
> BreakIterator::getPartsIterator().
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
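
For anyone trying to picture the semantics described in the commit message, here is a rough sketch in plain Python. It is not the PHP or ICU API. The class name BoundaryCursor and the precomputed list of byte offsets are assumptions made purely for illustration (ICU computes boundaries from rules, lazily), and the behaviour at the ends is simplified. It only mimics the navigation methods and shows why a parts iterator that slices the original bytes returns malformed sequences untouched, where decoding would give U+FFFD.

```python
import bisect

DONE = -1  # sentinel for "no such boundary" (illustrative value)


class BoundaryCursor:
    """Toy stand-in for a break iterator over precomputed byte offsets."""

    def __init__(self, text, boundaries):
        self.text = text                      # original bytes, never converted
        self.boundaries = sorted(boundaries)  # hypothetical boundary offsets
        self.index = 0

    def first(self):
        self.index = 0
        return self.current()

    def last(self):
        self.index = len(self.boundaries) - 1
        return self.current()

    def current(self):
        return self.boundaries[self.index]

    def next(self, n=1):
        """Advance n boundaries, or recede if n is negative."""
        target = self.index + n
        if not 0 <= target < len(self.boundaries):
            return DONE  # position is left unchanged in this sketch
        self.index = target
        return self.current()

    def previous(self):
        return self.next(-1)

    def following(self, offset):
        """Move to the first boundary strictly after offset."""
        i = bisect.bisect_right(self.boundaries, offset)
        if i == len(self.boundaries):
            return DONE
        self.index = i
        return self.current()

    def preceding(self, offset):
        """Move to the last boundary strictly before offset."""
        i = bisect.bisect_left(self.boundaries, offset) - 1
        if i < 0:
            return DONE
        self.index = i
        return self.current()

    def parts(self):
        """Yield the raw bytes between consecutive boundaries.

        Iterating moves the cursor itself, so the two stay in sync.
        """
        start = self.first()
        while (end := self.next()) != DONE:
            yield self.text[start:end]
            start = end


text = b"caf\xff bar"  # \xff is not valid UTF-8
cursor = BoundaryCursor(text, [0, 4, 5, 8])
parts = list(cursor.parts())                      # [b"caf\xff", b" ", b"bar"]
decoded = text.decode("utf-8", errors="replace")  # "caf\ufffd bar"
```

Note that current() returns an offset into the text, not an ordinal. The sketch only knows its ordinal because the boundaries were computed up front; with lazily computed boundaries, calling last() tells you nothing about how many breaks came before it, which is the Iterator::key() problem described in the commit message.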