On Sat, Jan 31, 2015 at 12:22 PM, Bruno P. Kinoshita <brunodepau...@yahoo.com.br> wrote:
> Hi Benson!
>
> I wouldn't be able to help at the moment, but some years ago I had a
> performance issue in a Nutch crawler with regexes [1] and found out about
> this other library that I think you mentioned. Are you talking about ORO?

Yes, I believe I'm referring to ORO.

I'm not really even looking for help. I am looking to see whether there
is enough interest to justify exploring pushing the code into the ASF.
We did benchmarks; it's faster than the built-in Java regex
implementation in a variety of cases. We are precluded from using GPL
code, so we didn't look seriously at OpenRegex. We want a system where
outsiders can supply any regex they like and we don't have to worry
about one of our servers being eaten by it.

> I ended up changing the regex and never had a chance to play with ORO or
> other libraries to see if there was any advantage in not using the JRE's
> regex API. Recently I had another performance problem, with an Apache Hive
> SerDe, and fixed it by changing the storage format and simplifying the
> regex.
>
> Have you done any performance comparison between your code and other
> libraries? More or less like this [2]? Maybe this library could be used as
> an alternative in Nutch, Commons Crawl or in other projects where
> performance is important.
>
> Lastly, I'm using OpenRegex (GPLv3) [3] in a project, in combination with
> Apache OpenNLP. It is a "regular expression language and engine" that users
> can use to match strings and NLP tags. For example:
>
>     <string='My Company'> <lemma='be'> <postag='RB'>* (<adjective>: <postag='JJ'>))
>
> Where <lemma='be'> will match any form of be/is/was/were/etc.,
> <postag='RB'>* matches zero or more adverbs, and the last part of the
> expression will find a named token "adjective" (JJ is the Penn Treebank
> part-of-speech tag for adjectives).
>
> Not sure if your library will work only with text or will support other
> approaches too. OpenRegex has some TODOs in the GitHub wiki but hasn't been
> updated in a while. Maybe if your library could work similarly to OpenRegex,
> it could be incorporated in Apache OpenNLP too. Even the LanguageTool team
> showed some interest in experimenting with it [4].
>
> Just food for thought :-)
> Bruno
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1014
> [2] http://tusker.org/regex/regex_benchmark.html
> [3] https://github.com/knowitall/openregex
> [4] http://sourceforge.net/p/languagetool/mailman/languagetool-devel/thread/69f229c0a58d3245d511dafaa82feafc%40danielnaber.de/#msg31280519
>
> From: Benson Margulies <bimargul...@gmail.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Saturday, January 31, 2015 1:58 PM
> Subject: Anyone interested in regular expressions, again?
>
> So, once upon a time, there was a regex library here. It was retired,
> presumably on the grounds that it was rendered obsolete by the JRE's
> native support.
>
> However, the JRE's regular expressions have a pretty severe problem:
> they have unbounded (or at least very, very bad) execution time for
> some combinations of data and regex.
>
> To cope with this, we ported the Henry Spencer regular expression
> library (as found in TCL) from C to Java.
>
> Thus: https://github.com/basis-technology-corp/tcl-regex-java
>
> Is anyone interested in this? Give or take the possible IP muddle of
> the original C code, I could grant it easily.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
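
For readers who want to see what "unbounded execution time" looks like
in practice, and one common way to contain it while staying on
java.util.regex, here is a minimal sketch. It is not the approach taken
by the tcl-regex-java port and is not part of that library's API; the
class names (RegexBudgetExceededException, BudgetedCharSequence,
BoundedRegex) and the budget parameter are hypothetical, chosen only for
illustration. The idea is to hand the matcher a CharSequence wrapper
that counts charAt() calls and aborts once a limit is exceeded, so a
pathological pattern such as (a+)+b cannot run indefinitely. How badly a
given pattern backtracks varies between JDK versions, so the budget
would need tuning for real use.

```java
import java.util.regex.Pattern;

/** Hypothetical exception: thrown when a match uses up its read budget. */
final class RegexBudgetExceededException extends RuntimeException {
    RegexBudgetExceededException(long budget) {
        super("regex match exceeded budget of " + budget + " character reads");
    }
}

/**
 * Hypothetical wrapper: counts every charAt() call made by the regex
 * engine and throws once the count passes the budget.
 */
final class BudgetedCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long budget;
    private final long[] used; // shared counter, also used by subSequence views

    BudgetedCharSequence(CharSequence inner, long budget) {
        this(inner, budget, new long[1]);
    }

    private BudgetedCharSequence(CharSequence inner, long budget, long[] used) {
        this.inner = inner;
        this.budget = budget;
        this.used = used;
    }

    @Override
    public char charAt(int index) {
        if (++used[0] > budget) {
            throw new RegexBudgetExceededException(budget);
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new BudgetedCharSequence(inner.subSequence(start, end), budget, used);
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

/** Hypothetical helper: full-match with an upper bound on character reads. */
final class BoundedRegex {
    static boolean matches(Pattern pattern, CharSequence input, long budget) {
        return pattern.matcher(new BudgetedCharSequence(input, budget)).matches();
    }

    public static void main(String[] args) {
        // A classic nested-quantifier pattern; cost depends on the JDK version.
        Pattern risky = Pattern.compile("(a+)+b");
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 30; i++) {
            input.append('a');
        }
        try {
            System.out.println("matched: " + matches(risky, input, 1_000_000L));
        } catch (RegexBudgetExceededException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

This only caps the damage; it does not make matching fast. An engine
with bounded matching time avoids the need for such a guard in the first
place.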