Re: Anyone interested in regular expressions, again?

Benson Margulies Sat, 31 Jan 2015 09:30:07 -0800

On Sat, Jan 31, 2015 at 12:22 PM, Bruno P. Kinoshita
<brunodepau...@yahoo.com.br> wrote:
> Hi Benson!
> I wouldn't be able to help at the moment, but some years ago I had a
> performance issue in a Nutch crawler with regexes [1] and found out about
> another library, which I think is the one you mentioned. Are you talking
> about ORO?


Yes, I believe I'm referring to ORO. I'm not really even looking for
help. I am looking to see if there is enough interest to justify
exploring pushing the code into the ASF. We ran benchmarks, and it's
faster than the JRE's built-in regex engine in a variety of cases. We
are precluded from using GPL, so we didn't look seriously at
OpenRegex. We want to have a
system where outsiders can supply any regex they like and we don't
have to worry about one of our servers being eaten by it.
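
For anyone who wants to see the "server being eaten" concern in concrete terms with java.util.regex, here is a minimal sketch of one common stopgap: wrap the input in a CharSequence that counts every character the engine reads and abort once a budget is exceeded. This is not how tcl-regex-java works and is not taken from its code; the names BudgetedRegex, CountingSequence, BudgetExceededException and matchesWithinBudget are all made up for illustration. The pattern in main is a classic backtracking-prone shape; how badly it behaves depends on the JDK version, so no output is claimed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: bounds the work that
// java.util.regex may spend on a single match of an untrusted pattern.
final class BudgetedRegex {

    /** Thrown when the engine reads more characters than the caller allowed. */
    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(long budget) {
            super("regex match exceeded budget of " + budget + " character reads");
        }
    }

    /** Wraps the input and counts every charAt() call made by the regex engine. */
    private static final class CountingSequence implements CharSequence {
        private final CharSequence text;
        private final long budget;
        private final long[] reads; // one-element counter shared with subsequences

        CountingSequence(CharSequence text, long budget, long[] reads) {
            this.text = text;
            this.budget = budget;
            this.reads = reads;
        }

        @Override
        public char charAt(int index) {
            // Backtracking re-reads the same characters, so the count grows
            // with the work done, not just with the input length.
            if (++reads[0] > budget) {
                throw new BudgetExceededException(budget);
            }
            return text.charAt(index);
        }

        @Override
        public int length() {
            return text.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new CountingSequence(text.subSequence(start, end), budget, reads);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Returns whether the whole input matches, or throws if the budget runs out. */
    static boolean matchesWithinBudget(Pattern pattern, CharSequence input, long budget) {
        CharSequence guarded = new CountingSequence(input, budget, new long[1]);
        return pattern.matcher(guarded).matches();
    }

    public static void main(String[] args) {
        // Nested quantifiers followed by a character that forces a mismatch.
        Pattern untrusted = Pattern.compile("(a+)+$");
        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
        try {
            System.out.println("matched: " + matchesWithinBudget(untrusted, input, 1_000_000L));
        } catch (BudgetExceededException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

A guard like this only caps the damage. It does not make matching time predictable, which is the reason for wanting an engine with better worst-case behavior in the first place.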


> I ended up changing the regex and never had a chance to play with ORO or
> other libraries to see if there was any advantage in not using the JRE's
> regex API. Recently I had another performance problem with Apache Hive
> SerDe and fixed it by changing the storage format and simplifying the
> regex.
>
> Have you done any performance comparison between your code and other
> libraries? More or less like this [2]? Maybe this library could be used as
> an alternative in Nutch, Commons Crawl or other projects where performance
> is important.
>
> Lastly, I'm using OpenRegex (GPLv3) [3] in a project, in combination with
> Apache OpenNLP. It is a "regular expression language and engine" that users
> can use to match strings and NLP tags. For example:
>
> <string='My Company'> <lemma='be'> <postag='RB'>* (<adjective>: <postag='JJ'>)
>
> Here <lemma='be'> matches any form of "be" (is/was/were/etc.),
> <postag='RB'>* matches zero or more adverbs, and the last part of the
> expression captures a token in a group named "adjective" (JJ is the Penn
> Treebank part-of-speech tag for adjectives).
>
> Not sure if your library will work only with text or will support other
> approaches too. OpenRegex has some TODOs in the GitHub wiki but hasn't been
> updated in a while. Maybe if your library could work similarly to OpenRegex,
> it could be incorporated in Apache OpenNLP too. Even the LanguageTool team
> showed some interest in experimenting with it [4].
>
> Just food for thought :-)
> Bruno
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1014
> [2] http://tusker.org/regex/regex_benchmark.html
> [3] https://github.com/knowitall/openregex
> [4] http://sourceforge.net/p/languagetool/mailman/languagetool-devel/thread/69f229c0a58d3245d511dafaa82feafc%40danielnaber.de/#msg31280519
>
> From: Benson Margulies <bimargul...@gmail.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Saturday, January 31, 2015 1:58 PM
> Subject: Anyone interested in regular expressions, again?
>
> So, once upon a time, there was a regex library here. It was retired,
> presumably on the grounds that it was rendered obsolete by the JRE's
> native support.
>
> However, the JRE's regular expressions have a pretty severe problem;
> they have unbounded (or at least very, very bad) execution time for
> some combinations of data and regex.
>
> To cope with this, we ported the Henry Spencer regular expression
> library (as found in TCL) from C to Java.
>
> Thus: https://github.com/basis-technology-corp/tcl-regex-java
>
> Is anyone interested in this? Give or take the possible IP muddle of
> the original C code, I could grant it easily.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org

