Thanks for the idea about qr - I did try this before, but I've now taken
another look at it and got about a 75% improvement.
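For the archives, the change amounts to something like the below. This is a
made-up, minimal sketch (the patterns, titles and counts are all invented,
not my real data), using Benchmark's cmpthese to compare plain string
patterns against precompiled qr// ones:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Invented sample data, for illustration only.
    my @patterns = ('\bsnr\b', '\(m/f\)', '\s{2,}');
    my @titles   = ('snr  Perl  Developer (m/f)') x 500;

    # Compile each pattern once, up front.
    my @compiled = map { qr/$_/i } @patterns;

    cmpthese(-2, {
        plain_strings => sub {
            for my $t (@titles) {
                my $copy = $t;
                $copy =~ s/$_/ /gi for @patterns;  # recompiled as $_ changes
            }
        },
        precompiled => sub {
            for my $t (@titles) {
                my $copy = $t;
                $copy =~ s/$_/ /g for @compiled;   # compiled once, reused
            }
        },
    });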

As regards the uninitialized point: the errors were coming from regexes
(different ones) when the regex wasn't matching, so testing the result of
each regex match was not really an option. As an aside, the source is really
horrible - job ad listings.
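To give a flavour of where the warnings come from, here's a made-up,
simplified example - a capture from a regex that doesn't match comes back
undef:

    use strict;
    use warnings;

    my $title = 'Perl Developer';    # invented input, no seniority marker

    # The capture is undef because the regex doesn't match:
    my ($grade) = $title =~ /\b(senior|junior)\b/i;

    # This line would warn "Use of uninitialized value":
    # print "grade: $grade\n";

    # One option: default it with the defined-or operator (perl 5.10+)...
    printf "grade: %s\n", $grade // 'unspecified';

    # ...another: turn the warning off in as small a scope as possible.
    {
        no warnings 'uninitialized';
        print "grade: $grade\n";
    }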

Basically the idea is (rough sketch below):

Take a load of job vacancy posts (XML files - loads of them)
Parse the information, getting rid of as much garbage as possible
Push a distinct list of titles into a lookup hash
Run substitutions over that list against a long list of regexes
Spit out nicely formatted, clean job titles
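In (heavily simplified, made-up) code it's roughly the below - the glob,
the XML layout and the patterns are all invented stand-ins for the real
thing:

    use strict;
    use warnings;
    use XML::LibXML;    # any XML parser would do; just what I'd reach for

    # Noise-stripping regexes, compiled once (invented examples).
    my @noise = map { qr/$_/i } ('\(m\/f\)', '\burgent!*', '\s{2,}');

    my %title_for;    # distinct raw title => cleaned title

    for my $file (glob 'vacancies/*.xml') {                # hypothetical path
        my $doc = XML::LibXML->load_xml(location => $file);
        for my $node ($doc->findnodes('//job/title')) {    # hypothetical layout
            my $raw = $node->textContent;
            next if exists $title_for{$raw};  # clean each distinct title once

            my $clean = $raw;
            $clean =~ s/$_/ /g for @noise;    # run the whole regex list over it
            $clean =~ s/^\s+|\s+$//g;         # trim leading/trailing whitespace
            $title_for{$raw} = join ' ', map { ucfirst lc } split ' ', $clean;
        }
    }

    print "$_\n" for sort values %title_for;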

On 20 January 2015 at 15:21, Brandon McCaig <bamcc...@gmail.com> wrote:

> On Tue, Jan 20, 2015 at 9:51 AM, Andrew Solomon <and...@geekuni.com>
> wrote:
> > Aside from this lengthy rant^H^H^H^H discussion:) about where you put
> > your regex, have you made any progress on the performance problem you
> > put forward at the outset?
>
> I'm not quite sure that I understand what the OP is doing still, but a
> relatively simple thing to do to improve performance if he's not doing
> it yet might be to compile the regular expressions ahead of time
> (assuming they are loaded from text streams into strings and reused
> over and over again): perldoc -f qr//.
>
> If the hash is currently storing them as strings they will probably
> end up getting compiled over and over again, whereas if they are
> compiled beforehand that should speed things up a little bit. It might
> not be enough to improve performance completely, but it's certainly
> easy enough to do to be worth trying before approaching more in-depth
> refactoring.
>
> Beyond that, I'd look into alternative ways to structure the data for
> more efficient processing. Look at the lookups that need to be done
> and try to think of efficient ways to structure the data so they
> aren't needed. Alternatively, the OP could post a more complete
> program that we can play with...
>
> I wonder about the "no warnings 'uninitialized'" too. Why is that
> needed? I wonder if checking that things are defined first would be
> more efficient than attempting an operation on undef and maybe going
> through overhead to trigger and silence warnings about it or what not.
> I could see it being negligible, but I could see the opposite being
> true as well. You could probably even exclude those elements when
> loading the hashes so you never need to skip them later on. Just
> another simple thing to consider.
>
> Other than the regular expressions being text, I think the only other
> solution is restructuring the program... You can look into using
> Benchmark.pm to compare the performance of different approaches.
>
> Regards,
>
>
> --
> Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
> Castopulence Software <https://www.castopulence.org/>
> Blog <http://www.bambams.ca/>
> perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
> q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
> tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
>
