Re: Help with regular expressions

Brian Fraser Mon, 09 May 2011 19:04:02 -0700

On Mon, May 9, 2011 at 6:35 PM, Kenneth Wolcott <kennethwolc...@gmail.com> wrote:


> Hasn't someone already fixed this problem?  If there isn't a CPAN module to
> perform standardized bibliographic reference formatting/parsing.  I haven't
> looked at CPAN; did either of you?  If a CPAN module doesn't exist, one
> should!
>

What standard?

Kalthoff K (2001) Analysis of biological development. McGraw-Hill, NY.


Or


> Manning JT, Barley L, Walton J, Lewis-Jones DI, Trivers RL, Singh D,
> Thornhill R, Rohde P, Bereczkei T, Henzi P, Soler M, Szwed A. (2000) The
> 2nd:4th digit ratio, sexual dimorphism, population differences, and
> reproductive success. evidence for sexually antagonistic genes? Evol Hum
> Behav. 21(3):163-183.


Or


> Berger, M., Lawrence, M., Demichelis, F., Drier, Y., Cibulskis, K.,
> Sivachenko, A., Sboner, A., Esgueva, R., Pflueger, D., Sougnez, C., Onofrio,
> R., Carter, S., Park, K., Habegger, L., Ambrogio, L., Fennell, T., Parkin,
> M., Saksena, G., Voet, D., Ramos, A., Pugh, T., Wilkinson, J., Fisher, S.,
> Winckler, W., Mahan, S., Ardlie, K., Baldwin, J., Simons, J., Kitabayashi,
> N., MacDonald, T., Kantoff, P., Chin, L., Gabriel, S., Gerstein, M., Golub,
> T., Meyerson, M., Tewari, A., Lander, E., Getz, G., Rubin, M., & Garraway,
> L. (2011). The genomic complexity of primary human prostate cancer Nature,
> 470 (7333), 214-220 DOI: 10.1038/nature09744


?

If there's a standard, then sure, someone has probably put that into CPAN.
The problem is that I don't think that there is, though I'd be glad to be
proven wrong.

On Mon, May 9, 2011 at 3:14 PM, Tiago Hori <tiago.h...@gmail.com> wrote:

> Hi List,
>
>
Howdy.



> What I want to be able to do eventually is parse each name separately and
> associate that with the title. I am not sure how yet, but I haven't even
> got there.
>
>
That can range from pretty simple to fairly complex, depending on how much
you want to squeeze out of that relationship. If you just want to be able to
say "Morgan, M.J. wrote an article for journal X, titled Y", then that's just
a hash (of hashes), and you need to look no further than this mail. But if
you also want to say, "Journal X has these authors. One of them is Wilson,
C.E., who co-wrote article Y, where Crim, L.W. was also a collaborator, and
whose primary author is Morgan, M.J.", then hashes will probably not cut it
anymore (a cyclical hash of hashes might do, but that's pretty tough to
handle, and _very_ rough on the eyes). You'll probably want an object model
there, or some database interaction.
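To make the simple case concrete, here is a minimal sketch of such a hash of hashes. The field names (journal, title) and the describe() helper are made up for illustration, and it assumes one article per author, which real data won't respect:

```perl
use strict;
use warnings;

# Hypothetical layout: author name => { journal => ..., title => ... }.
# Field names are illustrative, not taken from the original data.
my %article_by_author = (
    'Morgan, M.J.' => {
        journal => 'J. Fish Biol.',
        title   => 'The effect of stress on reproduction in Atlantic cod',
    },
);

# Build the "X wrote an article for Y, titled Z" sentence from the lookup.
# Returns nothing when the author is unknown.
sub describe {
    my ($author) = @_;
    my $entry = $article_by_author{$author} or return;
    return "$author wrote an article for $entry->{journal}, "
         . "titled $entry->{title}";
}

print describe('Morgan, M.J.'), "\n";
```

Once an author can have several articles, the inner hash would probably have to become an array of hashes, which is roughly where the structure starts getting rough on the eyes.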

But we are getting ahead of ourselves for now :)


> foreach (@entries){
>    if (/((\w)*, (([A-Z].)*),){1,}/){
>
>
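You probably want something like my @names = /( \w+, \s* (?: [A-Z] \. )+ )/xg
instead. The \s* is there because /x ignores literal spaces in the pattern,
and the group wraps a single name so that /g hands back each author
separately.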


>  my $name = "$&";
>
>
Try not to use $& and $`: there's a program-wide speed penalty if you do.
Capturing groups alone should do the job.
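As a rough sketch of what that looks like, assuming names shaped like "Surname, I.N." (first_author is a made-up helper name, and the pattern is only a guess at the data):

```perl
use strict;
use warnings;

# Return the first "Surname, I.N." found in a string by reading the
# capture group ($1) instead of $&. Returns undef when nothing matches.
sub first_author {
    my ($entry) = @_;
    return $entry =~ /(\w+, \s* (?: [A-Z] \. )+)/x ? $1 : undef;
}

print first_author('Morgan, M.J., Wilson, C.E., Crim, L.W., 1999.'), "\n";
```

The parentheses around the whole pattern capture the same text $& would have held, without touching the match variables.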


> It works fine for the first name, but as expected if @entries contain
> several strings with authors names (I did that by matching the year and
> storing $` in the @entries) it will match the first author and it will go
> to the next $entries. Is there a way to match the pattern more than once,
> but to store each match separately?
>

You are looking for the /g switch. You can look it up in perlretut[0].
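A small sketch of /g in list context, under the same "Surname, I.N." assumption about the names (author_names is a hypothetical helper, not something from your script):

```perl
use strict;
use warnings;

# Collect every "Surname, I.N." in a string. In list context, a /g match
# returns the captured text of each successive match.
sub author_names {
    my ($entry) = @_;
    my @names = $entry =~ /(\w+, \s* (?: [A-Z] \. )+)/xg;
    return @names;
}

my @names = author_names('Morgan, M.J., Wilson, C.E., Crim, L.W., 1999.');
print "$_\n" for @names;
```

Each author ends up as its own element of @names, so $names[0] is the primary author and the rest are the collaborators.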


> For example, would I be able to store
> Morgan, M.J. as one item in an array and Wilson, C.E. as another one?
>
>
>
Sure. The my @names = ... from above will suffice for that. But chances are
you want more than that. In general, you have two options: either you make
several small regexes to extract the data piece by piece, or you create a
grammar to do the job for you. For the latter, there are two main options: a
(?(DEFINE)) pattern, which is pure Perl and has been in the language since
5.10, or Regexp::Grammars from CPAN. They are pretty similar, but
Regexp::Grammars is much more powerful, letting you access the full parse
tree, so what I'll have to do in two steps in the next snippet, R::G would
do in one.
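For the first option, the piece-by-piece route might be sketched like this. parse_reference and its hash keys are invented for the example, and it assumes the "Authors, YEAR. Title. Journal VOL, PAGES." layout of your sample entry; other citation styles would need other patterns:

```perl
use strict;
use warnings;

# Hypothetical helper: pull a reference apart with several small regexes.
# Returns a hash reference, or nothing if the entry doesn't fit the layout.
sub parse_reference {
    my ($ref) = @_;
    my %parsed;

    # 1. Everything before the four-digit year is the author list.
    my ($authors, $year, $rest) =
        $ref =~ /\A (.+?) ,\s* (\d{4}) \. \s* (.*) \z/xs
        or return;
    $parsed{year} = $year;

    # 2. Individual names, one per /g match.
    $parsed{authors} = [ $authors =~ /(\w+, \s* (?: [A-Z] \. )+)/xg ];

    # 3. The title runs up to the first period.
    my ($title, $tail) = $rest =~ /\A ([^.]+) \. \s* (.*) \z/xs
        or return;
    $parsed{title} = $title;

    # 4. The journal is whatever precedes "volume, first-last."
    @parsed{qw(journal volume pages)} =
        $tail =~ /\A (.+?) \s+ (\d+) ,\s* (\d+-\d+) \.? \s* \z/xs;

    return \%parsed;
}

my $ref = parse_reference(
    'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. '
  . 'The effect of stress on reproduction in Atlantic cod. '
  . 'J. Fish Biol. 54, 477-488.'
);
print "$ref->{year}: $ref->{title} ($ref->{journal})\n";
```

Each step only has to deal with the leftovers of the previous one, which keeps the individual patterns short, at the cost of spreading the format's assumptions over several places.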

Here's my stab at it, using (?(DEFINE))[1], named captures[2], Unicode
character properties[3], and a probably unnecessary lookbehind[1] in the
split at the end. I made some arbitrary assumptions about the data, like
saying that a title can't be longer than 52 characters or contain a period,
or that the journal's name can't have digits in it, which I suppose is a
tad disingenuous, but take it as an example, not a solution :P

use 5.010;
use strict;
use warnings;

$_ = 'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. '
   . 'The effect of stress on reproduction in Atlantic cod. '
   . 'J. Fish Biol. 54, 477-488.';

/
(?<all_names> (?&ALL_NAMES) )
(?<year> (?&YEAR) )\. \s+
(?<title> (?&TITLE) )\. \s+
(?<journal> (?&JOURNAL) )\. \s*
(?<edition> (?&NUM)+ ), \s*
(?<pages> (?&NUM)+-(?&NUM)+ )\.

(?(DEFINE)
(?<ALL_NAMES> (?: (?&FULL_NAME), \s+)+ ) #Every author, trailing comma included
(?<FULL_NAME> (?&SURNAME), \s* (?&INITIALS) )
(?<SURNAME> \p{Lu}\p{L}* )
(?<INITIALS> (?:\p{Lu}\.)+ )
(?<YEAR> \p{PosixDigit}{4} )
(?<TITLE> [^.]{1,52} ) #Article title
(?<JOURNAL> \P{PosixDigit}+ ) #Journal name
(?<NUM> \p{PosixDigit} ) #A generic number. Maybe just Digit?
)
/x or die "Couldn't parse the reference\n";

#The match succeeded, so the results are in the %+ hash:
my @names = split /(?<=\.),\s*/, $+{all_names};

#One author per line
say for @names;

(The same plus a small aggregation & dumping of the results:
http://ideone.com/Od3L7)

Brian.

[0] http://perldoc.perl.org/perlretut.html
[1] http://perldoc.perl.org/perlre.html#Extended-Patterns
[2] http://perldoc.perl.org/perlretut.html#Named-backreferences and
http://perldoc.perl.org/perlvar.html#%25%2b
[3] http://perldoc.perl.org/perluniprops.html
