On Mon, May 9, 2011 at 6:35 PM, Kenneth Wolcott <kennethwolc...@gmail.com> wrote:
> Hasn't someone already fixed this problem? If there isn't a CPAN module to
> perform standardized bibliographic reference formatting/parsing. I haven't
> looked at CPAN; did either of you? If a CPAN module doesn't exist, one
> should!
>

What standard?

Kalthoff K (2001) Analysis of biological development. McGraw-Hill, NY.

Or

> Manning JT, Barley L, Walton J, Lewis-Jones DI, Trivers RL, Singh D,
> Thornhill R, Rohde P, Bereczkei T, Henzi P, Soler M, Szwed A. (2000) The
> 2nd:4th digit ratio, sexual dimorphism, population differences, and
> reproductive success. evidence for sexually antagonistic genes? Evol Hum
> Behav. 21(3):163-183.

Or

> Berger, M., Lawrence, M., Demichelis, F., Drier, Y., Cibulskis, K.,
> Sivachenko, A., Sboner, A., Esgueva, R., Pflueger, D., Sougnez, C., Onofrio,
> R., Carter, S., Park, K., Habegger, L., Ambrogio, L., Fennell, T., Parkin,
> M., Saksena, G., Voet, D., Ramos, A., Pugh, T., Wilkinson, J., Fisher, S.,
> Winckler, W., Mahan, S., Ardlie, K., Baldwin, J., Simons, J., Kitabayashi,
> N., MacDonald, T., Kantoff, P., Chin, L., Gabriel, S., Gerstein, M., Golub,
> T., Meyerson, M., Tewari, A., Lander, E., Getz, G., Rubin, M., & Garraway,
> L. (2011). The genomic complexity of primary human prostate cancer Nature,
> 470 (7333), 214-220 DOI: 10.1038/nature09744

?

If there's a standard, then sure, someone has probably put that into CPAN.
The problem is that I don't think that there is, though I'd be glad to be
proven wrong.

On Mon, May 9, 2011 at 3:14 PM, Tiago Hori <tiago.h...@gmail.com> wrote:

> Hi List,
>

Howdy.

> What I want to be able to do eventually is parse each name separately and
> associate that with the title. I am not sure how yet, but I haven't even got
> there.
>

That can range from pretty simple to fairly complex, depending on how much
you want to squeeze out of that relationship.
If you just want to be able to say "Morgan, M.J. wrote an article for X
journal, titled Y", then that's just a hash (of hashes), and you need to
look no further than this mail. But if you also want to say, "Journal X has
these authors. One of them is Wilson, C.E., who co-wrote article Y, where
Crim, L.W. was also a collaborator, and whose primary author is Morgan,
M.J.", then hashes will probably not cut it anymore (a cyclical hash of
hashes might do, but that's pretty tough to handle, and _very_ rough on the
eyes). You'll probably want an object model there, or some database
interaction. But we are getting ahead of ourselves for now :)

> foreach (@entries){
>     if (/((\w)*, (([A-Z].)*),){1,}/){
>

You probably want something like

    my @names = /( \w+, \s* (?: [A-Z] \. )+ )/xg;

instead. Note the \s* after the comma (under /x, a literal space in the
pattern is ignored, so the space in "Morgan, M.J." has to be matched
explicitly) and the absence of a quantifier on the capturing group: with /g
in list context you get one capture per match, whereas a repeated group
would only keep its last repetition.

>         my $name = "$&";
>

Try not to use $& and $`. There's a program-wide speed penalty if you do.
Capturing groups should do the job.

> It works fine for the first name, but as expected if @entries contain
> several strings with authors names (I did that by matching the year and
> storing $` in the @entries) it will match the first author and it will go to
> the next $entries. Is there a way to match the pattern more than once, but
> to store each match separately?
>

You are looking for the /g switch. You can look it up in perlretut[0].

> For example, would I be able to store
> Morgan, M.J. as one item in an array and Wilson, C.E. as another one?
>

Sure. The my @names = ... from above will suffice for that.

But chances are you want more than that. In general, you have two options:
either you write several small regexes to extract the data piece by piece,
or you create a grammar to do the job for you. For the latter, there are two
main options: a (?(DEFINE)) pattern, which is pure Perl and has been in the
language since 5.10, or you pull out Regexp::Grammars from CPAN.
They are pretty similar, but Regexp::Grammars is much more powerful: it lets
you access the full parse tree, so what takes me two steps in the snippet
below, R::G would do in one.

Here's my stab at it, using (?(DEFINE))[1], named captures[2], Unicode
character properties[3], and a probably unnecessary lookbehind[1] in the
split at the end. I made some arbitrary assumptions about the data, like
saying that a title can't be longer than 52 characters or contain a period,
or that the journal's name can't have digits in it, which I suppose is a tad
disingenuous, but take it as an example, not a solution :P

    use 5.010;

    $_ = 'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. The effect of stress on reproduction in Atlantic cod. J. Fish Biol. 54, 477-488.';

    /
        (?<all_names> (?&ALL_NAMES) )
        (?<year>      (?&YEAR) )\. \s+
        (?<title>     (?&TITLE) )\. \s+
        (?<journal>   (?&JOURNAL) )\. \s*
        (?<edition>   (?&NUM)+ ), \s*
        (?<pages>     (?&NUM)+-(?&NUM)+ )\.

        (?(DEFINE)
            (?<ALL_NAMES> ( (?&FULL_NAME), \s+)+ )
            (?<FULL_NAME> (?&SURNAME), \s* (?&INITIALS) )
            (?<SURNAME>   \p{Lu}\p{L}* )
            (?<INITIALS>  (?:\p{Lu}\.)+ )
            (?<YEAR>      \p{PosixDigit}{4} )
            (?<TITLE>     [^.]{1,52} )        #Article title
            (?<JOURNAL>   \P{PosixDigit}+ )   #Journal name
            (?<NUM>       \p{PosixDigit} )    #A single digit. Maybe just Digit?
        )
    /x or die "The entry did not match the pattern";

    #On success, the results are in the %+ hash:
    my @names = split /(?<=\.),\s*/, $+{all_names};
    say for @names;    #One name per line

(The same plus a small aggregation & dumping of the results:
http://ideone.com/Od3L7)

Brian.

[0] http://perldoc.perl.org/perlretut.html
[1] http://perldoc.perl.org/perlre.html#Extended-Patterns
[2] http://perldoc.perl.org/perlretut.html#Named-backreferences and
    http://perldoc.perl.org/perlvar.html#%25%2b
[3] http://perldoc.perl.org/perluniprops.html
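
For comparison, here is a rough sketch of the other route mentioned above:
several small regexes, with /g collecting the names, and a plain hash of
hashes tying each author to the title. The %by_author layout and the
three-step split are assumptions made for illustration, not something from
the thread, and the patterns make the same kind of arbitrary assumptions
about the data (a four-digit year followed by a period, no period in the
title, no digits in the journal name).

```perl
use strict;
use warnings;
use 5.010;

my @entries = (
    'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. The effect of stress on '
  . 'reproduction in Atlantic cod. J. Fish Biol. 54, 477-488.',
);

# Hypothetical layout: author => { title => { year => ..., journal => ... } }
my %by_author;

for my $entry (@entries) {
    # Step 1: cut the entry at the four-digit year (assumed to end in a period).
    my ($authors, $year, $rest) =
        $entry =~ /\A (.*?) \b (\d{4}) \. \s+ (.*) \z/xs
        or next;

    # Step 2: with /g in list context, each match contributes one capture,
    # so every "Surname, I.N." pair ends up as its own array element.
    my @names = $authors =~ /( \p{Lu}\p{L}* , \s* (?: \p{Lu} \. )+ )/xg;

    # Step 3: the title runs to the first period; the journal name runs up to
    # the period that precedes the first digit.
    my ($title, $journal) =
        $rest =~ /\A ([^.]+) \. \s+ (\D+?) \. \s* \d/x
        or next;

    # Associate every author with the article.
    $by_author{$_}{$title} = { year => $year, journal => $journal }
        for @names;
}

for my $author (sort keys %by_author) {
    for my $title (sort keys %{ $by_author{$author} }) {
        my $info = $by_author{$author}{$title};
        say "$author wrote '$title' ($info->{journal}, $info->{year})";
    }
}
```

This only answers the simple question ("who wrote what, and where"). Going
the other way, from a journal or an article back to its co-authors, would
need a second index or the object model discussed earlier.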