> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
>
> Luis Silva wrote:
>     What do you mean? It all works fine here.
>
> Christoph Feck wrote:
>     Yes, because the compiler does not read comments.

Aside from this, the approach seems too naive? DOIs have a defined structure:
a leading "doi: 10" (ignoring case, with the colon and whitespace optional).
In general the "problematic" tokens will also have a massive digit overhead,
so this could be used as an additional test ( < 25 && looksLikeIndex() ).


- Thomas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------


On Dec. 23, 2013, 4:14 p.m., Luis Silva wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
>
> (Updated Dec. 23, 2013, 4:14 p.m.)
>
>
> Review request for Baloo and Vishesh Handa.
>
>
> Repository: kfilemetadata
>
>
> Description
> -------
>
> A good portion of the scientific papers in my collection have a DOI or an
> index number in the title field. These are in general short strings, shorter
> than the real title.
> I improve the extraction of titles from PDFs by setting a minimum length
> below which parsing of the first page is forced.
> The cut-off length is arbitrarily set to 25 characters (three "big words").
>
>
> Diffs
> -----
>
>   src/extractors/popplerextractor.cpp b056581f51d10b632799586eed3cc15ac539fe80
>
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
>
>
> Testing
> -------
>
> This improved the title extraction on my PDF collection of scientific papers
> by quite a lot.
>
>
> Thanks,
>
> Luis Silva
>
>
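
For illustration only, the combined test suggested in the review comment ( < 25 && looksLikeIndex() ) could be sketched roughly as below. This is not the code from the patch. Apart from looksLikeIndex and the 25-character cut-off, every name here (startsLikeDoi, digitRatio, shouldParseFirstPage) and the 0.5 digit-ratio threshold are assumptions made up for this sketch, and it uses plain std::string rather than whatever string type the extractor actually works with.

```cpp
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Cut-off from the review description: shorter titles are suspicious.
constexpr std::size_t kMinTitleLength = 25;

// Hypothetical threshold: share of non-space characters that are digits.
constexpr double kDigitRatioThreshold = 0.5;

// True if the string starts with "doi" (any case), an optional colon,
// optional whitespace, and then "10".
bool startsLikeDoi(const std::string& title)
{
    static const std::regex doiPrefix(R"(^\s*doi\s*:?\s*10)", std::regex::icase);
    return std::regex_search(title, doiPrefix);
}

// Fraction of non-whitespace characters that are decimal digits.
double digitRatio(const std::string& title)
{
    std::size_t digits = 0;
    std::size_t visible = 0;
    for (unsigned char c : title) {
        if (std::isspace(c)) {
            continue;
        }
        ++visible;
        if (std::isdigit(c)) {
            ++digits;
        }
    }
    return visible == 0 ? 0.0 : static_cast<double>(digits) / visible;
}

// A title "looks like an index" if it has a DOI prefix or is mostly digits.
bool looksLikeIndex(const std::string& title)
{
    return startsLikeDoi(title) || digitRatio(title) >= kDigitRatioThreshold;
}

// Combined test: only fall back to parsing the first page when the
// metadata title is both short and index-like.
bool shouldParseFirstPage(const std::string& title)
{
    return title.size() < kMinTitleLength && looksLikeIndex(title);
}
```

With this sketch, a short genuine title such as "Quantum Hall effect" would be kept, while "doi:10.1000/182" would trigger the first-page fallback. An empty title is not treated as index-like here and would need to be handled separately.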