Re: Review Request 114632: Improve pdf title extraction

Thomas Lübking Mon, 06 Jan 2014 08:19:14 -0800


> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
> 
> Luis Silva wrote:
>     What do you mean? It all works fine here.
> 
> Christoph Feck wrote:
>     Yes, because the compiler does not read comments.


Aside from this, the approach seems too naive.
DOIs have a defined structure: a leading "doi: 10" (case-insensitive, with the 
colon and whitespace optional). In general, the "problematic" tokens will also 
have a massive digit overhead, so this could be used as an additional test 
(length < 25 && looksLikeIndex()).
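
A rough sketch of what such a test could look like, in plain standard C++ so
it stands alone. looksLikeIndex() is only named above, not defined, so its body
here is an assumption: the helper names startsWithDoiPrefix() and
shouldRejectTitle(), the skipping of leading whitespace, and the "at least as
many digits as letters" threshold are all hypothetical and would need tuning
against real documents.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` begins with "doi" (any case), followed by
// an optional colon and optional whitespace, followed by "10".
bool startsWithDoiPrefix(const std::string &s)
{
    std::size_t i = 0;
    // Assumption: tolerate leading whitespace before the prefix.
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i])))
        ++i;

    const std::string doi = "doi";
    if (s.size() - i < doi.size())
        return false;
    for (char expected : doi) {
        if (std::tolower(static_cast<unsigned char>(s[i])) != expected)
            return false;
        ++i;
    }

    if (i < s.size() && s[i] == ':')
        ++i;
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i])))
        ++i;

    return s.compare(i, 2, "10") == 0;
}

// Hypothetical heuristic: a token looks like a DOI or index number if it has
// the DOI prefix, or if it contains at least as many digits as letters.
bool looksLikeIndex(const std::string &s)
{
    if (startsWithDoiPrefix(s))
        return true;

    std::size_t digits = 0;
    std::size_t letters = 0;
    for (unsigned char c : s) {
        if (std::isdigit(c))
            ++digits;
        else if (std::isalpha(c))
            ++letters;
    }
    return digits > 0 && digits >= letters;
}

// Combined test as suggested: short *and* index-like.
bool shouldRejectTitle(const std::string &title)
{
    return title.size() < 25 && looksLikeIndex(title);
}
```

With this, a short but genuine title such as "Quantum Hall Effect" would be
kept, while something like "0021-9991(95)90052-7" would still trigger the
first-page parsing.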


- Thomas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------


On Dec. 23, 2013, 4:14 p.m., Luis Silva wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
> 
> (Updated Dec. 23, 2013, 4:14 p.m.)
> 
> 
> Review request for Baloo and Vishesh Handa.
> 
> 
> Repository: kfilemetadata
> 
> 
> Description
> -------
> 
> A good portion of the scientific papers in my collection had a DOI or an 
> index number in the title. These are generally short strings, shorter than 
> the real title.
> I improve the extraction of titles from PDFs by setting a minimum length 
> below which parsing of the first page is forced.
> The cut-off length is arbitrarily set to 25 characters (three "big words").
> 
> 
> Diffs
> -----
> 
>   src/extractors/popplerextractor.cpp 
> b056581f51d10b632799586eed3cc15ac539fe80 
> 
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
> 
> 
> Testing
> -------
> 
> This improved the title extraction on my pdf collection of scientific papers 
> by quite a lot.
> 
> 
> Thanks,
> 
> Luis Silva
> 
>

