On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
> On 21/02/2025 08:00, David Wright wrote:
> > I dragged the mouse
> > across the Males table and dumped it in a file.
> 
> David, I recall you mentioned xpdf in your messages. It allows to
> select rectangular regions. Sometimes it is convenient since this
> strategy does not depend on order of objects inside PDF files.

Yes, xpdf is my go-to PDF viewer, and I should have mentioned that
in the post.

> Other PDF viewers allows to conveniently select contiguous spans of
> text, e.g. end of some line and beginning of next one. Unfortunately
> enough PDF files have pieces of text put in almost random order. At
> least in Firefox selection may work in a quite peculiar way skipping
> some fragments and adding visually unrelated ones.

Yes, I scrape web pages from FF fairly frequently (new mostly),
and am familiar with the particular structures that result with
different organisations. And ^A^C is a useful tool that can
scrape off-screen text which gets blotted out if you try to scroll
to it, i.e. on pages that require a login or whatever to view.

> So selection of text in PDF files may strongly depend on viewer.

Yes, most of the others I have will paste text that's as jumbled
as raw pdftotext, eg evince, zathura. With mupdf, I don't even
know how to copy, as the mouse just drags the page around.

> P.S. "pdftotext -layout" in some cases is better than without
> "-layout".

I think the results are roughly comparable with my scrapings,
for this document at least. Perhaps both pdftotext and xpdf
rely on poppler to do the work.

> When text file has properly aligned columns, instead of
> "quoting" some spaces, it may be better to add TAB characters at
> certain positions on each line. Perhaps LibreOffice Calc even has GUI
> to select column widths during importing of text files.

Yes, gnumeric has that too. But I would hate to have a lot of
mousework if I were repeating this frequently. And for a
postprandial one-off, I just took a no-tools approach
(barring an editor, of course).
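
For anyone who does need to repeat this, the TAB idea quoted above
could be scripted rather than moused. A minimal sketch in Python,
assuming the columns really are aligned (as after "pdftotext -layout")
and that you have worked out the cut positions by eye; the function
name, the sample line and the offsets 15 and 19 are all made up for
illustration, not taken from the document discussed here:

```python
def tabify(line, cuts):
    """Split a fixed-width line at the character offsets in `cuts`
    and rejoin the stripped fields with TAB characters."""
    bounds = [0, *cuts, None]            # None means "to end of line"
    fields = [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return "\t".join(fields)


# Hypothetical sample: three columns starting at offsets 0, 15 and 19.
sample = "John Smith".ljust(15) + "42".ljust(4) + "Leeds"
print(repr(tabify(sample, [15, 19])))    # 'John Smith\t42\tLeeds'
```

Fields are stripped, so short lines and ragged trailing spaces come
out as empty or trimmed cells rather than raising an error. The
result should import into gnumeric or LibreOffice Calc as
TAB-separated text without any column-width dialogue.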

Cheers,
David.
