On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
> On 21/02/2025 08:00, David Wright wrote:
> > I dragged the mouse
> > across the Males table and dumped it in a file.
>
> David, I recall you mentioned xpdf in your messages. It allows to
> select rectangular regions. Sometimes it is convenient since this
> strategy does not depend on order of objects inside PDF files.

Yes, xpdf is my goto PDF viewer, and I should have mentioned that
in the post.

> Other PDF viewers allows to conveniently select contiguous spans of
> text, e.g. end of some line and beginning of next one. Unfortunately
> enough PDF files have pieces of text put in almost random order. At
> least in Firefox selection may work in a quite peculiar way skipping
> some fragments and adding visually unrelated ones.

Yes, I scrape web pages from FF fairly frequently (new mostly), and
am familiar with the particular structures that result with different
organisations. And ^A^C is a useful tool that can scrape off-screen
text which gets blotted out if you try to scroll to it, ie requiring
login or whatever to view the page.

> So selection of text in PDF files may strongly depend on viewer.

Yes, most of the others I have will paste text that's as jumbled as
raw pdftotext, eg evince, zathura. With mupdf, I don't even know how
to copy, as the mouse just drags the page around.

> P.S. "pdftotext -layout" in some cases is better than without
> "-layout".

I think the results are roughly comparable with my scrapings, for
this document at least. Perhaps both pdftotext and xpdf rely on
poppler to do the work.

> When text file has properly aligned columns, instead of
> "quoting" some spaces, it may be better to add TAB characters at
> certain positions on each line. Perhaps LibreOffice Calc even has GUI
> to select column widths during importing of text files.

Yes, gnumeric has that too. But I would hate to have a lot of
mousework if I were repeating this frequently. And for a postprandial
one-off, I just took a no-tools approach (barring an editor, of
course).

Cheers,
David.
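
For anyone who does need to repeat this often, the TAB suggestion quoted above could be scripted rather than done with the mouse. What follows is only a minimal sketch, assuming the text really is column-aligned and the column offsets are known in advance. The function names, the offsets and the sample rows are all made up for illustration and are not taken from the document discussed in this thread.

```python
def split_fixed_width(line, cuts):
    """Cut one line at the given 0-based column offsets; strip each field."""
    bounds = [0, *cuts, None]  # None lets the last slice run to end of line
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def to_tsv(text, cuts):
    """Turn column-aligned text into TAB-separated lines."""
    return "\n".join("\t".join(split_fixed_width(line, cuts))
                     for line in text.splitlines())


if __name__ == "__main__":
    # Hypothetical sample rows and offsets, purely for illustration.
    sample = "Age   Count  Rate\n0-4   1200   3.1\n5-9   980    2.7"
    print(to_tsv(sample, cuts=[6, 13]))
```

Lines shorter than an offset simply yield empty fields, so ragged right edges should not break the import. The offsets still have to be read off the text by eye, much as one would do in a spreadsheet's fixed-width import dialog.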