I echo what others have said about getting MARC. But a few thoughts about screen scraping:
1. Obey robots.txt in all respects.

2. Obey best practices, such as waiting at least one second between requests.

3. If the site doesn't exclude robots, then it is absolutely getting hit already. Chances are it's getting hit a LOT.

4. I might go easy on, or not even touch, some old OPAC without a robots.txt. Some early OPACs die if you hit them too much. But if it has a robots.txt that allows you, you should trust it.

5. I believe you're talking about UCLA's Primo? They have a non-exclusionary robots.txt, which even includes a sitemap ( https://uci.primo.exlibrisgroup.com/robots.txt ). To my mind that's an open door.

Best,
Tim

On Thu, Nov 25, 2021 at 2:55 PM M Belvadi <mbelv...@gmail.com> wrote:
> Hi, all.
>
> What do you all think about code that screen-scrapes (e.g. Python's
> Beautiful Soup) library OPACs?
> Is it OK to do?
> OK if it's throttled to a specific rate of hits per minute?
> OK if throttled AND it's a really big library system where the load might
> not be relatively significant?
>
> Not entirely unrelated: is there an API for the new University of
> California Library Search system?
>
> Melissa Belvadi
> mbelv...@gmail.com

--
Check out my library at http://www.librarything.com/profile/timspalding
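The etiquette above (check robots.txt before fetching, and keep at least one second between requests) can be sketched in Python with the standard library's urllib.robotparser. This is a minimal illustration, not anyone's production crawler; the robots.txt content and URLs below are made-up examples, not UCLA's or any real OPAC's:

```python
import time
from urllib import robotparser

# Hypothetical robots.txt for illustration only -- a real crawler would
# fetch https://<site>/robots.txt with rp.set_url(...) and rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 1
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def fetch_allowed(url, user_agent="*"):
    """Return True only if robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)

# Throttle: never issue requests closer together than min_delay seconds.
_last_request = [0.0]

def throttle(min_delay=1.0):
    """Sleep as needed so requests are at least min_delay seconds apart."""
    elapsed = time.monotonic() - _last_request[0]
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)
    _last_request[0] = time.monotonic()

print(fetch_allowed("https://example.org/catalog/record/123"))  # True
print(fetch_allowed("https://example.org/admin/settings"))      # False
```

In real use you would call throttle() before each request, and honor the site's Crawl-delay value (rp.crawl_delay("*")) if it asks for more than one second.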