This is a really helpful delineation of the issues. Thank you, Maruan, for this and for all of your support with the server.
I'll open a ticket on LEGAL's Jira?

On Wed, Jan 15, 2025 at 3:55 AM sahy...@fileaffairs.de <sahy...@fileaffairs.de> wrote:

> Hi Tim,
>
> IMHO there are several parts to it.
>
> a) serving content which might look like other corps' sites can be
>    interpreted as phishing
> b) scraping and storing copyrighted content
> c) scraping and storing content containing personal data
>
> a) is being dealt with in the current form. As long as we don't
>    publicly serve the files we are fine. We could also allow
>    password-protected https access if that has a benefit over ssh.
> b) scraping copyrighted information is typically OK (there are legal
>    cases where this has been decided), although there might be cases
>    where we need to remove individual files
> c) scraping and storing personal data is mostly not OK under GDPR and
>    other acts without permission. This becomes very difficult to
>    handle. E.g. if one uploaded a file to a bug tracker, one could
>    argue that if that file contained personal data, by uploading it
>    one gave permission to use it within the context of the bug
>    tracking and the dev process behind it. That doesn't include
>    permission to load the file from that system and use it in a
>    different context.
>
> I think until c) is sorted we cannot allow access in a wider context,
> and we even need to reconsider whether we can use it at all, although
> it is very beneficial.
>
> Maybe we can have a chat with legal about that.
>
> BR
> Maruan
>
> On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote:
> > Hi Stefan,
> >
> > I'm sorry for this sudden change. I'm hoping that we can find a way
> > to make this all work again, but there are complexities. Part of the
> > challenge is that the liability is spread across several
> > organizations and individuals; part of the challenge is everything
> > to do with the varying global legal/privacy requirements around
> > crawled data. And there are other challenges.
> >
> > These corpora have been critical to numerous parsing projects at the
> > ASF and to devs and projects outside of ASF. I've heard from a few
> > others offline who are also affected by this.
> >
> > All,
> > What are our priorities? How can we move forward? Some options that
> > I see:
> >
> > 0) nuclear option: shut down the server entirely
> > 1) continue as we have it now -- no http/s access
> > 2) host reports/metadata only via https
> > 3) host "packaged" corpora in zips (password protected?) via https
> > 4) password protect https access to the corpora
> > 5) not a viable option: turn everything back on
> > 6) not a viable option: turn everything back on with a strict
> >    robots.txt policy
> >
> > Any other options? What are our preferences?
> >
> > Best,
> >
> > Tim
> >
> > On Sat, Jan 11, 2025 at 9:01 AM stefan6419846 <stefan6419...@gmail.com> wrote:
> >
> > > We at pypdf (https://github.com/py-pdf/pypdf) have been hit by the
> > > unexpected shutdown of the service and were glad to at least find
> > > this indirect announcement. Nevertheless, it seems like we have to
> > > find a suitable alternative for the previously used govdocs1 PDF
> > > files from your server, as the official govdocs1 sources do not
> > > expose the single PDF files directly.
> > >
> > > Thanks for hosting these files in the past.
> > >
> > > Best regards,
> > > Stefan
> > >
> > > On 2025/01/09 01:36:59 Tim Allison wrote:
> > > > All,
> > > > We've gotten a handful of takedown requests recently. I had
> > > > initially envisioned public sharing of files as a key component
> > > > of our server. We can still use the files and offer read access
> > > > to fellow file researchers. I'm not sure I want to deal with
> > > > further takedown requests.
> > > > As an intermediate step, we could ask robots not to crawl the
> > > > data, but that's not reliable.
> > > > So, in lieu of that, with a heavy heart, I ask if it is time to
> > > > close off public access?
> > > > WDYT?
> > > >
> > > > Best,
> > > >
> > > > Tim