Re: Where do we go from here? WAS: Turning off public access to the regression corpora?

Dave Fisher Thu, 16 Jan 2025 05:50:36 -0800

I think the question about (c) might best be directed to the VP, Privacy. It 
feels similar to email exposure of PII on mailing lists. Everything that was 
crawled was publicly available at one time or another.


I could be wrong, but Privacy knows GDPR while LEGAL knows the AL2, etc.

> On Jan 16, 2025, at 8:26 AM, Tim Allison <talli...@apache.org> wrote:
> 
> This is a really helpful delineation of the issues. Thank you, Maruan, for
> this and for all of your support with the server.
> 
> I'll open a ticket on LEGAL's jira?
> 
> On Wed, Jan 15, 2025 at 3:55 AM sahy...@fileaffairs.de <
> sahy...@fileaffairs.de> wrote:
> 
>> Hi Tim,
>> 
>> IMHO there are several parts to it.
>> 
>> a) serving content which might look like other corps sites can be
>> interpreted as phishing
>> b) scraping and storing copyrighted content
>> c) scraping and storing content containing personal data
>> 
>> a) is being dealt with in the current form. As long as we don't
>> publicly serve the files we are fine. We could also allow password
>> protected https access if that has a benefit over ssh.
>> b) scraping copyrighted information is typically OK (there are legal
>> cases where this has been decided) although there might be cases where
>> we need to remove individual files
>> c) scraping and storing personal data without permission is mostly not
>> OK under GDPR and other laws. This becomes very difficult to handle.
>> E.g. if someone uploaded a file containing personal data to a bug
>> tracker, one could argue that by uploading it they gave permission to
>> use it within the context of the bug tracking and the dev process
>> behind it. That doesn't include permission to load the file from that
>> system and use it in a different context.
>> 
>> I think until c) is sorted we cannot allow access in a wider context
>> and may even need to reconsider whether we can use it at all, despite
>> it being very beneficial.
>> 
>> Maybe we can have a chat with legal about that.
>> 
>> BR
>> Maruan
>> 
>> 
>> 
>> 
>> On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote:
>>> Hi Stefan,
>>> 
>>>  I'm sorry for this sudden change. I'm hoping that we can find a way to
>>> make this all work again, but there are complexities. Part of the
>>> challenge is that the liability is spread across several organizations
>>> and individuals; part of the challenge is everything to do with the
>>> varying global legal/privacy requirements around crawled data. And there
>>> are other challenges.
>>> 
>>>  These corpora have been critical to numerous parsing projects at the ASF
>>> and to devs and projects outside of ASF. I've heard from a few others
>>> offline who are also affected by this.
>>> 
>>> 
>>> All,
>>>  What are our priorities? How can we move forward? Some options that
>>> I see:
>>> 
>>> 0) nuclear option: shut down the server entirely
>>> 1) continue as we have it now -- no http/s access
>>> 2) host reports/metadata only via https
>>> 3) host "packaged" corpora in zips (password protected?) via https
>>> 4) password protect https access to the corpora
>>> 5) not a viable option: turn everything back on
>>> 6) not a viable option: turn everything back on with a strict
>>> robots.txt policy
>>> 
>>>  Any other options? What are our preferences?
>>> 
>>>          Best,
>>> 
>>>                Tim
>>> 
>>> On Sat, Jan 11, 2025 at 9:01 AM stefan6419846 <stefan6419...@gmail.com> wrote:
>>> 
>>>> We at pypdf (https://github.com/py-pdf/pypdf) have been hit by the
>>>> unexpected shutdown of the service and were glad to at least find this
>>>> indirect announcement. Nevertheless, it seems like we have to find a
>>>> suitable alternative for the previously used govdocs1 PDF files from
>>>> your server, as the official govdocs1 sources do not expose the single
>>>> PDF files directly.
>>>> 
>>>> Thanks for hosting these files in the past.
>>>> 
>>>> Best regards,
>>>> Stefan
>>>> 
>>>> On 2025/01/09 01:36:59 Tim Allison wrote:
>>>>> All,
>>>>> We've gotten a handful of takedown requests recently. I had initially
>>>>> envisioned public sharing of files as a key component of our server. We
>>>>> can still use the files and offer read access to fellow file
>>>>> researchers. I'm not sure I want to deal with further takedown requests.
>>>>> As an intermediate step, we could ask robots not to crawl the data, but
>>>>> that's not reliable.
>>>>> So, in lieu of that, with heavy heart, I ask if it is time to close off
>>>>> public access?
>>>>>  WDYT?
>>>>> 
>>>>>          Best,
>>>>> 
>>>>>                    Tim
>>>>> 
>>>> 
>> 
>>
