Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

Brian Olsen Tue, 08 Aug 2023 15:05:23 -0700

Good point. It looks like the main thing Slack's TOS
<https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Salesforce_MSA.pdf?_gl=1*1u5n6fj*_ga*MTU2MzM4Mjk5OC4xNjgyNTM4NjIz*_ga_QTJQME5M5D*MTY5MTUzMDE1Mi40Mi4xLjE2OTE1MzA4MzIuMjkuMC4w>
does in section 3.3 is point us to Salesforce's External Facing Services Policy
<https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf>,
which addresses consent for businesses under NDAs on public or shared
channels, private conversations or PII being exported without consent,
and a bunch of other clearly illegal stuff we're not doing.


I think since this data is public in the sense that anyone with the
publicly available invite can join and read/see display names, we are fine.
Slack has nothing in there about a PMC admin running an export to get
access to the data that's owned by the ASF. So I believe as long as we get
consent from the community and the PMC is okay with it, we should be
fine from a legal standpoint, provided we don't include private information
like emails or private chats in the export.
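
As a rough illustration of keeping private fields out of a published export,
here is a minimal Python sketch. It assumes records shaped like a standard
Slack workspace export (a list of user objects with a nested "profile", and
lists of message objects). The allow-list, the function names, and the sample
records are all hypothetical and not something the thread has agreed on. It
only drops fields; it does not scrub emails that someone typed into a message.

```python
import json

# Assumed allow-list of message fields; anything not named here is dropped.
PUBLIC_MESSAGE_FIELDS = ("ts", "thread_ts", "user", "text")


def public_user(user):
    """Keep only the user ID and the display name shown in Slack."""
    profile = user.get("profile", {})
    return {
        "id": user.get("id"),
        # Fall back to the handle when no display name is set.
        "display_name": profile.get("display_name") or user.get("name"),
    }


def public_message(message):
    """Copy only allow-listed fields from one exported message record."""
    return {k: message[k] for k in PUBLIC_MESSAGE_FIELDS if k in message}


def sanitize(users, messages):
    """Return a publishable copy of the users and messages lists."""
    return {
        "users": [public_user(u) for u in users],
        "messages": [public_message(m) for m in messages],
    }


# Made-up sample records, for illustration only.
sample_users = [
    {
        "id": "U1",
        "name": "bits",
        "profile": {"display_name": "Bits", "email": "bits@example.com"},
    }
]
sample_messages = [
    {
        "type": "message",
        "user": "U1",
        "text": "How do I expire snapshots?",
        "ts": "1691530000.000100",
        "client_msg_id": "abc",
    }
]

published = sanitize(sample_users, sample_messages)
print(json.dumps(published, indent=2))
```

An allow-list is used here rather than a deny-list so that any field Slack
adds to its export later stays out of the published data by default.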

On Tue, Aug 8, 2023 at 4:53 PM Russell Spitzer <[email protected]>
wrote:

> I'm +1 as long as Slack TOS are ok with it. We already have full public
> archives of the mailing list and I see slack as just an extension of the
> mailing list.
>
> On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen <[email protected]>
> wrote:
>
>> Hey Iceberg Nation,
>>
>> I wanted to propose having the public Apache Iceberg Slack
>> <https://apache-iceberg.slack.com/> chat and user data for the community
>> to use as a public data source. I have a couple of specific use cases in
>> mind that I would like to use it for, which is what brought me to ask about it.
>>
>> The main problem I want to address for the community is the lack of
>> persistence of the answers we're generating in Slack. Slack is on a free
>> version that only retains the last 60 days of valuable threads happening
>> there. Questions are repeatedly asked, and this takes up time for everyone
>> in the community to answer the same questions multiple times. If we publish
>> the public chat and user data (i.e. no emails or user info outside of
>> what's displayed in Slack), then we can address this in the following ways:
>>
>>    1. We can use this as a getting started tutorial featuring pyIceberg:
>>    pull this dataset into a Python or SQL ecosystem to learn about Iceberg,
>>    but also to discover old conversations that no longer appear on Slack.
>>    We can also take the raw data and push it into a local chatbot for folks
>>    to ask questions locally, build analytics projects, etc.
>>    2. For those that are less interested in building your own chatbot or
>>    data pipeline, once this data is available, Tabular could use it to build
>>    and maintain a Discourse Forum <https://discourse.org/> (not to be
>>    confused with Discord). There are many reasons to add this on top of Slack,
>>    like persistence, discoverability via Google, curation and organization
>>    into wiki-style, to-the-point answers, and gamification, with the goal
>>    that it's not just Tabular moderating this, but that the community takes
>>    over as they build trust, similar to Stack Overflow. Of course, once we have
>>    the initial community working together there, we can use both Slack for
>>    faster messaging, and migrate specific valuable conversations to Discourse
>>    once it is done.
>>    3. Another idea would be that we could also use the Discourse forum
>>    as one of the inputs to create some sort of chatbot experience, either in
>>    Slack or nested in the docs. This would likely outperform just directly
>>    training on Slack data as answers in Slack aren't verified and curated to
>>    the most concise form possible.
>>    4. The Slack and Tabular Discourse forum would be public to read, so
>>    this would allow for other companies in the space to build their own
>>    solutions.
>>
>>
>> The idea is that we would run a daily job that would export the Slack
>> logs to some public dumping ground (GitHub or something) to store this
>> dataset. Again, only public data that you could see if you signed up and
>> logged into Slack would be exposed.
>>
>> How does this sound to everyone? Let me know if you have any questions or
>> other ideas!
>>
>> Bits
>>
>
