Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

Austin Bennett Mon, 14 Aug 2023 08:23:31 -0700

Had you considered using the ASF's Slack? That keeps history.

On Tue, Aug 8, 2023 at 3:05 PM Brian Olsen <[email protected]> wrote:


> Good point. It looks like section 3.3 of Slack's TOS
> <https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Salesforce_MSA.pdf?_gl=1*1u5n6fj*_ga*MTU2MzM4Mjk5OC4xNjgyNTM4NjIz*_ga_QTJQME5M5D*MTY5MTUzMDE1Mi40Mi4xLjE2OTE1MzA4MzIuMjkuMC4w>
> points us to Salesforce's External Facing Services Policy
> <https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf>.
> The main things that policy addresses are consent for businesses under NDAs
> on public or shared channels, private conversations or PII being exported
> without consent, and a bunch of other clearly illegal stuff we're not doing.
>
> I think since this data is public in the sense that anyone with the
> publicly available invite can join and read/see display names, we are fine.
> Slack has nothing in there about a PMC admin running an export to get
> access to the data that's owned by the ASF. So I believe as long as we get
> consent from the community and the PMC is okay with it, then we should be
> fine from a legal standpoint, provided the export doesn't include private
> information like emails or private chats.
>
> On Tue, Aug 8, 2023 at 4:53 PM Russell Spitzer <[email protected]>
> wrote:
>
>> I'm +1 as long as Slack TOS are ok with it. We already have full public
>> archives of the mailing list and I see Slack as just an extension of the
>> mailing list.
>>
>> On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen <[email protected]>
>> wrote:
>>
>>> Hey Iceberg Nation,
>>>
>>> I wanted to propose publishing the public Apache Iceberg Slack
>>> <https://apache-iceberg.slack.com/> chat and user data for the
>>> community to use as a public data source. I have a couple of specific use
>>> cases in mind for it, which is what brought me to ask about it.
>>>
>>> The main problem I want to address for the community is the lack of
>>> persistence of the answers we're generating in Slack. The workspace is on
>>> a free plan that only retains the last 60 days of the valuable threads
>>> happening there. The same questions get asked repeatedly, which takes up
>>> time for everyone in the community who answers them. If we publish
>>> the public chat and user data (i.e. no emails or user info outside of
>>> what's displayed in Slack), then we can address this in the following ways:
>>>
>>>    1. We can use this in a getting-started tutorial featuring pyIceberg
>>>    that pulls this dataset into a Python or SQL ecosystem, both to learn
>>>    about Iceberg and to discover old conversations that no longer appear
>>>    on Slack. We can also take the raw data and push it into a local
>>>    chatbot for folks to ask questions locally, build analytics projects,
>>>    etc.
>>>    2. For those who are less interested in building their own chatbot
>>>    or data pipeline, once this data is available, Tabular could use it to
>>>    build and maintain a Discourse Forum <https://discourse.org/> (not
>>>    to be confused with Discord). There are many reasons to add this on top
>>>    of Slack, like persistence, discoverability via Google, curation and
>>>    organization into wiki-style, to-the-point answers, and gamification.
>>>    The goal is that it's not just Tabular moderating this, but that the
>>>    community takes over as they build trust, similar to Stack Overflow. Of
>>>    course, once we have the initial community working together there, we can
>>>    use Slack for faster messaging and migrate specific valuable
>>>    conversations to Discourse once it is done.
>>>    3. Another idea would be to use the Discourse forum
>>>    as one of the inputs to create some sort of chatbot experience, either in
>>>    Slack or nested in the docs. This would likely outperform training
>>>    directly on Slack data, as answers in Slack aren't verified or curated to
>>>    the most concise form possible.
>>>    4. The Slack and Tabular Discourse forum would be public to read, so
>>>    this would allow for other companies in the space to build their own
>>>    solutions.
>>>
>>>
>>> The idea is that we would run a daily job that would export the Slack
>>> logs to some public dumping ground (GitHub or something) to store this
>>> dataset. Again, only public data that you could see if you signed up and
>>> logged into Slack would be exposed.
>>>
>>> How does this sound to everyone? Let me know if you have any questions
>>> or other ideas!
>>>
>>> Bits
>>>
>>
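
For anyone who wants to picture the filtering step of the daily export job
proposed above, here is a minimal sketch. It assumes each exported message
arrives as a JSON-like dictionary. The field names (`ts`, `thread_ts`, `user`,
`text`, `user_profile`, `display_name`) and the allow-list itself are
illustrative assumptions, not a confirmed description of Slack's export schema
or of what the PMC would approve.

```python
# Hypothetical allow-lists: only these keys survive the export.
# Anything not listed (emails, client ids, etc.) is dropped by default.
PUBLIC_MESSAGE_FIELDS = ("ts", "thread_ts", "user", "text")
PUBLIC_PROFILE_FIELDS = ("display_name",)


def scrub_message(message: dict) -> dict:
    """Return a copy of one message containing only allow-listed fields."""
    public = {k: message[k] for k in PUBLIC_MESSAGE_FIELDS if k in message}
    profile = message.get("user_profile") or {}
    kept = {k: profile[k] for k in PUBLIC_PROFILE_FIELDS if k in profile}
    if kept:
        public["user_profile"] = kept
    return public


def scrub_messages(messages: list, is_public_channel: bool) -> list:
    """Scrub a batch of messages; emit nothing for non-public channels."""
    if not is_public_channel:
        return []
    return [scrub_message(m) for m in messages]


if __name__ == "__main__":
    sample = {
        "ts": "1691530152.000100",
        "user": "U123",
        "text": "How do I expire snapshots?",
        "user_profile": {"display_name": "bits", "email": "bits@example.com"},
    }
    print(scrub_messages([sample], is_public_channel=True))
```

An allow-list is used rather than a block-list so that any field the job does
not explicitly know about stays out of the published dataset.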
