Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

Ryan Blue Mon, 14 Aug 2023 09:23:48 -0700

+1 for letting people use this dataset.

Austin, we originally used the ASF Slack, but decided to move for other
reasons (more channels, easier signup). And having history isn't actually
helping, since the information that people need is still only available in
Slack, which doesn't have very effective search.


On Mon, Aug 14, 2023 at 8:23 AM Austin Bennett <[email protected]> wrote:

> Had you considered using the ASF's Slack? That keeps history.
>
> On Tue, Aug 8, 2023 at 3:05 PM Brian Olsen <[email protected]>
> wrote:
>
>> Good point. It looks like Slack's TOS
>> <https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Salesforce_MSA.pdf?_gl=1*1u5n6fj*_ga*MTU2MzM4Mjk5OC4xNjgyNTM4NjIz*_ga_QTJQME5M5D*MTY5MTUzMDE1Mi40Mi4xLjE2OTE1MzA4MzIuMjkuMC4w>
>> in section 3.3 points us to Salesforce's External Facing Services Policy
>> <https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf>.
>> The main things that policy addresses are consent for businesses under
>> NDAs on public or shared channels, private conversations or PII being
>> exported without consent, and a bunch of other clearly illegal stuff we're
>> not doing.
>>
>> I think since this data is public in the sense that anyone with the
>> publicly available invite can join and read/see display names, we are fine.
>> Slack has nothing in there about a PMC admin running an export to get
>> access to the data that's owned by the ASF. So I believe that if we get
>> consent from the community and the PMC is okay with it, we should be
>> fine from a legal standpoint, as long as the export doesn't include private
>> information like emails or private chats.
>>
>> On Tue, Aug 8, 2023 at 4:53 PM Russell Spitzer <[email protected]>
>> wrote:
>>
>>> I'm +1 as long as Slack TOS are ok with it. We already have full public
>>> archives of the mailing list and I see Slack as just an extension of the
>>> mailing list.
>>>
>>> On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen <[email protected]>
>>> wrote:
>>>
>>>> Hey Iceberg Nation,
>>>>
>>>> I wanted to propose making the public Apache Iceberg Slack
>>>> <https://apache-iceberg.slack.com/> chat and user data available for the
>>>> community to use as a public data source. I have a couple of specific use
>>>> cases in mind for it, which is what brought me to ask about it.
>>>>
>>>> The main problem I want to address for the community is the lack of
>>>> persistence of the answers we're generating in Slack. Our Slack is on the
>>>> free plan, which only retains the last 60 days of the valuable threads
>>>> happening there. Questions are asked repeatedly, and answering the same
>>>> questions multiple times takes up time for everyone in the community. If
>>>> we publish the public chat and user data (i.e. no emails or user info
>>>> outside of what's displayed in Slack), then we can address this in the
>>>> following ways:
>>>>
>>>>    1. We can use this as a getting started tutorial featuring PyIceberg:
>>>>    pull this dataset into a Python or SQL ecosystem to learn about
>>>>    Iceberg, but also to discover old conversations that no longer appear
>>>>    on Slack. We can also take the raw data and push it into a local
>>>>    chatbot for folks to ask questions locally, build analytics projects,
>>>>    etc.
>>>>    2. For those who are less interested in building their own chatbot
>>>>    or data pipeline, once this data is available, Tabular could use it to
>>>>    build and maintain a Discourse Forum <https://discourse.org/> (not
>>>>    to be confused with Discord). There are many reasons to add this on
>>>>    top of Slack, like persistence, discoverability via Google, curation
>>>>    and organization into wiki-style, to-the-point answers, and
>>>>    gamification. The goal is that it's not just Tabular moderating this,
>>>>    but that the community takes over as they build trust, similar to
>>>>    Stack Overflow. Of course, once we have the initial community working
>>>>    together there, we can use Slack for faster messaging and migrate
>>>>    specific valuable conversations to Discourse once it is done.
>>>>    3. Another idea would be to use the Discourse forum as one of the
>>>>    inputs to create some sort of chatbot experience, either in Slack or
>>>>    nested in the docs. This would likely outperform directly training on
>>>>    Slack data, as answers in Slack aren't verified and curated to the most
>>>>    concise form possible.
>>>>    4. The Slack and Tabular Discourse forum would be public to read,
>>>>    so this would allow for other companies in the space to build their own
>>>>    solutions.
>>>>
>>>>
>>>> The idea is that we would run a daily job that would export the Slack
>>>> logs to some public dumping ground (GitHub or something) to store this
>>>> dataset. Again, only public data that you could see if you signed up and
>>>> logged into Slack would be exposed.
>>>>
>>>> How does this sound to everyone? Let me know if you have any questions
>>>> or other ideas!
>>>>
>>>> Bits
>>>>
>>>

-- 
Ryan Blue
Tabular
