Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

Russell Spitzer Tue, 08 Aug 2023 14:53:11 -0700

I'm +1 as long as the Slack TOS is OK with it. We already have full public
archives of the mailing list, and I see Slack as just an extension of the
mailing list.


On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen <[email protected]> wrote:

> Hey Iceberg Nation,
>
> I wanted to propose publishing the public Apache Iceberg Slack
> <https://apache-iceberg.slack.com/> chat and user data for the community
> to use as a public data source. I have a couple of specific use cases in
> mind for it, which is what prompted me to ask.
>
> The main problem I want to address for the community is the lack of
> persistence of the answers we're generating in Slack. Slack is on a free
> version that only retains the last 60 days of valuable threads happening
> there. Questions are repeatedly asked, and this takes up time for everyone
> in the community to answer the same questions multiple times. If we publish
> the public chat and user data (i.e. no emails or user info outside of
> what's displayed in Slack), then we can address this in the following ways:
>
>    1. We can use this as the basis for a getting-started tutorial
>    featuring PyIceberg: pull this dataset into a Python or SQL ecosystem
>    to learn about Iceberg, and also to discover old conversations that no
>    longer appear on Slack. We can also take the raw data and push it into a
>    local chatbot for folks to ask questions locally, build analytics projects,
>    etc.
>    2. For those that are less interested in building your own chatbot or
>    data pipeline, once this data is available, Tabular could use it to build
>    and maintain a Discourse Forum <https://discourse.org/> (not to be
>    confused with Discord). There are many reasons to add this on top of Slack,
>    like persistence, discoverability via Google, curation and organization
>    into wiki-style, to-the-point answers, and gamification. The goal is
>    that it's not just Tabular moderating this, but that the community takes
>    over as they build trust, similar to Stack Overflow. Of course, once we have
>    the initial community working together there, we can use Slack for
>    faster messaging and migrate specific valuable conversations to Discourse
>    once it is done.
>    3. Another idea would be to use the Discourse forum
>    as one of the inputs to create some sort of chatbot experience, either in
>    Slack or nested in the docs. This would likely outperform training directly
>    on Slack data, as answers in Slack aren't verified or curated into
>    the most concise form possible.
>    4. The Slack and Tabular Discourse forum would be public to read, so
>    this would allow for other companies in the space to build their own
>    solutions.
>
>
> The idea is that we would run a daily job that would export the Slack logs
> to some public dumping ground (GitHub or something) to store this dataset.
> Again, only public data that you could see if you signed up and logged into
> Slack would be exposed.
>
> How does this sound to everyone? Let me know if you have any questions or
> other ideas!
>
> Bits
>
