Hey Iceberg Nation,

I wanted to propose publishing the public Apache Iceberg Slack
<https://apache-iceberg.slack.com/> chat and user data as a data source for
the community to use. I have a couple of specific use cases in mind that
brought me to ask about this.

The main problem I want to address for the community is the lack of
persistence of the answers we're generating in Slack. We're on Slack's free
plan, which only retains the last 60 days of the valuable threads happening
there. Questions are asked repeatedly, and answering the same questions
multiple times takes up time for everyone in the community. If we publish
the public chat and user data (i.e. no emails or user info beyond what's
displayed in Slack), then we can address this in the following ways:

   1. We can build a getting-started tutorial featuring pyIceberg that
   pulls this dataset into a Python or SQL ecosystem to learn about Iceberg,
   but also to rediscover old conversations that no longer appear in Slack.
   We can also take the raw data and push it into a local chatbot for folks
   to ask questions locally, build analytics projects, etc.
   2. For those who are less interested in building their own chatbot or
   data pipeline, once this data is available, Tabular could use it to build
   and maintain a Discourse forum <https://discourse.org/> (not to be
   confused with Discord). There are many reasons to add this on top of
   Slack: persistence, discoverability via Google, curation and organization
   into wiki-style, to-the-point answers, and gamification. The goal is that
   it's not just Tabular moderating this, but that the community takes over
   as members build trust, similar to Stack Overflow. Once we have the
   initial community working together there, we can keep using Slack for
   faster messaging and migrate specific valuable conversations over to
   Discourse.
   3. Another idea would be to use the Discourse forum as one of the
   inputs for some sort of chatbot experience, either in Slack or nested in
   the docs. This would likely outperform training directly on Slack data,
   since answers in Slack aren't verified and curated into the most concise
   form possible.
   4. The Slack data and the Tabular Discourse forum would be public to
   read, so other companies in the space could build their own solutions on
   top of them.
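To make item 1 a little more concrete, here is a minimal sketch of what the first step of such a tutorial might look like. It assumes the export lands as plain JSON files of Slack message objects; the file layout and the `normalize_message` helper are illustrative assumptions, not a finished design. The normalized rows could then be written to an Iceberg table with pyIceberg or queried with SQL:

```python
import json
from datetime import datetime, timezone

def normalize_message(raw: dict) -> dict:
    """Flatten one raw Slack message object into a tabular row.

    `ts`, `user`, `thread_ts`, and `text` are standard fields in Slack's
    public message format; everything else about the dump layout here is
    an assumption for illustration.
    """
    return {
        "ts": raw["ts"],
        # Slack timestamps are epoch seconds encoded as strings.
        "posted_at": datetime.fromtimestamp(
            float(raw["ts"]), tz=timezone.utc
        ).isoformat(),
        "user": raw.get("user", ""),
        # Top-level messages start their own thread.
        "thread_ts": raw.get("thread_ts", raw["ts"]),
        "text": raw.get("text", ""),
    }

# Hypothetical usage: read one day's dump for one channel and normalize it.
# with open("slack-dump/2024-01-15/general.json") as f:
#     rows = [normalize_message(m) for m in json.load(f)]
```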


The idea is that we would run a daily job exporting the Slack logs to some
public dumping ground (GitHub or something) to store this dataset. Again,
only public data that you could already see by signing up and logging into
Slack would be exposed.
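As a rough sketch of what that daily job's core could look like (not a committed design), the pagination loop below follows the shape of Slack's conversations.history Web API responses. `fetch_page` is a stand-in for a real client call such as slack_sdk's `WebClient.conversations_history`; it's injected here so the sketch stays self-contained:

```python
import json
from pathlib import Path

def export_messages(fetch_page, out_path: Path) -> int:
    """Paginate through one channel's history and dump it to a JSON file.

    `fetch_page(cursor)` must return a dict shaped like a Slack
    conversations.history response; in a real job it would wrap something
    like slack_sdk's WebClient.conversations_history (an assumption; any
    client for the Slack Web API would do).
    """
    messages, cursor = [], None
    while True:
        resp = fetch_page(cursor)
        messages.extend(resp["messages"])
        # Slack signals "no more pages" with a missing or empty cursor.
        cursor = (resp.get("response_metadata") or {}).get("next_cursor") or None
        if cursor is None:
            break
    out_path.write_text(json.dumps(messages, indent=2))
    return len(messages)
```

A cron-style wrapper would call this once per public channel per day, passing yesterday's timestamp as the `oldest` bound on the underlying API call, and commit the resulting files to the public repo.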

How does this sound to everyone? Let me know if you have any questions or
other ideas!

Bits
