Hey Iceberg Nation, I wanted to propose publishing the public Apache Iceberg Slack <https://apache-iceberg.slack.com/> chat and user data as a public data source for the community. I have a couple of specific use cases in mind, which is what brought me to ask about it.
The main problem I want to address is the lack of persistence of the answers we're generating in Slack. The workspace is on Slack's free plan, which only retains the last 60 days of the valuable threads happening there. Questions are asked repeatedly, and the community spends time answering the same questions over and over. If we publish the public chat and user data (i.e. no emails or user info beyond what's already displayed in Slack), we can address this in the following ways:
1. We can build a getting-started tutorial featuring PyIceberg that pulls this dataset into a Python or SQL ecosystem to learn about Iceberg, and also to rediscover old conversations that no longer appear in Slack. Folks could also push the raw data into a local chatbot to ask questions locally, build analytics projects, etc.
2. For those less interested in building their own chatbot or data pipeline, once this data is available, Tabular could use it to build and maintain a Discourse forum <https://discourse.org/> (not to be confused with Discord). There are many reasons to add this on top of Slack: persistence, discoverability via Google, curation and organization into wiki-style, to-the-point answers, and gamification. The goal is that it's not just Tabular moderating this, but that the community takes over as members build trust, similar to Stack Overflow. Once we have an initial community working together there, we can keep using Slack for faster messaging and migrate specific valuable conversations to Discourse.
3. We could also use the Discourse forum as one of the inputs to some sort of chatbot experience, either in Slack or nested in the docs. This would likely outperform training directly on Slack data, since answers in Slack aren't verified and curated into the most concise form possible.
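To make idea 1 concrete, here's a minimal sketch of what the tutorial's first step could look like. It assumes a Slack-export-style JSON dump (a list of message dicts with `ts`, optional `thread_ts`, `user`, and `text`, the shape Slack's standard exports use); the function names and file layout are my own placeholders, not an agreed format:

```python
import json
from collections import defaultdict

def load_threads(export_path):
    """Group Slack-export-style messages into threads.

    Assumes one JSON file containing a list of message dicts with
    'ts', optional 'thread_ts', 'user', and 'text' keys (hypothetical
    layout, mirroring Slack's standard export shape).
    """
    with open(export_path) as f:
        messages = json.load(f)
    threads = defaultdict(list)
    for msg in messages:
        # Replies carry 'thread_ts' pointing at the parent message;
        # top-level messages key on their own 'ts'.
        root = msg.get("thread_ts", msg["ts"])
        threads[root].append(msg)
    # Order each thread chronologically by timestamp.
    for root in threads:
        threads[root].sort(key=lambda m: float(m["ts"]))
    return dict(threads)

def search(threads, term):
    """Return roots of threads whose messages mention `term` (case-insensitive)."""
    term = term.lower()
    return [root for root, msgs in threads.items()
            if any(term in m.get("text", "").lower() for m in msgs)]
```

From there the same dict could be written into an Iceberg table with PyIceberg or queried with SQL, which is exactly the kind of hands-on tutorial material I'm imagining.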
The Slack data and the Tabular Discourse forum would be public to read, so other companies in the space could build their own solutions on top of them. The idea is that we would run a daily job that exports the Slack logs to some public dumping ground (GitHub or something) to store this dataset. Again, only public data that you could already see by signing up and logging into Slack would be exposed. How does this sound to everyone? Let me know if you have any questions or other ideas! Bits
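For the daily job, here's a rough sketch of the export loop, not a final design. It assumes a `fetch` callable that wraps Slack's `conversations.history` Web API method (which paginates via `response_metadata.next_cursor`); the channel ID and output path are placeholders:

```python
import json

def export_channel(fetch, channel, out_path, oldest="0"):
    """Dump all public messages newer than `oldest` to a JSONL file.

    `fetch` is any callable mimicking Slack's conversations.history
    response shape: it takes (channel, cursor, oldest) keyword args and
    returns a dict with 'messages', 'has_more', and 'response_metadata'.
    """
    cursor = None
    count = 0
    with open(out_path, "w") as f:
        while True:
            page = fetch(channel=channel, cursor=cursor, oldest=oldest)
            for msg in page.get("messages", []):
                # Keep only fields that are already public in Slack;
                # everything else is dropped before the data leaves.
                public = {k: msg[k] for k in ("ts", "thread_ts", "user", "text")
                          if k in msg}
                f.write(json.dumps(public) + "\n")
                count += 1
            if not page.get("has_more"):
                break
            cursor = page["response_metadata"]["next_cursor"]
    return count
```

A cron job could run this per public channel, pass yesterday's high-water mark as `oldest`, and commit the resulting JSONL files to the public repo.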