I also think that is a good idea and I'm interested in taking this up. But before I do, here are a couple of clarifications I would like. From what I recall of tabulario/spark-iceberg, it contains a few things that are not in the official Spark image:

1. A few extra Python libraries (e.g. jupyter notebook, pyiceberg, matplotlib, etc.) - if we switch to the official Spark image, do we want to build our custom image on top of it and include those as well? They are not technically required, since we don't mention jupyter notebook on our quickstart page or in README.md.
2. A few notebooks published by Tabular earlier for simulating various behaviors with Iceberg - if we still want to use those notebooks, we would need to maintain our own copy of them or write new ones.
3. An IPython line/cell magic used in the jupyter notebooks - same as above, only needed if we want to keep the same set of notebooks.
4. IJava and Scala kernels for jupyter notebook - same as above, only needed if we want to keep the same set of notebooks.
5. A couple of parquet files used as dummy datasets - same as above, only needed if we want to keep the same set of notebooks.
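For context, my rough understanding of the suggested approach is something like the sketch below, where the image tag, Iceberg runtime version, catalog name, and REST endpoint are illustrative assumptions rather than what we would actually pin:

```shell
# Rough sketch (untested): running spark-sql from the official Apache Spark
# image, loading Iceberg via --packages (i.e. spark.jars.packages).
# The image tag, runtime version, catalog name ("demo"), and REST URI below
# are illustrative assumptions, not pinned choices.
docker run --rm -it apache/spark:3.5.3 /opt/spark/bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.demo.type=rest \
  --conf spark.sql.catalog.demo.uri=http://iceberg-rest:8181
```

If something along these lines works, the quickstart's docker-compose could carry the same settings in SPARK_DEFAULTS, and no custom image would be needed for the Spark side.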
That being said, if I take this up, should we remove the dependency on jupyter notebook and those extra notebooks/kernels/libs, etc.?

On 2024/12/10 11:29:06 Fokko Driesprong wrote:
> Yes, that's exactly my motivation (sorry for not stating this explicitly
> earlier). Looking at the fact that the quickstart is currently outdated, I
> would be reluctant to introduce additional Docker images and/or
> repositories, since we need to update those as well.
>
> Kind regards,
> Fokko
>
> Op di 10 dec 2024 om 11:48 schreef Ajantha Bhat <ajanthab...@gmail.com>:
>
> > That's a good suggestion Fokko.
> > It would avoid maintaining one more docker image. We can update the
> > quickstart to use the docker image provided by Spark.
> >
> > - Ajantha
> >
> > On Tue, Dec 10, 2024 at 4:08 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> >> Hey Ajantha,
> >>
> >> Thanks for bringing this up, we should both remove the vendor reference
> >> and bring this back up to date. My preference would be to rely on the Spark
> >> image <https://hub.docker.com/r/apache/spark> provided by the Apache
> >> Spark project, similar to what we do for the Hive
> >> <https://iceberg.apache.org/hive-quickstart/> quickstart. We should be
> >> able to load all the Iceberg-specific JARs through the
> >> spark.jars.packages configuration
> >> <https://spark.apache.org/docs/3.5.1/configuration.html>.
> >>
> >> Kind regards,
> >> Fokko
> >>
> >> Op di 10 dec 2024 om 11:16 schreef Ajantha Bhat <ajanthab...@gmail.com>:
> >>
> >>> The quickstart <https://iceberg.apache.org/spark-quickstart/> page is a
> >>> critical touchpoint for new users and plays a key role in driving project
> >>> adoption.
> >>> Currently, it references *tabulario/spark-iceberg* and
> >>> *tabulario/iceberg-rest*.
> >>>
> >>> We've already replaced *tabulario/iceberg-rest* with the
> >>> community-maintained Docker image, *apache/iceberg-rest-fixture*, based
> >>> on the REST TCK fixture.
> >>>
> >>> However, *tabulario/spark-iceberg* seems outdated and doesn't use the
> >>> latest Iceberg version.
> >>> To enhance the user experience and keep the quickstart aligned with
> >>> project standards, I suggest hosting it either under the /docker folder in
> >>> the Iceberg repository,
> >>> or as a subproject called *apache/iceberg-playground* where users can
> >>> contribute to maintain other docker images.
> >>>
> >>> The quickstart page should ideally reference images maintained by the
> >>> community rather than vendor-specific open-source projects.
> >>>
> >>> Thoughts?
> >>>
> >>> - Ajantha