Amazing guidance, thanks so much, Even. You've answered a lot of hanging questions that I didn't know where to ask. I'll be exploring all of this.
Cheers, Mike

On Wed, Jul 24, 2024 at 7:59 PM Even Rouault <even.roua...@spatialys.com> wrote:

> Michael,
>
> I don't think this would be a frmts/raw driver, but rather a /vsikerchunk
> virtual file system that you would combine with the Zarr driver.
>
> So you would open a dataset with "/vsikerchunk/{path/to.json}", and the
> ZARR driver would then issue a ReadDir() operation on
> /vsikerchunk/{path/to.json}, which would return the top-level keys of the
> JSON. Then the Zarr driver would issue an Open() operation on
> "/vsikerchunk/{path/to.json}/.zmetadata", and so on. The Zarr driver could
> be essentially unmodified. This is, I believe, essentially how the Python
> implementation works when combining the Kerchunk-specific part with the
> Python Zarr module (except that it passes file system objects rather than
> strings).
>
> Where things don't get pretty is for big datasets, where that JSON file
> can become so big that parsing it and holding it in memory becomes an
> annoyance. They have apparently come to using a hierarchy of Parquet files
> to store the references to the blocks:
> https://fsspec.github.io/kerchunk/spec.html#parquet-references . That's
> becoming a bit messy. Should be implementable though.
>
> There are also subtleties in Kerchunk v1 with jinja substitution and
> generators of keys (all tricks to decrease the size of the JSON) that
> would complicate an implementation.
>
> On Kerchunk itself, I don't have any experience, but I feel there might be
> limitations to what it can handle due to the underlying raster formats.
> For example, if you have a GeoTIFF file using JPEG compression, with the
> quantization tables stored in the TIFF JpegTables tag (i.e. shared by all
> tiles), which is the formulation that GDAL would use by default on
> creation, then I don't see how Kerchunk can deal with that, since those
> would be two distinct chunks in the file, and the recombination is
> slightly more complicated than just appending them together before passing
> them to a JPEG codec. Similarly, if you wanted to Kerchunk a GeoPackage
> raster, you couldn't, because a single tile in SQLite3 generally spans
> multiple SQLite3 pages (of size 4096), with a few "header" bytes at the
> beginning of each tile. For GRIB2, there are certainly limitations for
> some formulations, because some GRIB2 encodings for arrays are really
> particular. It presumably works only with the simplest raw encoding.
>
> Kerchunk can potentially do virtual tiling, but I believe that all tiles
> must have the same dimensions, and their internal tiling must be a
> multiple of that dimension, so that you can create a Zarr-compatible
> representation of them.
>
> And obviously one strong assumption of Kerchunk is that the files
> referenced by a Kerchunk index are immutable. If for some reason tiles are
> moved internally because of updates, chaos will arise due to the
> (offset, size) tuples being out of sync.
>
> Even
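For my own notes while I dig into this: below is a rough Python sketch of
the lookup that a hypothetical /vsikerchunk handler (or fsspec's existing
"reference" filesystem) would have to do for each Zarr key, assuming a
Kerchunk v1 JSON and ignoring the jinja templates and key generators Even
mentions. The file and array names are made up.

import base64
import json

import fsspec  # used here only for ranged reads from local or remote URLs


def read_chunk(refs_path, key):
    """Return the raw (still compressed) bytes behind a Zarr key, e.g. 'temperature/0.0.0'."""
    with open(refs_path) as f:
        refs = json.load(f)["refs"]  # v1 layout: {"version": 1, "refs": {...}}

    entry = refs[key]
    if isinstance(entry, str):
        # Inline content: .zgroup/.zarray/.zmetadata JSON, or small base64 chunks.
        if entry.startswith("base64:"):
            return base64.b64decode(entry[len("base64:"):])
        return entry.encode()

    # Otherwise a reference: [url] for a whole file, or [url, offset, length].
    url = entry[0]
    offset = entry[1] if len(entry) > 1 else 0
    length = entry[2] if len(entry) > 2 else None
    with fsspec.open(url, "rb") as f:
        f.seek(offset)
        return f.read() if length is None else f.read(length)


# e.g. one chunk of a hypothetical "temperature" array:
# blob = read_chunk("refs.json", "temperature/0.0.0")

A /vsikerchunk implementation would presumably hide exactly this
(url, offset, length) resolution behind ordinary VSI reads, so the Zarr
driver never needs to know the difference.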
> On 24/07/2024 at 00:37, Michael Sumner via gdal-dev wrote:
>
> Hi, is there any effort or thought into something like Python's kerchunk
> in GDAL? (my summary of kerchunk is below)
>
> https://github.com/fsspec/kerchunk
>
> I'll be exploring the python outputs in detail and looking for hooks into
> where we might bring some of this more tightly into GDAL. This would work
> nicely inside the GTI driver, for example. But, a *kerchunk-driver*? That
> would be in the family of raw/ drivers; my skillset won't have much to
> offer, but I'm going to explore with some simpler examples. It could even
> bring old HDF4 files into the fold, I think.
>
> It's a bit weird from a GDAL perspective to map the chunks in a format for
> which we have a driver, but there are definitely performance advantages,
> and convenience for virtualizing huge disparate collections (even the
> simplest time-series-of-files in netcdf is nicely abstracted here for
> xarray; a super-charged VRT, in effect).
>
> Interested in any thoughts, feedback, pointers to related efforts ...
> thanks!
>
> (my take on) A description of kerchunk:
>
> kerchunk replaces the actual binary blobs on file in a Zarr with JSON
> references to a file/uri/object and the byte start and end values. In this
> way kerchunk brings formats like hdf/netcdf/grib into the fold of "cloud
> readiness" by having a complete separation of metadata from the actual
> storage. The information about those chunks (compression, type,
> orientation, etc.) is stored in JSON also.
>
> (A Zarr is a multidimensional version of a single-zoom-level image tiling:
> imagine every image tile as a potentially n-dimensional child block of a
> larger array. The blobs are stored like one zoom of a z/y/x tile server,
> in a [[[v/]w/]y/]x way, with a position for each dimension of the array
> (1, 2, 3, 4, or n), where z is not special, and with more general encoding
> possibilities than tif/png/jpeg provide.) This scheme is extremely
> general, literally a virtualized array-like abstraction on any storage,
> and with kerchunk you can transcend many legacy issues with actual
> formats.
>
> Cheers, Mike
>
> --
> Michael Sumner
> Research Software Engineer
> Australian Antarctic Division
> Hobart, Australia
> e-mail: mdsum...@gmail.com
>
> --
> http://www.spatialys.com
> My software is free, but my time generally not.
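Adding below the quote for anyone following along: the existing Python
route for the time-series-of-files case looks roughly like the sketch
below. This is the "super-charged VRT for xarray" workflow described
above; the file, variable, and dimension names are hypothetical.

# Sketch of the kerchunk workflow for a hypothetical stack of daily NetCDF
# files on the same grid, where only the "time" dimension differs.
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

urls = [f"https://example.com/sst_2024-07-{d:02d}.nc" for d in range(1, 8)]

# One reference set per file: Zarr metadata plus (url, offset, length) entries.
refs = []
for u in urls:
    with fsspec.open(u, "rb") as f:
        refs.append(SingleHdf5ToZarr(f, u).translate())

# Combine along time into a single virtual dataset (the "VRT-like" step).
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# xarray reads the result through fsspec's "reference" filesystem as if it
# were an ordinary Zarr store.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": combined, "remote_protocol": "https"},
    },
)

The GDAL equivalent would presumably be that same combined reference set
opened through the Zarr driver via the /vsikerchunk idea Even sketches
above.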
--
Michael Sumner
Research Software Engineer
Australian Antarctic Division
Hobart, Australia
e-mail: mdsum...@gmail.com

_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev