Re: [Discuss] If and how we should integrate geospatial data (specs) in Arrow

Julian Hyde Fri, 25 Jun 2021 11:51:23 -0700

Cc += geospatial@.

I think allowing WKB and WKT is sufficient.


Perhaps Geometry could be a composite type (WKT, SRID) or (WKB, SRID). SRID 
(spatial reference identifier) is almost always needed to qualify a geometry 
value. It is analogous to how TimeZone is needed (implicitly or explicitly) to 
qualify a DateTime value.
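
To make the composite idea concrete, here is a minimal sketch in plain Python (standard library only, not Arrow API). The names `Geometry`, `point_wkb` and `point_wkt` are hypothetical and only for illustration. The sketch assumes the simplest case, a little-endian 2D WKB Point, and shows that identical WKB bytes describe different geometries once the SRID differs:

```python
import struct
from dataclasses import dataclass

WKB_POINT = 1  # geometry type code for Point in the WKB encoding


@dataclass(frozen=True)
class Geometry:
    """Hypothetical composite value: a WKB payload qualified by an SRID."""
    wkb: bytes
    srid: int


def point_wkb(x: float, y: float) -> bytes:
    # Byte-order flag (1 = little-endian), uint32 geometry type,
    # then two float64 coordinates: 1 + 4 + 8 + 8 = 21 bytes.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def point_wkt(wkb: bytes) -> str:
    # Decode only the little-endian 2D Point case produced above.
    order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if order != 1 or geom_type != WKB_POINT:
        raise ValueError("only little-endian WKB points are handled in this sketch")
    return f"POINT ({x:g} {y:g})"


# Same coordinate bytes, different SRID: the two values are not equal.
a = Geometry(point_wkb(-70.65, -33.45), srid=4326)  # WGS 84 longitude/latitude
b = Geometry(point_wkb(-70.65, -33.45), srid=3857)  # Web Mercator
```

Here `a.wkb == b.wkb` holds while `a == b` does not, which is the same situation as two identical wall-clock DateTime values in different time zones.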

For geospatial queries to perform well, some kind of indexing (and/or 
clever data organization) is required. Geospatial indexing is very complex, 
and there is no “one size fits all” approach. So I recommend that Arrow stay 
out of the indexing business and leave indexing to the engine.

Julian


> On Jun 25, 2021, at 10:17 AM, Mauricio Vargas <mavarga...@uc.cl.INVALID> 
> wrote:
> 
> Dear Jon
> 
> Thanks for sending this. Based on previous projects, WKB works well with
> SQLite, DuckDB and others, at the expense of larger columns than with
> PostGIS.
> 
> For experimenting, it could be interesting to use the CENSO 2017
> shapefiles: https://github.com/ropensci/censo2017-cartografias;
> https://github.com/ropensci/censo2017-cartografias/releases/download/v0.4/cartografias-censo2017.zip
> These include rivers, streets, etc.
> 
> Given that Arrow can be installed in a very straightforward way (on
> Windows, at least), creating something based on PostGIS is probably not a
> bad idea, but WKB works OK and integrates without problems with the sf
> package. I see a clear compression advantage if we decide to use WKB, as
> LZ4 should make it very lightweight compared to, say, a CSV.
> 
> Best,
> 
> On Fri, Jun 25, 2021 at 1:05 PM Jonathan Keane <jke...@gmail.com> wrote:
> 
>> Hello,
>> 
>> There is an emerging spec[1] for how to store geospatial data in Arrow
>> + pass through parquet files in the geopandas world. There is even a
>> new R package that implements a wrapper to do the same in R[2]. These
>> both define a serialization[3] for storing geospatial data as an Arrow
>> table (and thus also when saving to parquet with Arrow).
>> 
>> I could see a number of ways that we might interact with standards
>> like these, and for any of these that we pursue it would be good to
>> clarify that in our docs:
>> 
>> 1. Point to the standard — we could mention that this standard exists
>> and that if someone is building a geospatial data aware application,
>> they _could_ refer to this standard if they want to.
>> 2. Adopt a/this standard — this could range from stating that we've
>> adopted it as the way that spatial data _ought_ to be stored to asking
>> the creators if maintaining it within the Arrow project itself would
>> be better (either by adopting it or creating a fork — of course
>> communication with the folks working on it now would be critical!)
>> 3. Create extension type(s) for geospatial data — this would require
>> adopting a standard like the one linked, but on top of that providing
>> an extension type within Arrow itself that the various clients could
>> implement as they saw fit.
>> 4. Create new, fully separate type(s) for geospatial data — again,
>> this would require adopting a standard of some sort, but we would
>> implement it as a specific type and presumably support it in all of
>> the clients as we could.
>> 
>> There are of course pros and cons to all of these. This type of data
>> *is* somewhat specialized and I don't think we want to have a huge
>> profusion of types for all of the possible specialized data types out
>> there. But, at a minimum we should acknowledge (or adopt) a standard
>> if it exists and encourage implementations that use Arrow to follow
>> that standard (like sfarrow does to be compatible with geopandas) so
>> that some level of interoperability is there + people aren't needing
>> to reinvent the wheel each time they store spatial data.
>> 
>> Thoughts? Are there other projects out there that already do something
>> like this with Arrow that we should consider?
>> 
>> [1] https://github.com/geopandas/geo-arrow-spec/pull/2
>> [2] https://github.com/wcjochem/sfarrow
>> [3] for now they create a binary WKB column + attach a bit of metadata
>> to the schema that that's what happened, though there are other ways
>> one could encode this and the spec might include other way(s) to store
>> this data in the future.
>> 
>> -Jon
>> 
> 
> 
> -- 
> —
> *Mauricio 'Pachá' Vargas Sepúlveda*
> Site: pacha.dev
> Blog: pacha.dev/blog
