> Where I am struggling a little bit is to understand at what level those
> compute functions should be implemented. As far as I can tell, when I load
> a dictionary-encoded Arrow array into a pandas data frame or make a query
> using DataFusion, the user can then just operate as if they are working
> directly with a string array. Is that implemented in the arrow libraries,
> or does each "application" (pandas, DataFusion, etc.) have their own
> implementation?

It is generally implemented by the Arrow libraries (for example, pandas uses
pyarrow, which wraps the C++ Apache Arrow implementation, and DataFusion
uses arrow-rs, the Rust Apache Arrow implementation).

On Mon, Jan 8, 2024 at 12:22 PM Morrison-Reed Elliot (BEG/PJ-EDS-NA)
<elliot.morrison-r...@us.bosch.com.invalid> wrote:

> Thanks for the hint.
>
> After reading through the geoarrow spec, I think I agree that this is
> probably the best approach.
>
> As far as I can tell, all that is required is a standardized set of
> metadata tags and then some well-implemented compute functions that can
> easily project the raw values to their physical interpretations.
>
> Where I am struggling a little bit is to understand at what level those
> compute functions should be implemented. As far as I can tell, when I load
> a dictionary-encoded Arrow array into a pandas data frame or make a query
> using DataFusion, the user can then just operate as if they are working
> directly with a string array. Is that implemented in the arrow libraries,
> or does each "application" (pandas, DataFusion, etc.) have their own
> implementation?
>
> Best regards,
> Elliot Morrison-Reed
>
> -----Original Message-----
> From: Andrew Lamb <al...@influxdata.com>
> Sent: Saturday, January 6, 2024 8:22 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Linear Formula Types
>
> Hi Elliot,
>
> Given your description, I agree extension types sound like they may be a
> good idea, similar to geoarrow[1] for geospatial data, where extra
> metadata[2] is needed to interpret the underlying types (e.g. factor and
> offset).
>
> Andrew
>
> [1] https://github.com/geoarrow/geoarrow
> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html#geoarrow
>
> On Sat, Jan 6, 2024 at 3:20 AM Morrison-Reed Elliot (BEG/PJ-EDS-NA)
> <elliot.morrison-r...@us.bosch.com.invalid> wrote:
>
> > Background
> >
> > I have been looking into using parquet files for storing and working
> > with automotive data. One interesting thing about automotive data is
> > that most communication happens on the CAN bus where we have extremely
> > limited bandwidth.
> > In order to encode "physical" values in a very space efficient way, we
> > use linear conversion formulas that look like "phys = (raw * factor) +
> > offset".
> > This gives implicit range and resolution limits, but that is often
> > just fine when we are representing a physical property.
> >
> > Example 1:
> >
> > We have a throttle that can be anywhere from 0-100% and we want to fit
> > that value into 1 byte. So we would use a formula like:
> >
> >     phys = (raw * 0.39215) + 0
> >
> > Example 2:
> >
> > We want to record ambient temperature of the vehicle. Resolution of 1
> > degree is fine. Also, temperatures below -40 and above 215 degrees C
> > are not particularly useful as they are very rare and out of scope for
> > a useful temperature.
> >
> >     phys = (raw * 1.0) - 40
> >
> > So far, I have been converting the raw data into floating point data
> > before writing to arrow format to make it easier for the analysts to
> > use the data. This of course means that I am converting to a less
> > efficient format and I am also losing inherent information about the
> > raw signal. I would rather be able to store the raw data in an
> > appropriately sized unsigned integer and automatically convert to
> > floating point when using the data, similar to dictionary encoding.
> >
> > Discussion
> >
> > - How would people generally deal with this situation using the arrow
> > format?
> > - Is this something that other people are interested in?
> > - If this were to be added to the spec, what would be the best way to
> > do it?
> >
> > While I am coming from an automotive perspective, I think there are
> > many other areas of applicability (reading sensor data through an ADC,
> > industrial automation and monitoring, etc.)
> >
> > I could see this working as either a new primitive type (similar to
> > decimal), or as an extension where we simply put the factor and offset
> > as standard metadata fields.
> >
> > Best regards,
> > Elliot Morrison-Reed
> >
> >
>
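
For readers of this thread, the conversion "phys = (raw * factor) + offset"
from the original message can be sketched in plain Python as below. This is
only an illustration under stated assumptions: the LinearFormula class and
its to_phys/to_raw methods are hypothetical names invented here, not part of
Arrow, pyarrow, or arrow-rs, and rounding to the nearest step and clamping
to the unsigned range are assumed behaviours that the thread does not
specify.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearFormula:
    """Hypothetical helper for phys = (raw * factor) + offset."""

    factor: float
    offset: float

    def to_phys(self, raw: int) -> float:
        # Project a raw integer to its physical value.
        return raw * self.factor + self.offset

    def to_raw(self, phys: float, bits: int = 8) -> int:
        # Assumption: round to the nearest raw step, then clamp to the
        # range of an unsigned integer with the given bit width.
        raw = round((phys - self.offset) / self.factor)
        return max(0, min(raw, (1 << bits) - 1))


# Example 1 from the thread: throttle, 0-100 % in one byte.
throttle = LinearFormula(factor=0.39215, offset=0.0)

# Example 2 from the thread: ambient temperature, -40 to 215 degrees C.
temperature = LinearFormula(factor=1.0, offset=-40.0)

# Raw samples stay in a compact unsigned-byte array; the floating point
# values are only produced when the data is read.
raw_temps = array("B", [0, 65, 255])
phys_temps = [temperature.to_phys(r) for r in raw_temps]
```

Keeping the factor and offset next to the raw column, rather than baking
them into stored floats, is what lets the raw signal survive a round trip.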
