Hi Elliot,

Given your description, I agree that extension types sound like a good
fit. This is similar to geoarrow[1] for geospatial data, where extra
metadata[2] is needed to interpret the underlying types (in your case,
the factor and offset).

Andrew

[1] https://github.com/geoarrow/geoarrow
[2] https://arrow.apache.org/docs/format/CanonicalExtensions.html#geoarrow

On Sat, Jan 6, 2024 at 3:20 AM Morrison-Reed Elliot (BEG/PJ-EDS-NA)
<elliot.morrison-r...@us.bosch.com.invalid> wrote:

> Background
>
> I have been looking into using parquet files for storing and working with
> automotive data. One interesting thing about automotive data is that most
> communication happens on the CAN bus where we have extremely limited
> bandwidth.
> In order to encode "physical" values in a very space efficient way, we
> use linear conversion formulas that look like "phys = (raw * factor) +
> offset".
> This gives implicit range and resolution limits, but that is often just
> fine when we are representing a physical property.
>
> Example 1:
>
> We have a throttle that can be anywhere from 0-100% and we want to fit that
> value into 1 byte. So we would use a formula like:
>
> phys = (raw * 0.39215) + 0
>
> Example 2:
>
> We want to record ambient temperature of the vehicle. Resolution of 1
> degree is fine. Also, temperatures below -40 and above 215 degrees C are
> not particularly useful as they are very rare and out of scope for a
> useful temperature.
>
> phys = (raw * 1.0) - 40
>
> So far, I have been converting the raw data into floating point data before
> writing to arrow format to make it easier for the analysts to use the
> data. This of course means that I am converting to a less efficient format
> and I am also losing inherent information about the raw signal. I would
> rather be able to store the raw data in an appropriately sized unsigned
> integer and automatically convert to floating point when using the data,
> similar to dictionary encoding.
>
> Discussion
>
> - How would people generally deal with this situation using the arrow
>   format?
> - Is this something that other people are interested in?
> - If this were to be added to the spec, what would be the best way to do
>   it?
>
> While I am coming from an automotive perspective, I think there are many
> other areas of applicability (reading sensor data through an ADC,
> industrial automation and monitoring, etc.)
>
> I could see this working as either a new primitive type (similar to
> decimal), or as an extension where we simply put the factor and offset as
> standard metadata fields.
>
> Best regards,
> Elliot Morrison-Reed
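
For illustration only, here is a minimal pure-Python sketch of the linear
conversion quoted above, "phys = (raw * factor) + offset", with the factor
and offset serialized as metadata next to 1-byte raw values. The name
LinearScale, its methods, and the JSON metadata layout are hypothetical and
are not part of Arrow or any Arrow library. An actual solution would more
likely be an Arrow extension type over unsigned integer storage that carries
the same two parameters in its serialized metadata.

```python
import json
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearScale:
    """Hypothetical helper for phys = (raw * factor) + offset."""

    factor: float
    offset: float

    def to_metadata(self) -> bytes:
        # Assumed layout: a small JSON blob, similar in spirit to the
        # serialized metadata an extension type could carry.
        return json.dumps({"factor": self.factor, "offset": self.offset}).encode()

    @classmethod
    def from_metadata(cls, blob: bytes) -> "LinearScale":
        fields = json.loads(blob.decode())
        return cls(fields["factor"], fields["offset"])

    def encode(self, phys, max_raw=255):
        # Invert the formula, round to the nearest step, and clamp to the
        # range of a single unsigned byte.
        raw = (round((p - self.offset) / self.factor) for p in phys)
        return array("B", (min(max(r, 0), max_raw) for r in raw))

    def decode(self, raw):
        # Apply phys = (raw * factor) + offset to every stored value.
        return [r * self.factor + self.offset for r in raw]


# Example 2 from the quoted message: ambient temperature, 1 degree steps.
temperature = LinearScale(factor=1.0, offset=-40.0)
raw_temps = temperature.encode([-40.0, 21.0, 215.0, 300.0])
phys_temps = temperature.decode(raw_temps)

# Example 1 from the quoted message: throttle position in one byte.
throttle = LinearScale(factor=0.39215, offset=0.0)
raw_throttle = throttle.encode([0.0, 100.0])
```

Values outside the representable range are clamped in this sketch (300
degrees is stored as raw 255, which decodes to 215), which mirrors the
implicit range limit described in the quoted message. Whether to clamp,
raise an error, or store a null is a design choice that the original
message leaves open.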