paleolimbot commented on code in PR #2971:
URL: https://github.com/apache/parquet-java/pull/2971#discussion_r2028083658


##########
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##########
@@ -128,6 +133,8 @@ public static WriterVersion fromString(String name) {
   private final Map<String, String> extraMetaData;
   private final ColumnProperty<Boolean> statistics;
   private final ColumnProperty<Boolean> sizeStatistics;
+  private final ColumnProperty<Boolean> geospatialStatistics;
+  private final ColumnProperty<Boolean> geospatialStatisticsBBoxWraparound;

Review Comment:
   Should these be removed now that wraparound is not included in this PR?



##########
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java:
##########
@@ -775,6 +811,33 @@ public static Statistics toParquetStatistics(
     return formatStats;
   }
 
+  private static BoundingBox 
toParquetBoundingBox(org.apache.parquet.column.statistics.geometry.BoundingBox 
bbox) {
+    BoundingBox formatBbox = new BoundingBox();
+
+    // Use the same NaN check pattern for all coordinates
+    if (!Double.isNaN(bbox.getXMin()) && !Double.isNaN(bbox.getXMax())) {
+      formatBbox.setXmin(bbox.getXMin());
+      formatBbox.setXmax(bbox.getXMax());
+    }
+
+    if (!Double.isNaN(bbox.getYMin()) && !Double.isNaN(bbox.getYMax())) {
+      formatBbox.setYmin(bbox.getYMin());
+      formatBbox.setYmax(bbox.getYMax());
+    }

Review Comment:
   Because `Xmin`, `Ymin`, `Xmax`, and `Ymax` aren't `optional` in Thrift, what 
happens if you don't set them here? I believe this is where we would choose 
what happens specifically for the empty x/y case (i.e., either `NaNs`, 
`Inf/-Inf`, or by setting `formatBbox` to `null` and omitting it from the 
output Thrift.



##########
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java:
##########
@@ -0,0 +1,257 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.statistics.geometry;
+
+import org.apache.parquet.Preconditions;
+import org.locationtech.jts.geom.Coordinate;
+import org.locationtech.jts.geom.Envelope;
+import org.locationtech.jts.geom.Geometry;
+
+public class BoundingBox {
+
+  private double xMin = Double.NaN;
+  private double xMax = Double.NaN;
+  private double yMin = Double.NaN;
+  private double yMax = Double.NaN;
+  private double zMin = Double.NaN;
+  private double zMax = Double.NaN;
+  private double mMin = Double.NaN;
+  private double mMax = Double.NaN;

Review Comment:
   I believe that @mkaravel's comment suggested that this would be cleaner in 
its original form (i.e., with the initial/empty state as `Inf/-Inf`, which 
simplified the `update` methods), although there's nothing incorrect about 
keeping the initial/empty state as `NaN` either if that works for everybody!



##########
parquet-column/src/test/java/org/apache/parquet/column/statistics/geometry/TestBoundingBox.java:
##########
@@ -0,0 +1,492 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.statistics.geometry;
+
+import org.junit.Assert;
+import org.junit.Test;
+import org.locationtech.jts.geom.Coordinate;
+import org.locationtech.jts.geom.CoordinateXYZM;
+import org.locationtech.jts.geom.GeometryFactory;
+import org.locationtech.jts.geom.Point;
+
+public class TestBoundingBox {
+
+  @Test
+  public void testUpdate() {
+    GeometryFactory geometryFactory = new GeometryFactory();
+    BoundingBox boundingBox = new BoundingBox();
+
+    // Create a 2D point
+    Point point2D = geometryFactory.createPoint(new Coordinate(10, 20));
+    boundingBox.update(point2D, "EPSG:4326");
+    Assert.assertEquals(10.0, boundingBox.getXMin(), 0.0);
+    Assert.assertEquals(10.0, boundingBox.getXMax(), 0.0);
+    Assert.assertEquals(20.0, boundingBox.getYMin(), 0.0);
+    Assert.assertEquals(20.0, boundingBox.getYMax(), 0.0);
+  }
+
+  @Test
+  public void testEmptyGeometry() {
+    GeometryFactory geometryFactory = new GeometryFactory();
+    BoundingBox boundingBox = new BoundingBox();
+
+    // Create an empty point
+    Point emptyPoint = geometryFactory.createPoint();
+    boundingBox.update(emptyPoint, "EPSG:4326");
+
+    // Empty geometry should retain the initial -Inf/Inf state

Review Comment:
   A number of these comments still refer to `-Inf/Inf` (but the values are now 
`NaN`)



##########
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeospatialStatistics.java:
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.statistics.geometry;
+
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType;
+import org.locationtech.jts.geom.Geometry;
+import org.locationtech.jts.io.ParseException;
+import org.locationtech.jts.io.WKBReader;
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of geospatial data written.
+ */
+public class GeospatialStatistics {
+
+  public static final String DEFAULT_GEOSPATIAL_STAT_CRS = "OGC:CRS84";
+
+  // Metadata that may impact the statistics calculation
+  private final String crs;
+  private final BoundingBox boundingBox;
+  private final EdgeInterpolationAlgorithm edgeAlgorithm;
+  private final GeospatialTypes geospatialTypes;
+
+  /**
+   * Whether the statistics has valid value.
+   *
+   * It is true by default. Only set to false while it fails to merge 
statistics.
+   */
+  private boolean valid = true;
+
+  /**
+   * Merge the statistics from another GeospatialStatistics object.
+   *
+   * @param other the other GeospatialStatistics object
+   */
+  public void mergeStatistics(GeospatialStatistics other) {
+    if (!valid) return;
+
+    if (other == null) {
+      return;
+    }
+    if (this.boundingBox != null && other.boundingBox != null) {
+      this.boundingBox.merge(other.boundingBox);
+    }
+    if (this.geospatialTypes != null && other.geospatialTypes != null) {
+      this.geospatialTypes.merge(other.geospatialTypes);
+    }
+  }
+
+  /**
+   * Builder to create a GeospatialStatistics.
+   */
+  public static class Builder {
+    private final String crs;
+    private BoundingBox boundingBox;
+    private GeospatialTypes geospatialTypes;
+    private EdgeInterpolationAlgorithm edgeAlgorithm;
+    private final WKBReader reader = new WKBReader();
+
+    /**
+     * Create a builder to create a GeospatialStatistics.
+     * For Geometry type, edgeAlgorithm is not required.
+     *
+     * @param crs the coordinate reference system
+     */
+    public Builder(String crs) {
+      this.crs = crs;
+      this.boundingBox = new BoundingBox();
+      this.geospatialTypes = new GeospatialTypes();
+      this.edgeAlgorithm = null;
+    }
+
+    /**
+     * Create a builder to create a GeospatialStatistics.
+     * For Geography type, optional edgeAlgorithm can be set.
+     *
+     * @param crs the coordinate reference system
+     */
+    public Builder(String crs, EdgeInterpolationAlgorithm edgeAlgorithm) {
+      this.crs = crs;
+      this.boundingBox = new BoundingBox();
+      this.geospatialTypes = new GeospatialTypes();
+      this.edgeAlgorithm = edgeAlgorithm;
+    }
+
+    public void update(Binary value) {
+      if (value == null) {
+        return;
+      }
+      try {
+        Geometry geom = reader.read(value.getBytes());
+        update(geom);
+      } catch (ParseException e) {
+        abort();
+      }
+    }
+
+    private void update(Geometry geom) {
+      boundingBox.update(geom, crs);
+      geospatialTypes.update(geom);
+    }
+
+    public void abort() {
+      boundingBox.abort();
+      geospatialTypes.abort();
+    }
+
+    /**
+     * Build a GeospatialStatistics from the builder.
+     *
+     * @return a new GeospatialStatistics object
+     */
+    public GeospatialStatistics build() {
+      return new GeospatialStatistics(crs, boundingBox, geospatialTypes, 
edgeAlgorithm);
+    }
+  }
+
+  /**
+   * Create a new GeospatialStatistics builder with the specified CRS.
+   *
+   * @param type the primitive type
+   * @return a new GeospatialStatistics builder
+   */
+  public static GeospatialStatistics.Builder newBuilder(PrimitiveType type) {
+    LogicalTypeAnnotation logicalTypeAnnotation = 
type.getLogicalTypeAnnotation();
+    if (logicalTypeAnnotation instanceof 
LogicalTypeAnnotation.GeometryLogicalTypeAnnotation) {
+      String crs = ((LogicalTypeAnnotation.GeometryLogicalTypeAnnotation) 
logicalTypeAnnotation).getCrs();
+      return new GeospatialStatistics.Builder(crs);
+    } else if (logicalTypeAnnotation instanceof 
LogicalTypeAnnotation.GeographyLogicalTypeAnnotation) {
+      String crs = ((LogicalTypeAnnotation.GeographyLogicalTypeAnnotation) 
logicalTypeAnnotation).getCrs();
+      EdgeInterpolationAlgorithm edgeAlgorithm =
+          ((LogicalTypeAnnotation.GeographyLogicalTypeAnnotation) 
logicalTypeAnnotation).getEdgeAlgorithm();
+      return new GeospatialStatistics.Builder(crs, edgeAlgorithm);
+    } else {
+      return noopBuilder();
+    }
+  }
+
+  /**
+   * Constructs a GeospatialStatistics object with the specified CRS, bounding 
box, and geospatial types.
+   *
+   * @param crs the coordinate reference system
+   * @param boundingBox the bounding box for the geospatial data
+   * @param geospatialTypes the geospatial types
+   */
+  public GeospatialStatistics(
+      String crs,
+      BoundingBox boundingBox,
+      GeospatialTypes geospatialTypes,
+      EdgeInterpolationAlgorithm edgeAlgorithm) {
+    this.crs = crs;
+    this.boundingBox = boundingBox;
+    this.geospatialTypes = geospatialTypes;
+    this.edgeAlgorithm = edgeAlgorithm;
+  }
+
+  /**
+   * Constructs a GeospatialStatistics object with the specified CRS.
+   *
+   * @param crs the coordinate reference system
+   */
+  public GeospatialStatistics(String crs) {
+    this(crs, new BoundingBox(), new GeospatialTypes(), null);
+  }
+
+  /**
+   * Constructs a GeospatialStatistics object with the specified CRS and edge 
interpolation algorithm.
+   *
+   * @param crs the coordinate reference system
+   * @param edgeAlgorithm the edge interpolation algorithm
+   */
+  public GeospatialStatistics(String crs, EdgeInterpolationAlgorithm 
edgeAlgorithm) {
+    this.crs = crs;
+    this.boundingBox = new BoundingBox();
+    this.geospatialTypes = new GeospatialTypes();
+    this.edgeAlgorithm = edgeAlgorithm;
+  }
+
+  /** Returns the coordinate reference system. */
+  public BoundingBox getBoundingBox() {
+    return boundingBox;
+  }
+
+  /** Returns the geometry types. */
+  public GeospatialTypes getGeospatialTypes() {
+    return geospatialTypes;
+  }
+
+  /**
+   * @return whether the statistics has valid value.
+   */
+  public boolean isValid() {
+    return valid;
+  }
+
+  public void merge(GeospatialStatistics other) {
+    if (!valid) return;
+    Preconditions.checkArgument(other != null, "Cannot merge with null 
GeometryStatistics");
+
+    if (boundingBox != null && other.boundingBox != null) {
+      boundingBox.merge(other.boundingBox);
+    }
+
+    if (geospatialTypes != null && other.geospatialTypes != null) {
+      geospatialTypes.merge(other.geospatialTypes);
+    }
+  }
+
+  // Copy the statistics
+  public GeospatialStatistics copy() {
+    return new GeospatialStatistics(
+        crs,
+        boundingBox != null ? boundingBox.copy() : null,
+        geospatialTypes != null ? geospatialTypes.copy() : null,
+        null);
+  }
+
+  @Override
+  public String toString() {
+    return "GeospatialStatistics{" + "boundingBox=" + boundingBox + ", 
coverings=" + geospatialTypes + '}';

Review Comment:
   ```suggestion
       return "GeospatialStatistics{" + "boundingBox=" + boundingBox + ", 
geospatialTypes=" + geospatialTypes + '}';
   ```
   
   Maybe?



##########
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeospatialTypes.java:
##########
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.statistics.geometry;
+
+import java.util.HashSet;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.parquet.Preconditions;
+import org.locationtech.jts.geom.Coordinate;
+import org.locationtech.jts.geom.Geometry;
+
+public class GeospatialTypes {
+
+  private static final int UNKNOWN_TYPE_ID = -1;
+  private Set<Integer> types = new HashSet<>();
+  private boolean valid = true;
+
+  public GeospatialTypes(Set<Integer> types) {
+    this.types = types;
+  }
+
+  public GeospatialTypes() {}
+
+  public Set<Integer> getTypes() {
+    return types;
+  }
+
+  void update(Geometry geometry) {
+    if (!valid) {
+      return;
+    }
+    int code = getGeometryTypeCode(geometry);
+    if (code != UNKNOWN_TYPE_ID) {
+      types.add(code);
+    } else {
+      valid = false;
+      types.clear();
+    }
+  }
+
+  public void merge(GeospatialTypes other) {
+    Preconditions.checkArgument(other != null, "Cannot merge with null 
GeospatialTypes");
+    if (!valid) {
+      return;
+    }
+    if (!other.valid) {
+      valid = false;
+      types.clear();
+      return;
+    }
+    types.addAll(other.types);
+  }
+
+  public void reset() {
+    types.clear();
+    valid = true;
+  }
+
+  public void abort() {
+    valid = false;
+    types.clear();
+  }
+
+  public GeospatialTypes copy() {
+    return new GeospatialTypes(new HashSet<>(types));
+  }
+
+  @Override
+  public String toString() {
+    return "GeospatialTypes{" + "types="
+        + types.stream().map(this::typeIdToString).collect(Collectors.toSet()) 
+ '}';
+  }
+
+  private int getGeometryTypeId(Geometry geometry) {
+    switch (geometry.getGeometryType()) {
+      case Geometry.TYPENAME_POINT:
+        return 1;
+      case Geometry.TYPENAME_LINESTRING:
+        return 2;
+      case Geometry.TYPENAME_POLYGON:
+        return 3;
+      case Geometry.TYPENAME_MULTIPOINT:
+        return 4;
+      case Geometry.TYPENAME_MULTILINESTRING:
+        return 5;
+      case Geometry.TYPENAME_MULTIPOLYGON:
+        return 6;
+      case Geometry.TYPENAME_GEOMETRYCOLLECTION:
+        return 7;
+      default:
+        return UNKNOWN_TYPE_ID;
+    }
+  }
+
+  /**
+   * Geospatial type codes:
+   *
+   * | Type               | XY   | XYZ  | XYM  | XYZM |
+   * | :----------------- | :--- | :--- | :--- | :--: |
+   * | Point              | 0001 | 1001 | 2001 | 3001 |
+   * | LineString         | 0002 | 1002 | 2002 | 3002 |
+   * | Polygon            | 0003 | 1003 | 2003 | 3003 |
+   * | MultiPoint         | 0004 | 1004 | 2004 | 3004 |
+   * | MultiLineString    | 0005 | 1005 | 2005 | 3005 |
+   * | MultiPolygon       | 0006 | 1006 | 2006 | 3006 |
+   * | GeometryCollection | 0007 | 1007 | 2007 | 3007 |
+   *
+   * See 
https://github.com/apache/parquet-format/blob/master/Geospatial.md#geospatial-types
+   */
+  private int getGeometryTypeCode(Geometry geometry) {
+    int typeId = getGeometryTypeId(geometry);
+    if (typeId == UNKNOWN_TYPE_ID) {
+      return UNKNOWN_TYPE_ID;
+    }
+    Coordinate[] coordinates = geometry.getCoordinates();
+    boolean hasZ = false;
+    boolean hasM = false;
+    for (Coordinate coordinate : coordinates) {

Review Comment:
   Would `CoordinateSequence.hasZ()` and/or `CoordinateSequence.hasM()` be 
useful here? (Or is there some way to extract this without looping over every 
coordinate?)
   
   Technically this will always return an EMPTY geometry as XY as written. I'm 
OK with this because other libraries also loose this type of information (e.g., 
GEOS will drop the Z/Mness of a MULTIxxx ZM) but worth noting.



##########
parquet-column/src/main/java/org/apache/parquet/column/page/PageWriter.java:
##########


Review Comment:
   Are the changes in this file intentional?



##########
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java:
##########
@@ -41,8 +41,13 @@
 import java.util.Set;
 import java.util.function.Supplier;
 import org.apache.parquet.Preconditions;
+import 
org.apache.parquet.column.statistics.geometry.EdgeInterpolationAlgorithm;
 
 public abstract class LogicalTypeAnnotation {
+
+  // TODO: Move this to an external configuration
+  public static final String DEFAULT_GEOMETRY_CRS = "OGC:CRS84";

Review Comment:
   Is this TODO still needed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to