paleolimbot commented on issue #380:
URL: https://github.com/apache/sedona-db/issues/380#issuecomment-3590201372

   I think the immediate issue is that the `dyn PhysicalExpr` filter we are 
passed has the correct column *name* but an incorrect column *index*. This 
happens for `SUM()`, `count()`, and some types of projections where the source 
column index (here, 2) is not a valid index in the final schema. In my testing, 
any final schema with >= 3 columns executes almost instantly without any 
changes. I'm not sure whether this is a DataFusion bug, but I will investigate 
and file something if it is.
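
To make the failure mode concrete, here is a minimal Python sketch. The real code is Rust (`python/sedonadb/src/datasource.rs`); `bbox_for_pushdown` and its arguments are hypothetical names, and the logic is an assumption inferred from the debug output in this comment, not the actual implementation. It shows how trusting the filter's column index yields an unbounded box when that index is not one of the file's geometry column indices.

```python
import math

# Hypothetical stand-in for "no spatial restriction" (xmin, ymin, xmax, ymax).
UNBOUNDED = (-math.inf, -math.inf, math.inf, math.inf)


def bbox_for_pushdown(filter_column_index, filter_bbox, geometry_column_indices):
    """Return the bbox to push down, trusting the filter's column index.

    Assumption: a filter only counts as spatial if its column index is one
    of the file's geometry column indices; otherwise nothing is pushed down.
    """
    if filter_column_index in geometry_column_indices:
        return filter_bbox
    return UNBOUNDED


query_bbox = (-73.978329, 40.767412, -73.950005, 40.795098)

# .show(): the filter carries index 2, which matches the file schema.
show_bbox = bbox_for_pushdown(2, query_bbox, [2])

# .count(): the filter carries index 0, so the box degrades to unbounded
# and no bbox reaches the reader (bbox=None in the log).
count_bbox = bbox_for_pushdown(0, query_bbox, [2])
```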
   
   In the meantime, I'll update this to look up fields by name instead of by 
index, since the names seem to be propagated correctly.
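
A name-based lookup could look roughly like the following Python sketch. The function name `geometry_index_by_name` and the list-of-names schema are illustrative assumptions; the actual change would be made in the Rust data source.

```python
def geometry_index_by_name(column_name, file_field_names, geometry_column_indices):
    """Resolve a filter column against the file schema by name.

    Returns the file-schema index if the named field is a geometry column,
    or None if the name is unknown or not a geometry column.
    """
    try:
        index = file_field_names.index(column_name)
    except ValueError:
        return None
    return index if index in geometry_column_indices else None


file_field_names = ["population", "GISJOIN", "wkb_geometry"]

# The index carried by the filter (2 or 0) is ignored; only the name matters.
resolved = geometry_index_by_name("wkb_geometry", file_field_names, [2])
not_geometry = geometry_index_by_name("population", file_field_names, [2])
missing = geometry_index_by_name("no_such_column", file_field_names, [2])
```

One caveat of matching by name is that it assumes field names are unique in the file schema; `list.index` returns only the first match.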
   
   <details>
   
   Query where the filter carries the expected column index (2) and the bounding box is pushed down:
   
   ```
   >>> sd.sql(f"""
   ... SELECT population, "GISJOIN", population::INTEGER AS foofy, "GISJOIN" as foofy2 FROM population_areas
   ... WHERE ST_Intersects(wkb_geometry, ST_SetSRID(ST_GeomFromWKT('{wkt}'), 4326))
   ... """).show()
   filter = PyFilter { inner: ScalarFunctionExpr { fun: "<FUNC>", name: 
"st_intersects", args: [Column { name: "wkb_geometry", index: 2 }, Literal { 
value: 
Binary("1,3,0,0,0,1,0,0,0,5,0,0,0,158,238,60,241,156,126,82,192,107,71,113,142,58,98,68,64,86,130,197,225,204,124,82,192,107,71,113,142,58,98,68,64,86,130,197,225,204,124,82,192,185,142,113,197,197,101,68,64,158,238,60,241,156,126,82,192,185,142,113,197,197,101,68,64,158,238,60,241,156,126,82,192,107,71,113,142,58,98,68,64"),
 field: Field { name: "lit", data_type: Binary, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {"ARROW:extension:metadata": 
"{\"crs\":\"EPSG:4326\"}", "ARROW:extension:name": "geoarrow.wkb"} } }], 
return_field: Field { name: "", data_type: Boolean, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} } } }
   file_schema = SedonaSchema with 3 fields:
     population: utf8<Utf8>
     GISJOIN: utf8<Utf8>
     wkb_geometry: geometry<Wkb(epsg:4326)>
   geometry_column_indices = [2]
   [python/sedonadb/src/datasource.rs:288:9] &filter = Intersects(
       Column {
           name: "wkb_geometry",
           index: 2,
       },
       BoundingBox {
           x: WraparoundInterval {
               inner: Interval {
                   lo: -73.978329,
                   hi: -73.950005,
               },
           },
           y: Interval {
               lo: 40.767412,
               hi: 40.795098,
           },
           z: None,
           m: None,
       },
   )
   [python/sedonadb/src/datasource.rs:290:9] &filter_bbox = BoundingBox {
       x: WraparoundInterval {
           inner: Interval {
               lo: -73.978329,
               hi: -73.950005,
           },
       },
       y: Interval {
           lo: 40.767412,
           hi: 40.795098,
       },
       z: None,
       m: None,
   }
   
   pyogrio.raw.ogr_open_arrow('/vsicurl/https://flatgeobuf.septima.dk/population_areas.fgb', {}, columns=None, batch_size=8192, bbox=(-73.978329, 40.767412, -73.950005, 40.795098)),
   About to call next()
   About to call next()
   ┌────────────┬────────────────────┬───────┬────────────────────┐
   │ population ┆       GISJOIN      ┆ foofy ┆       foofy2       │
   │    utf8    ┆        utf8        ┆ int32 ┆        utf8        │
   ╞════════════╪════════════════════╪═══════╪════════════════════╡
   ```
   
   Same query, calling `.count()`:
   
   ```
   >>> sd.sql(f"""
   ... SELECT population, "GISJOIN", population::INTEGER AS foofy, "GISJOIN" as foofy2 FROM population_areas
   ... WHERE ST_Intersects(wkb_geometry, ST_SetSRID(ST_GeomFromWKT('{wkt}'), 4326))
   ... """).count()
   filter = PyFilter { inner: ScalarFunctionExpr { fun: "<FUNC>", name: 
"st_intersects", args: [Column { name: "wkb_geometry", index: 0 }, Literal { 
value: 
Binary("1,3,0,0,0,1,0,0,0,5,0,0,0,158,238,60,241,156,126,82,192,107,71,113,142,58,98,68,64,86,130,197,225,204,124,82,192,107,71,113,142,58,98,68,64,86,130,197,225,204,124,82,192,185,142,113,197,197,101,68,64,158,238,60,241,156,126,82,192,185,142,113,197,197,101,68,64,158,238,60,241,156,126,82,192,107,71,113,142,58,98,68,64"),
 field: Field { name: "lit", data_type: Binary, nullable: false, dict_id: 0, 
dict_is_ordered: false, metadata: {"ARROW:extension:name": "geoarrow.wkb", 
"ARROW:extension:metadata": "{\"crs\":\"EPSG:4326\"}"} } }], return_field: 
Field { name: "", data_type: Boolean, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} } } }
   file_schema = SedonaSchema with 3 fields:
     population: utf8<Utf8>
     GISJOIN: utf8<Utf8>
     wkb_geometry: geometry<Wkb(epsg:4326)>
   geometry_column_indices = [2]
   [python/sedonadb/src/datasource.rs:288:9] &filter = Intersects(
       Column {
           name: "wkb_geometry",
           index: 0,
       },
       BoundingBox {
           x: WraparoundInterval {
               inner: Interval {
                   lo: -73.978329,
                   hi: -73.950005,
               },
           },
           y: Interval {
               lo: 40.767412,
               hi: 40.795098,
           },
           z: None,
           m: None,
       },
   )
   [python/sedonadb/src/datasource.rs:290:9] &filter_bbox = BoundingBox {
       x: WraparoundInterval {
           inner: Interval {
               lo: -inf,
               hi: inf,
           },
       },
       y: Interval {
           lo: -inf,
           hi: inf,
       },
       z: None,
       m: None,
   }
   
   pyogrio.raw.ogr_open_arrow('/vsicurl/https://flatgeobuf.septima.dk/population_areas.fgb', {}, columns=['wkb_geometry'], batch_size=8192, bbox=None),
   About to call next()
   Waiting for GIL from sleep()
   About to call next()
   Waiting for GIL from sleep()
   About to call next()
   Waiting for GIL from sleep()
   About to call next()
   ^CWaiting for GIL from sleep()
   ```
   
   </details>
   
   Of course, this doesn't preclude other weirdness that may be occurring but 
the cases you've noted above seem to be consistent with the issue?



To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to