paleolimbot commented on code in PR #176:
URL: https://github.com/apache/sedona-db/pull/176#discussion_r2400522763
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -151,6 +151,29 @@ def count(self) -> int:
         """
         return self._impl.count()
+    def __len__(self) -> int:
+        """Compute the number of rows in the DataFrame"""
+        return self.count()
+
+    @property
+    def columns(self) -> list[str]:
+        """Return the column names in the DataFrame"""
+        columns = list()
+        field_index = 0
+        while True:
+            try:
+                columns.append(self._impl.schema().field(field_index).name)
+                field_index += 1
+            except IndexError:
+                break
+
+        return columns
Review Comment:
Agreed...I mostly just didn't expose enough from the schema object when I
wrote it 😬
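For illustration only, here is a rough sketch of the difference. `FakeSchema` and `FakeField` are hypothetical stand-ins for whatever `self._impl.schema()` returns, and the `names` accessor is an assumption about what could be exposed, not something the current schema object is known to provide:

```python
from dataclasses import dataclass


@dataclass
class FakeField:
    """Hypothetical stand-in for a schema field."""

    name: str


class FakeSchema:
    """Hypothetical stand-in for the object returned by `self._impl.schema()`."""

    def __init__(self, names):
        self._names = list(names)

    def field(self, i):
        # Mirrors the behaviour the PR relies on: IndexError past the last field.
        return FakeField(self._names[i])

    @property
    def names(self):
        # Assumed accessor (not part of the existing schema object).
        return list(self._names)


def columns_by_probing(schema):
    """The approach in this PR: index fields until IndexError is raised."""
    columns = []
    field_index = 0
    while True:
        try:
            columns.append(schema.field(field_index).name)
            field_index += 1
        except IndexError:
            break
    return columns


def columns_from_names(schema):
    """What the property could collapse to if the names were exposed directly."""
    return list(schema.names)


schema = FakeSchema(["id", "geometry"])
probed = columns_by_probing(schema)
direct = columns_from_names(schema)
```

Both helpers return the same list here; the second just avoids using an exception for control flow.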
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -151,6 +151,29 @@ def count(self) -> int:
         """
         return self._impl.count()
+    def __len__(self) -> int:
+        """Compute the number of rows in the DataFrame"""
+        return self.count()
+
+    @property
+    def columns(self) -> list[str]:
+        """Return the column names in the DataFrame"""
+        columns = list()
+        field_index = 0
+        while True:
+            try:
+                columns.append(self._impl.schema().field(field_index).name)
+                field_index += 1
+            except IndexError:
+                break
+
+        return columns
+
+    @property
+    def shape(self) -> tuple[int, int]:
+        """Return the shape of the DataFrame as a tuple of integers corresponding to (rows, columns)"""
+        return self.count(), len(self.columns)
Review Comment:
Neither Ibis nor DuckDB implements a `.shape` accessor. Given that this would
also trigger execution, I don't think it's a good idea to include this 😬
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -151,6 +151,29 @@ def count(self) -> int:
         """
         return self._impl.count()
+    def __len__(self) -> int:
+        """Compute the number of rows in the DataFrame"""
+        return self.count()
Review Comment:
I wonder if we should include this or not. Our "DataFrame" hasn't been
materialized yet and might well take quite a long time to do so.
As a data point, Ibis implements `__len__()` but raises an error:
```
ExpressionError: Use .count() instead
```
Another data point...DuckDB implements `__len__()` and executes the query
like this implementation. I'm not sure what PySpark does here.
I would personally lean towards the Ibis approach (raise an error that forces
the user to call `.count()` explicitly).
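A minimal sketch of what the Ibis-style behaviour could look like. `LazyFrame` is a hypothetical stand-in rather than the actual sedonadb `DataFrame`, and `TypeError` is an assumed choice of exception (Ibis uses its own `ExpressionError`):

```python
class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated DataFrame."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.executions = 0  # how many times the "query" has been run

    def count(self) -> int:
        # Explicit and potentially expensive: this is where execution happens.
        self.executions += 1
        return len(self._rows)

    def __len__(self) -> int:
        # Refuse to execute implicitly; point the user at the explicit method.
        raise TypeError("len() would execute the query; use .count() instead")


df = LazyFrame([1, 2, 3])

try:
    len(df)
    message = None
except TypeError as e:
    message = str(e)

executions_after_len = df.executions  # len() did not trigger execution
n = df.count()  # the explicit call does
```

One thing to keep in mind with this approach: without a `__bool__` method, `bool(df)` falls back to `__len__()`, so `if df:` would raise as well.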