timsaucer commented on code in PR #915:
URL: https://github.com/apache/datafusion-python/pull/915#discussion_r1798327582
##########
python/datafusion/dataframe.py:
##########
@@ -223,6 +223,30 @@ def limit(self, count: int, offset: int = 0) -> DataFrame:
"""
return DataFrame(self.df.limit(count, offset))
+ def head(self, n: int) -> DataFrame:
Review Comment:
Would it be helpful to have a default `n`?
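For illustration, a minimal sketch of what a default could look like. `ToyFrame` is a hypothetical stand-in over a plain list, not the datafusion class, and the default of 5 is an assumption chosen to match pandas' `head`.

```python
from __future__ import annotations


class ToyFrame:
    """Hypothetical stand-in for DataFrame; wraps a plain list of rows."""

    def __init__(self, rows: list) -> None:
        self.rows = list(rows)

    def limit(self, count: int, offset: int = 0) -> ToyFrame:
        # Same shape as the limit(count, offset) signature in the diff.
        return ToyFrame(self.rows[offset : offset + count])

    def head(self, n: int = 5) -> ToyFrame:
        # The default of 5 is an assumption, not something the PR specifies.
        return self.limit(n, 0)
```

With a default, callers could write `df.head()` for a quick look and still pass an explicit `n` when they need a specific size.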
##########
python/datafusion/dataframe.py:
##########
@@ -223,6 +223,30 @@ def limit(self, count: int, offset: int = 0) -> DataFrame:
"""
return DataFrame(self.df.limit(count, offset))
+ def head(self, n: int) -> DataFrame:
+ """Return a new :py:class:`DataFrame` with a limited number of rows.
+
+ Args:
+ n: Number of rows to take from the head of the DataFrame.
+
+ Returns:
+ DataFrame after limiting.
+ """
+ return DataFrame(self.df.limit(n, 0))
+
+ def tail(self, n: int) -> DataFrame:
+ """Return a new :py:class:`DataFrame` with a limited number of rows.
+
+ Be aware this could be potentially expensive due to the size of the frame.
+
Review Comment:
Is there a better way to do this? Maybe add something upstream if
necessary?
Thinking about it more, I'm not sure this operation is well defined. Just as
calling `limit` multiple times on a large dataframe can return different
results, I would expect multiple calls here to return different results.
If we do put this in, I would suggest adding more text to the description to
explain why this is an expensive operation: it performs a collect to
determine the size of the dataframe.
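The cost concern can be sketched roughly as follows, assuming `tail` is built on `limit` with an offset derived from the total row count (the quoted hunk does not show the implementation). `CountingFrame` and its `full_scans` counter are hypothetical and exist only to make the extra pass visible.

```python
from __future__ import annotations


class CountingFrame:
    """Hypothetical stand-in for a lazy DataFrame over an in-memory list."""

    def __init__(self, rows: list) -> None:
        self.rows = list(rows)
        self.full_scans = 0  # how many times the whole frame was executed

    def count(self) -> int:
        # On a real engine this would execute the plan; here we only record it.
        self.full_scans += 1
        return len(self.rows)

    def limit(self, count: int, offset: int = 0) -> CountingFrame:
        return CountingFrame(self.rows[offset : offset + count])

    def tail(self, n: int = 5) -> CountingFrame:
        # The offset depends on the total row count, hence the extra pass.
        offset = max(0, self.count() - n)
        return self.limit(n, offset)
```

In this sketch every call to `tail` pays for one full pass before the limit is applied, which is the behavior the docstring would need to spell out.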
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]