[ 
https://issues.apache.org/jira/browse/SPARK-50490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927476#comment-17927476
 ] 

Snehal Bhatnagar commented on SPARK-50490:
------------------------------------------

[~faucct], I plan to start contributing here. Is there any way I can help?

> DataFrame.toPandas() and DataFrame.mapInPandas() have different data types
> --------------------------------------------------------------------------
>
>                 Key: SPARK-50490
>                 URL: https://issues.apache.org/jira/browse/SPARK-50490
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.3
>            Reporter: Nikita Sokolov
>            Priority: Major
>
> With .toPandas(), the values of the array column ints are Python lists
> ([0, 1]), but with .mapInPandas() they are numpy.ndarray objects
> (array([0, 1])):
> {code:python}
> from pandas import DataFrame
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.types import StringType, StructType
>
>
> def stringify(data_frame: DataFrame) -> DataFrame:
>     # Replace every cell with a repr of its Python type and value.
>     return data_frame.map(lambda record: repr((type(record), record)))
>
>
> spark = SparkSession.builder.getOrCreate()
> # One row holding a single array column named "ints".
> spark_data_frame = spark.createDataFrame([Row(ints=[row, row + 1]) for row in range(1)])
> print(
>     stringify(spark_data_frame.toPandas()),
>     spark_data_frame.mapInPandas(
>         lambda data_frames: (stringify(data_frame) for data_frame in data_frames),
>         schema=StructType().add('ints', StringType()),
>     ).toPandas(),
> )
> {code}
> {noformat}
>                        ints
> 0  (<class 'list'>, [0, 1])
>                        ints
> 0  (<class 'numpy.ndarray'>, array([0, 1]))
> {noformat}
> I noticed this while trying to sort the DataFrame by this column. pandas
> fails to sort such a DataFrame inside .mapInPandas(), but succeeds after
> .toPandas():
> {code:python}
> from pandas import DataFrame
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.types import StringType, StructType
>
>
> def stringify(data_frame: DataFrame) -> DataFrame:
>     # Sort by the array column first, then replace each cell with a repr
>     # of its Python type and value.
>     return data_frame.sort_values('ints').map(lambda record: repr((type(record), record)))
>
>
> spark = SparkSession.builder.getOrCreate()
> spark_data_frame = spark.createDataFrame([Row(ints=[row, row + 1]) for row in range(10000)])
> print(
>     stringify(spark_data_frame.toPandas()),
>     spark_data_frame.mapInPandas(
>         lambda data_frames: (stringify(data_frame) for data_frame in data_frames),
>         schema=StructType().add('ints', StringType()),
>     ).toPandas(),
> )
> {code}
> {noformat}
>   File "/usr/local/lib/python3.12/site-packages/pandas/core/frame.py", line 7200, in sort_values
>     indexer = nargsort(
>               ^^^^^^^^^
>   File "/usr/local/lib/python3.12/site-packages/pandas/core/sorting.py", line 439, in nargsort
>     indexer = non_nan_idx[non_nans.argsort(kind=kind)]
>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
> {noformat}
>  
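
The sort failure described above can likely be studied without a Spark session. The following is a minimal pandas-only sketch, not a fix for the issue itself: it assumes the cells handed to the mapInPandas() function are numpy.ndarray objects in an object column, as the reporter's output shows, and it uses a hypothetical helper, normalize_array_cells, that is not part of PySpark or pandas. The helper converts those cells to plain Python lists, which compare element by element and so can be sorted.

```python
import numpy as np
import pandas as pd


def normalize_array_cells(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: return a copy whose `column` cells are plain lists."""
    result = frame.copy()
    result[column] = result[column].map(
        # ndarray.tolist() also turns numpy integers into Python ints.
        lambda cell: cell.tolist() if isinstance(cell, np.ndarray) else cell
    )
    return result


# Assumption: mimic the reported mapInPandas() input, an object column of ndarrays.
frame = pd.DataFrame({"ints": [np.array([2, 3]), np.array([0, 1]), np.array([1, 2])]})

# Sorting the ndarray cells directly is the step that failed in the report,
# because comparing two arrays yields an array rather than a single bool.
try:
    frame.sort_values("ints")
    sort_failed = False
except ValueError:
    sort_failed = True

# After normalization the cells are lists, which sort lexicographically.
sorted_frame = normalize_array_cells(frame, "ints").sort_values("ints")
sorted_cells = sorted_frame["ints"].tolist()
print(sort_failed, sorted_cells)
```

Inside a mapInPandas() function, the same helper could be applied to each incoming batch before calling sort_values. This only works around the symptom; whether toPandas() and mapInPandas() should agree on the cell type is the question the issue raises.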



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
