[ https://issues.apache.org/jira/browse/SPARK-50490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927476#comment-17927476 ]
Snehal Bhatnagar commented on SPARK-50490:
------------------------------------------

[~faucct], I plan to start contributing here, is there any way I can help?

> DataFrame.toPandas() and DataFrame.mapInPandas() have different data types
> --------------------------------------------------------------------------
>
>                 Key: SPARK-50490
>                 URL: https://issues.apache.org/jira/browse/SPARK-50490
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.5.3
>            Reporter: Nikita Sokolov
>            Priority: Major
>
> When using .toPandas() the ints are list([0, 1]), but when using
> .mapInPandas() they are numpy.ndarray([0, 1]):
> {code:java}
> from pandas import DataFrame
> from pyspark.sql import SparkSession
> from pyspark.sql.types import Row, StructType, StringType
>
> def stringify(data_frame: DataFrame) -> DataFrame:
>     return data_frame.map(lambda record: repr((type(record), record)))
>
> spark = SparkSession.Builder().getOrCreate()
> spark_data_frame = spark.createDataFrame([Row(ints=[row, row + 1]) for row in range(1)])
> print(stringify(spark_data_frame.toPandas()), spark_data_frame.mapInPandas(
>     lambda data_frames: (stringify(data_frame) for data_frame in data_frames),
>     schema=StructType().add('ints', StringType()),
> ).toPandas())
> {code}
> {code:java}
>                                        ints
> 0                  (<class 'list'>, [0, 1])
>                                        ints
> 0  (<class 'numpy.ndarray'>, array([0, 1]))
> {code}
> I have noticed this while trying to sort the dataframe by this column. Pandas
> fails to sort such dataframe when using .mapInPandas(), but succeeds when
> using .toPandas():
> {code:java}
> from pandas import DataFrame
> from pyspark.sql import SparkSession
> from pyspark.sql.types import Row, StructType, StringType
>
> def stringify(data_frame: DataFrame) -> DataFrame:
>     return data_frame.sort_values('ints').map(lambda record: repr((type(record), record)))
>
> spark = SparkSession.Builder().getOrCreate()
> spark_data_frame = spark.createDataFrame([Row(ints=[row, row + 1]) for row in range(10000)])
> print(stringify(spark_data_frame.toPandas()), spark_data_frame.mapInPandas(
>     lambda data_frames: (stringify(data_frame) for data_frame in data_frames),
>     schema=StructType().add('ints', StringType()),
> ).toPandas())
> {code}
> {code:java}
>   File "/usr/local/lib/python3.12/site-packages/pandas/core/frame.py", line 7200, in sort_values
>     indexer = nargsort(
>               ^^^^^^^^^
>   File "/usr/local/lib/python3.12/site-packages/pandas/core/sorting.py", line 439, in nargsort
>     indexer = non_nan_idx[non_nans.argsort(kind=kind)]
>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
> {code}
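
Until the two code paths agree, one possible workaround on the user side (a sketch only, not a fix in Spark itself) is to turn the numpy.ndarray cells back into plain lists inside the function passed to .mapInPandas(), before calling sort_values(). The helper name ndarray_cells_to_lists is hypothetical, and the small pandas frame below only imitates a batch as .mapInPandas() would deliver it, so no Spark session is needed to try it.

```python
import numpy as np
import pandas as pd


def ndarray_cells_to_lists(data_frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy in which numpy.ndarray cells of `column` become plain lists.

    Hypothetical helper for illustration; it is not part of PySpark or pandas.
    """
    converted = data_frame[column].map(
        lambda cell: cell.tolist() if isinstance(cell, np.ndarray) else cell
    )
    # assign() returns a new frame, so the incoming batch is left untouched.
    return data_frame.assign(**{column: converted})


# Stand-in for one batch inside .mapInPandas(): the cells are arrays, not lists.
batch = pd.DataFrame({"ints": [np.array([2, 3]), np.array([0, 1]), np.array([1, 2])]})

fixed = ndarray_cells_to_lists(batch, "ints")
# Lists compare element by element, so sorting the object column works.
sorted_ints = fixed.sort_values("ints")["ints"].tolist()
print(sorted_ints)
```

In the reproducer above this would mean calling the helper on data_frame before .sort_values('ints') inside stringify. The conversion walks every cell in Python, so it may be slow on large batches.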