[ https://issues.apache.org/jira/browse/SPARK-51112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reassigned SPARK-51112:
------------------------------------

Assignee: Venkata Sai Akhil Gudesa

> [Connect] Seg fault when converting empty dataframe with nested array columns to pandas
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-51112
>                 URL: https://issues.apache.org/jira/browse/SPARK-51112
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.0.0, 4.1.0
>            Reporter: Venkata Sai Akhil Gudesa
>            Assignee: Venkata Sai Akhil Gudesa
>            Priority: Major
>              Labels: pull-request-available
>
> Run the following code with a running local Connect server:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructField, ArrayType, StringType, StructType, IntegerType
> import faulthandler
>
> faulthandler.enable()
>
> spark = SparkSession.builder \
>     .remote("sc://localhost:15002") \
>     .getOrCreate()
>
> sp_df = spark.createDataFrame(
>     data=[],
>     schema=StructType(
>         [
>             StructField(
>                 name='b_int',
>                 dataType=IntegerType(),
>                 nullable=False,
>             ),
>             StructField(
>                 name='b',
>                 dataType=ArrayType(ArrayType(StringType(), True), True),
>                 nullable=True,
>             ),
>         ]
>     )
> )
>
> print(sp_df)
> print('Spark dataframe generated.')
> print(sp_df.toPandas())
> print('Pandas dataframe generated.')
> {code}
> When {{sp_df.toPandas()}} is called, a segmentation fault may occur. The seg fault is non-deterministic and does not occur every single time.
> Segfault:
> {code:java}
> Thread 0x00000001f1904f40 (most recent call first):
>   File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 808 in table_to_dataframe
>   File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/client/core.py", line 949 in to_pandas
>   File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/dataframe.py", line 1857 in toPandas
>   File "<python-input-3>", line 1 in <module>
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 92 in runcode
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/console.py", line 205 in runsource
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 313 in push
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/simple_interact.py", line 160 in run_multiline_interactive_console
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/main.py", line 59 in interactive_console
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/__main__.py", line 6 in <module>
>   File "<frozen runpy>", line 88 in _run_code
> {code}
>
> Observations:
> * When I added some sample data, the issue went away and the conversion was successful.
> * When I changed {{ArrayType(ArrayType(StringType(), True), True)}} to {{ArrayType(StringType(), True)}}, there was no seg fault and execution was successful *regardless of data.*
> * When I converted the nested array column into a JSON field using {{to_json}} (and dropped the original nested array column), there was again no seg fault, and execution was successful *regardless of data.*
>
> Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org