[ https://issues.apache.org/jira/browse/SPARK-51112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reassigned SPARK-51112:
------------------------------------

Assignee: Venkata Sai Akhil Gudesa

> [Connect] Seg fault when converting empty dataframe with nested array columns to pandas
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-51112
>                 URL: https://issues.apache.org/jira/browse/SPARK-51112
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.0.0, 4.1.0
>            Reporter: Venkata Sai Akhil Gudesa
>            Assignee: Venkata Sai Akhil Gudesa
>            Priority: Major
>              Labels: pull-request-available
>
> Run the following code with a running local Connect server:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StructField, ArrayType, StringType, StructType, IntegerType
> import faulthandler
>
> faulthandler.enable()
>
> spark = SparkSession.builder \
>     .remote("sc://localhost:15002") \
>     .getOrCreate()
>
> sp_df = spark.createDataFrame(
>     data=[],
>     schema=StructType(
>         [
>             StructField(
>                 name='b_int',
>                 dataType=IntegerType(),
>                 nullable=False,
>             ),
>             StructField(
>                 name='b',
>                 dataType=ArrayType(ArrayType(StringType(), True), True),
>                 nullable=True,
>             ),
>         ]
>     )
> )
>
> print(sp_df)
> print('Spark dataframe generated.')
> print(sp_df.toPandas())
> print('Pandas dataframe generated.')
> {code}
> When {{sp_df.toPandas()}} is called, a segmentation fault may occur. The seg fault is non-deterministic and does not occur every single time.
> Segfault:
> {code:java}
> Thread 0x00000001f1904f40 (most recent call first):
>   File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 808 in table_to_dataframe
>   File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/client/core.py", line 949 in to_pandas
>   File "/Users/venkata.gudesa/spark/test_env/lib/python3.13/site-packages/pyspark/sql/connect/dataframe.py", line 1857 in toPandas
>   File "<python-input-3>", line 1 in <module>
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 92 in runcode
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/console.py", line 205 in runsource
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/code.py", line 313 in push
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/simple_interact.py", line 160 in run_multiline_interactive_console
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/main.py", line 59 in interactive_console
>   File "/opt/homebrew/Cellar/python@3.13/3.13.0_1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/_pyrepl/__main__.py", line 6 in <module>
>   File "<frozen runpy>", line 88 in _run_code
> {code}
>
> Observations:
> * When I added some sample data, the issue went away and the conversion was successful.
> * When I changed {{ArrayType(ArrayType(StringType(), True), True)}} to {{ArrayType(StringType(), True)}}, there was no seg fault and execution was successful *regardless of data.*
> * When I converted the nested array column into a JSON field using {{to_json}} (and dropped the original nested array column), there was again no seg fault, and execution was successful *regardless of data.*
>
> Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org