HeartSaVioR opened a new pull request, #52479:
URL: https://github.com/apache/spark/pull/52479

   ### What changes were proposed in this pull request?
   
   This PR removes the use of `fetchWithArrow` in `ListState.put`/`appendList`.
   
   ### Why are the changes needed?
   
   We have observed a case where the Arrow path for sending the list fails, while the normal (non-Arrow) path does not.
   
   The case is a `None` value in an IntegerType() field of a list state element. The column is declared with nullable=True, so `None` should be allowed, but an error is raised during the conversion:
   
   ```
     File "/databricks/spark/python/pyspark/sql/streaming/stateful_processor.py", line 147, in put
       self._listStateClient.put(self._stateName, newState)
     File "/databricks/spark/python/pyspark/sql/streaming/list_state_client.py", line 195, in put
       self._stateful_processor_api_client._send_arrow_state(self.schema, values)
     File "/spark/python/pyspark/sql/streaming/stateful_processor_api_client.py", line 604, in _send_arrow_state
       pandas_df = convert_pandas_using_numpy_type(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/spark/python/pyspark/sql/pandas/types.py", line 1599, in convert_pandas_using_numpy_type
       df[field.name] = df[field.name].astype(np_type)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
       new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
       return self.apply(
              ^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
       applied = getattr(b, f)(**kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
       new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
       new_values = astype_array(values, dtype, copy=copy)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
       values = _astype_nansafe(values, dtype, copy=copy)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
       return arr.astype(dtype, copy=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
   ```
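
The last frames of the traceback can be approximated outside Spark. The following is a minimal sketch, not the code in `convert_pandas_using_numpy_type`: it assumes the column reaches `astype` as an object-dtype pandas Series, and the helper name `cast_error_name` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def cast_error_name(values, np_type):
    """Return the name of the exception raised by astype, or None on success.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # Assumption: the column holds Python objects, so None stays None
    # instead of being converted to NaN.
    series = pd.Series(values, dtype=object)
    try:
        series.astype(np_type)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# A nullable integer column containing None cannot be cast to a NumPy int type.
with_none = cast_error_name([1, None, 3], np.int32)
# The same cast succeeds when every element is an int.
without_none = cast_error_name([1, 2, 3], np.int32)
print(with_none, without_none)
```

NumPy integer dtypes have no representation for a missing value, so a direct cast of this kind fails whenever an element is `None`, regardless of the nullable=True setting on the Spark schema.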
   
   Since we don't know how useful the Arrow-based path for sending the list is, it is better not to try to fix the issue in the Arrow code path at this point and to just remove it.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Updated the existing test to cover the observed case.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]