That won't help in this specific case, since it involves an array of
strings (which you can't fill with NaN), and for floating-point
arrays we already use np.nan as the "null" representation when
converting to numpy/pandas.

On Wed, 9 Jun 2021 at 03:37, Benjamin Kietzman <bengil...@gmail.com> wrote:
>
> As a workaround, the "fill_null" compute function can be used to replace
> nulls with nans:
>
> >>> nan = pa.scalar(np.NaN, type=pa.float64())
> >>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()
>
> On Tue, Jun 8, 2021, 16:15 Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > Hi Li,
> >
> > It's correct that arrow uses "None" for null values when converting a
> > string array to numpy / pandas.
> > As far as I am aware, there is currently no option to control that
> > (and to make it use np.nan instead), and I am not sure there would be
> > much interest in adding such an option.
> >
> > Now, I know this doesn't give an exact roundtrip in this case, but
> > pandas treats both np.nan and None as missing values in object
> > dtype columns, so behaviour-wise this shouldn't make any difference
> > and the roundtrip is still faithful in that respect.
> >
> > Best,
> > Joris
> >
> > On Tue, 8 Jun 2021 at 21:59, Li Jin <ice.xell...@gmail.com> wrote:
> > >
> > > Hello!
> > >
> > > Apologies if this has been brought before. I'd like to get devs' thoughts
> > > on this potential inconsistency of "what are the python objects for null
> > > values" between pandas and pyarrow.
> > >
> > > Demonstrated with the following example:
> > >
> > > (1) pandas seems to use "np.NaN" to represent a missing value (with
> > > pandas 1.2.4):
> > >
> > > In [32]: df
> > > Out[32]:
> > >            value
> > > key
> > > 1    some_strign
> > >
> > > In [33]: df2
> > > Out[33]:
> > >                 value2
> > > key
> > > 2    some_other_string
> > >
> > > In [34]: df.join(df2)
> > > Out[34]:
> > >            value value2
> > > key
> > > 1    some_strign    NaN
> > >
> > > (2) pyarrow seems to use "None" to represent a missing value (with
> > > pyarrow 4.0.1):
> > >
> > > >>> s = pd.Series(["some_string", np.NaN])
> > > >>> s
> > > 0    some_string
> > > 1            NaN
> > > dtype: object
> > >
> > > >>> pa.Array.from_pandas(s).to_pandas()
> > > 0    some_string
> > > 1           None
> > > dtype: object
> > >
> > >
> > > I have looked around the pyarrow docs and didn't find an option to
> > > use np.NaN for null values with to_pandas, so it's a bit hard to get
> > > round-trip consistency.
> > >
> > >
> > > I'd appreciate any thoughts on how to achieve consistency here.
> > >
> > >
> > > Thanks!
> > >
> > > Li
> >
