Thanks David for the inputs. For reference, a few sketches based on the suggestions are appended after the quoted thread below.

On Thu, Oct 24, 2024 at 8:36 PM Lee, David (ITE) <david....@blackrock.com> wrote:
> Pandas by default treats dates as Python objects. to_pandas() has the
> option below:
>
> date_as_object : bool, default True
>     Cast dates to objects. If False, convert to datetime64 dtype with
>     the equivalent time unit (if supported). Note: in pandas versions
>     < 2.0, only datetime64[ns] conversion is supported.
>
> But using pyarrow.parquet for this task is pretty much instantaneous,
> with near-zero memory overhead, since you're just copying chunks from
> one file and appending them to the end of another:
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>
> ------------------------------
> From: Lee, David (ITE) <david....@blackrock.com>
> Sent: Thursday, October 24, 2024 7:55:59 AM
> To: user@arrow.apache.org <user@arrow.apache.org>; Karthik Deivasigamani <karthik.deivasigam...@moengage.com>
> Subject: Re: PyArrow <-> Pandas Timestamp Conversion Error
>
> There isn't really a need to read Parquet into Arrow Tables. You can
> just use pyarrow to read row groups from the smaller files and write
> them to a new file using pyarrow.parquet.ParquetFile.
>
> ------------------------------
> From: Karthik Deivasigamani via user <user@arrow.apache.org>
> Sent: Wednesday, October 9, 2024 6:39:59 AM
> To: user@arrow.apache.org <user@arrow.apache.org>
> Subject: PyArrow <-> Pandas Timestamp Conversion Error
>
> Hi,
> I have a simple use case: merging data from multiple Parquet files into
> a single file. Usually I'm dealing with about 50 files of size 100k and
> trying to form a single Parquet file. The code looks something like this:
>
>     import pandas as pd
>     import pyarrow as pa
>     import pyarrow.dataset as ds
>
>     dfs = []
>     full_schema = None
>     for s3_url in s3_urls:
>         table = ds.dataset(s3_url, format="parquet").to_table()
>         dfs.append(table.to_pandas(safe=False))
>         # merge_schema() is our helper that keeps folding in any new
>         # columns that appear in each Parquet file
>         full_schema = merge_schema(full_schema, table.schema)
>     df = pd.concat(dfs)
>     df.drop_duplicates(inplace=True, subset=["id"])  # drop any duplicates
>     pa.Table.from_pandas(df, nthreads=1, schema=full_schema)
>
> All the above code does is read files from S3, convert each to an Arrow
> Table, merge the schemas, convert the Tables to DataFrames, and
> concatenate the DataFrames.
> The problem I notice is that one of the columns is a timestamp field,
> and while converting back from a pandas DataFrame to an Arrow Table I
> encounter the following error:
>
>     "Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC')
>     with type Timestamp: tried to convert to int64", 'Conversion failed
>     for column datetime_end_time_199 with type object'
>
> From my understanding of Parquet, Timestamp is a logical data type
> while the underlying primitive is still int64. In this case, why is the
> column being cast to an object? What am I missing here?
> Any help is really appreciated. Thanks
>
> ~
> Karthik
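A few sketches to make the suggestions above concrete. First, the
date_as_object behaviour David quotes; a minimal sketch (the column name
is made up, and the exact datetime64 unit you get back depends on the
pandas version, per the docs quoted above):

    import datetime
    import pyarrow as pa

    table = pa.table({"d": [datetime.date(2024, 10, 24)]})  # inferred as date32

    # Default (date_as_object=True): dates come back as Python
    # datetime.date objects, i.e. an object-dtype pandas column.
    print(table.to_pandas().dtypes)

    # date_as_object=False converts to datetime64 instead
    # (datetime64[ms] on pandas >= 2.0, datetime64[ns] before that).
    print(table.to_pandas(date_as_object=False).dtypes)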
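Second, the row-group copying from David's earlier message; a sketch
assuming all input files share one schema (the file names are
placeholders). Only one row group is decoded at a time, so memory use
stays flat regardless of the total data size:

    import pyarrow.parquet as pq

    source_paths = ["part-0.parquet", "part-1.parquet"]  # placeholder inputs
    writer = None
    try:
        for path in source_paths:
            pf = pq.ParquetFile(path)
            for i in range(pf.num_row_groups):
                row_group = pf.read_row_group(i)  # one row group as a Table
                if writer is None:
                    # Reuse the first row group's schema for the output file.
                    writer = pq.ParquetWriter("merged.parquet", row_group.schema)
                writer.write_table(row_group)
    finally:
        if writer is not None:
            writer.close()

Note this is purely a fast append: unlike the pandas pipeline in the
original question, it neither merges differing schemas nor deduplicates
on "id".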
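Third, if the schemas do differ, it may be possible to merge entirely in
Arrow and skip the pandas round trip, which avoids the object-dtype
problem altogether. A sketch, assuming a reasonably recent pyarrow
(s3_urls is a placeholder list):

    import pyarrow as pa
    import pyarrow.dataset as ds

    s3_urls = ["s3://bucket/part-0.parquet", "s3://bucket/part-1.parquet"]

    tables = [ds.dataset(url, format="parquet").to_table() for url in s3_urls]

    # unify_schemas() plays the role of the merge_schema() helper in the
    # original snippet.
    full_schema = pa.unify_schemas([t.schema for t in tables])

    # promote_options="default" null-pads columns missing from individual
    # tables so tables with different schemas can be concatenated (older
    # pyarrow releases spell this promote=True instead).
    merged = pa.concat_tables(tables, promote_options="default")

Deduplicating on "id" has no single built-in Table method, so that step
would still need pandas or a group_by workaround; this sketch only
covers the read, schema-merge, and concatenate portion.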
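Finally, on the error itself: pd.concat() over frames whose columns or
dtypes disagree can leave a timestamp column with object dtype, and
Table.from_pandas() then fails when the target schema expects an
int64-backed timestamp. Whether that is the actual root cause here is an
assumption, but here is a sketch of the shape of the problem and one
workaround (the column name is taken from the error message):

    import pandas as pd
    import pyarrow as pa

    # Simulate a timestamp column that has degraded to object dtype,
    # e.g. after concatenating frames with mismatched columns.
    df = pd.DataFrame(
        {"datetime_end_time_199": [
            pd.Timestamp("2023-02-12 18:19:25", tz="UTC"), None]},
        dtype="object",
    )

    # Coerce back to a proper tz-aware datetime dtype before handing the
    # frame to Arrow; missing values become NaT.
    df["datetime_end_time_199"] = pd.to_datetime(
        df["datetime_end_time_199"], utc=True)

    table = pa.Table.from_pandas(df, preserve_index=False)
    print(table.schema)  # datetime_end_time_199: timestamp[ns, tz=UTC]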