There isn't really a need to convert the Parquet data to pandas at all. You can
just use pyarrow.parquet.ParquetFile to read the row groups from the smaller
files and write them to a new file.
________________________________
From: Karthik Deivasigamani via user <user@arrow.apache.org>
Sent: Wednesday, October 9, 2024 6:39:59 AM
To: user@arrow.apache.org <user@arrow.apache.org>
Subject: PyArrow <-> Pandas Timestamp Conversion Error


Hi,
   I have a simple use case: merging data from multiple Parquet files into a
single file. Usually I'm dealing with 50 files of size 100k and trying to form
a single Parquet file. The code looks something like this:

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

dfs = []
full_schema = None
for s3_url in s3_urls:
  table = ds.dataset(s3_url, format="parquet").to_table()
  dfs.append(table.to_pandas(safe=False))
  # we keep merging any new columns that appear in the parquet file
  # (merge_schema is my own helper, not shown here)
  full_schema = merge_schema(full_schema, table.schema)
df = pd.concat(dfs)
df.drop_duplicates(inplace=True, subset=["id"])  # drop any duplicates
table = pa.Table.from_pandas(df, nthreads=1, schema=full_schema)

All the above code does is read the files from S3, convert each one to a Table,
record its schema, convert the Table to a DataFrame, and then concatenate the
DataFrames.
The problem I notice is that one of the columns is a timestamp field, and while
converting back from the pandas DataFrame to an Arrow Table I encounter the
following error:

"Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC') with type 
Timestamp: tried to convert to int64", 'Conversion failed for column 
datetime_end_time_199 with type object'

From my understanding of Parquet, Timestamp is a logical data type while the
underlying primitive is still int64. In this case, why is the column being cast
to an object? What am I missing here?
Any help is really appreciated. Thanks

~
Karthik
