There isn’t really a need to read the Parquet files into Arrow tables. You can just use pyarrow to read row groups from the smaller files with pyarrow.parquet.ParquetFile and write them to a new file.

________________________________
From: Karthik Deivasigamani via user <user@arrow.apache.org>
Sent: Wednesday, October 9, 2024 6:39:59 AM
To: user@arrow.apache.org <user@arrow.apache.org>
Subject: PyArrow <-> Pandas Timestamp Conversion Error

Hi,

I have a simple use case: merging data from multiple Parquet files into a single file. Usually I'm dealing with 50 files of size 100k and trying to form a single Parquet file. The code looks something like this:

    import pandas as pd
    import pyarrow.dataset as ds
    from pyarrow import Table

    dfs = []
    full_schema = None
    for s3_url in s3_urls:
        table = ds.dataset(s3_url, format="parquet").to_table()
        dfs.append(table.to_pandas(safe=False))
        # we keep merging any new columns that appear in the parquet file
        full_schema = merge_schema(full_schema, table.schema)
    df = pd.concat(dfs)
    df.drop_duplicates(inplace=True, subset=["id"])  # drop any duplicates
    Table.from_pandas(df, nthreads=1, schema=table.schema)

All the above code does is read the files from S3, convert each one to a Table and record its schema, convert the table to a DataFrame, and then concatenate the DataFrames. The problem I notice is that one of the columns is a timestamp field, and while converting back from the pandas DataFrame to an Arrow Table I encounter the following error:

    "Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC') with type Timestamp: tried to convert to int64", 'Conversion failed for column datetime_end_time_199 with type object'

From my understanding of Parquet, Timestamp is a logical data type while the underlying primitive is still int64. In this case, why is the column being cast to an object? What am I missing here?

Any help is really appreciated.

Thanks
~ Karthik