Thanks David for the inputs. For reference, a few sketches based on the suggestions are appended after the quoted thread below.

On Thu, Oct 24, 2024 at 8:36 PM Lee, David (ITE) <david....@blackrock.com> wrote:
> Pandas by default treats dates as Python objects. to_pandas() has the
> option below:
>
> date_as_object : bool, default True
>     Cast dates to objects. If False, convert to datetime64 dtype with
>     the equivalent time unit (if supported). Note: in pandas versions
>     < 2.0, only datetime64[ns] conversion is supported.
>
> But using pyarrow.parquet for this task is pretty much instantaneous,
> with near-zero memory overhead, since you're just copying chunks from
> one file and appending them to the end of another:
>
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>
> ------------------------------
> From: Lee, David (ITE) <david....@blackrock.com>
> Sent: Thursday, October 24, 2024 7:55:59 AM
> To: user@arrow.apache.org <user@arrow.apache.org>; Karthik Deivasigamani <karthik.deivasigam...@moengage.com>
> Subject: Re: PyArrow <-> Pandas Timestamp Conversion Error
>
> There isn't really a need to read Parquet into Arrow Tables. You can
> just use pyarrow to read row groups from the smaller files and write
> them to a new file using pyarrow.parquet.ParquetFile.
>
> ------------------------------
> From: Karthik Deivasigamani via user <user@arrow.apache.org>
> Sent: Wednesday, October 9, 2024 6:39:59 AM
> To: user@arrow.apache.org <user@arrow.apache.org>
> Subject: PyArrow <-> Pandas Timestamp Conversion Error
>
> Hi,
> I have a simple use case: merging data from multiple Parquet files into
> a single file. Usually I'm dealing with about 50 files of size 100k and
> trying to form a single Parquet file. The code looks something like this:
>
>     import pandas as pd
>     import pyarrow as pa
>     import pyarrow.dataset as ds
>
>     dfs = []
>     full_schema = None
>     for s3_url in s3_urls:
>         table = ds.dataset(s3_url, format="parquet").to_table()
>         dfs.append(table.to_pandas(safe=False))
>         # merge_schema() is our helper that keeps folding in any new
>         # columns that appear in each Parquet file
>         full_schema = merge_schema(full_schema, table.schema)
>     df = pd.concat(dfs)
>     df.drop_duplicates(inplace=True, subset=["id"])  # drop any duplicates
>     pa.Table.from_pandas(df, nthreads=1, schema=full_schema)
>
> All the above code does is read files from S3, convert each to an Arrow
> Table, merge the schemas, convert the Tables to DataFrames, and
> concatenate the DataFrames.
> The problem I notice is that one of the columns is a timestamp field,
> and while converting back from a pandas DataFrame to an Arrow Table I
> encounter the following error:
>
>     "Could not convert Timestamp('2023-02-12 18:19:25+0000', tz='UTC')
>     with type Timestamp: tried to convert to int64", 'Conversion failed
>     for column datetime_end_time_199 with type object'
>
> From my understanding of Parquet, Timestamp is a logical data type
> while the underlying primitive is still int64. In this case, why is the
> column being cast to an object? What am I missing here?
> Any help is really appreciated. Thanks
>
> ~
> Karthik
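A few sketches to make the suggestions above concrete. First, the
date_as_object behaviour David quotes; a minimal sketch (the column name
is made up, and the exact datetime64 unit you get back depends on the
pandas version, per the docs quoted above):

    import datetime
    import pyarrow as pa

    table = pa.table({"d": [datetime.date(2024, 10, 24)]})  # inferred as date32

    # Default (date_as_object=True): dates come back as Python
    # datetime.date objects, i.e. an object-dtype pandas column.
    print(table.to_pandas().dtypes)

    # date_as_object=False converts to datetime64 instead
    # (datetime64[ms] on pandas >= 2.0, datetime64[ns] before that).
    print(table.to_pandas(date_as_object=False).dtypes)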
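Second, the row-group copying from David's earlier message; a sketch
assuming all input files share one schema (the file names are
placeholders). Only one row group is decoded at a time, so memory use
stays flat regardless of the total data size:

    import pyarrow.parquet as pq

    source_paths = ["part-0.parquet", "part-1.parquet"]  # placeholder inputs
    writer = None
    try:
        for path in source_paths:
            pf = pq.ParquetFile(path)
            for i in range(pf.num_row_groups):
                row_group = pf.read_row_group(i)  # one row group as a Table
                if writer is None:
                    # Reuse the first row group's schema for the output file.
                    writer = pq.ParquetWriter("merged.parquet", row_group.schema)
                writer.write_table(row_group)
    finally:
        if writer is not None:
            writer.close()

Note this is purely a fast append: unlike the pandas pipeline in the
original question, it neither merges differing schemas nor deduplicates
on "id".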
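Third, if the schemas do differ, it may be possible to merge entirely in
Arrow and skip the pandas round trip, which avoids the object-dtype
problem altogether. A sketch, assuming a reasonably recent pyarrow
(s3_urls is a placeholder list):

    import pyarrow as pa
    import pyarrow.dataset as ds

    s3_urls = ["s3://bucket/part-0.parquet", "s3://bucket/part-1.parquet"]

    tables = [ds.dataset(url, format="parquet").to_table() for url in s3_urls]

    # unify_schemas() plays the role of the merge_schema() helper in the
    # original snippet.
    full_schema = pa.unify_schemas([t.schema for t in tables])

    # promote_options="default" null-pads columns missing from individual
    # tables so tables with different schemas can be concatenated (older
    # pyarrow releases spell this promote=True instead).
    merged = pa.concat_tables(tables, promote_options="default")

Deduplicating on "id" has no single built-in Table method, so that step
would still need pandas or a group_by workaround; this sketch only
covers the read, schema-merge, and concatenate portion.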
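Finally, on the error itself: pd.concat() over frames whose columns or
dtypes disagree can leave a timestamp column with object dtype, and
Table.from_pandas() then fails when the target schema expects an
int64-backed timestamp. Whether that is the actual root cause here is an
assumption, but here is a sketch of the shape of the problem and one
workaround (the column name is taken from the error message):

    import pandas as pd
    import pyarrow as pa

    # Simulate a timestamp column that has degraded to object dtype,
    # e.g. after concatenating frames with mismatched columns.
    df = pd.DataFrame(
        {"datetime_end_time_199": [
            pd.Timestamp("2023-02-12 18:19:25", tz="UTC"), None]},
        dtype="object",
    )

    # Coerce back to a proper tz-aware datetime dtype before handing the
    # frame to Arrow; missing values become NaT.
    df["datetime_end_time_199"] = pd.to_datetime(
        df["datetime_end_time_199"], utc=True)

    table = pa.Table.from_pandas(df, preserve_index=False)
    print(table.schema)  # datetime_end_time_199: timestamp[ns, tz=UTC]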