Thanks Sergio. If we don't have any unit tests explicitly testing this, it would be a good idea to add some anyway.
- Wes

On Fri, May 4, 2018 at 12:26 PM, <scarrasc...@ravenpack.com> wrote:
> Hi Uwe:
>
> Thanks a lot for your feedback.
>
> While preparing a simple example to reproduce this issue, I have been able to
> get the expected behavior (empty strings properly written as ‘’ in the
> parquet file).
> So actually there’s no problem with the Parquet.write_table
>
> The problem was rather in a bug whereby two steps in my process were in the
> wrong order, so None values were being applied unicode formatting earlier
> than expected, thus becoming ‘None’.
>
> Again, thank you very much and apologies for the noise.
>
> Best,
>
> Sergio Carrascoso
>
>> On 4 May 2018, at 10:54, Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> Hello Sergio,
>>
>> this is definitely unwanted behaviour. Can you open an issue on
>> https://issues.apache.org/jira/projects/PARQUET and provide a minimal
>> reproducing example. There is definitely a difference between empty strings
>> and null strings. Parquet also supports the differentiation thus we should
>> support roundtripping them.
>>
>> Uwe
>>
>> On Thu, May 3, 2018, at 8:47 AM, scarrasc...@ravenpack.com wrote:
>>>
>>> Hi:
>>>
>>> I would like to know if there is any way in PyArrow to write empty
>>> string values to a parquet file.
>>> When I use Parquet.write_table, if any column contains empty string
>>> values, they end up as None in the parquet file.
>>> My process depends on these values to be properly written as empty
>>> strings in the parquet files.
>>>
>>> To provide some context, my current workflow is the following:
>>>
>>> - Read content from json files (using Pandas.read_json)
>>> - Convert the corresponding dataframe to a PyArrow table (using
>>> PyArrow.Table.from_pandas)
>>> - Finally, write the table to a parquet file (using Parquet.write_table)
>>>
>>> I have done some checks during the process, and the empty string values
>>> are being honored until the writing step to a parquet file.
>>>
>>> The options for the write_table method don't provide any specific for
>>> this, is this behavior (write '' as None) an unavoidable default?
>>> Is there any other way to write the parquet files where I have more
>>> options to deal with this?
>>>
>>> Any hint or feedback will be greatly appreciated.
>>>
>>> Thanks a lot in advance, all the best.
>>>
>>> Sergio Carrascoso
>>>
>