Re: Writing null structs to parquet

Radu Teodorescu Thu, 30 Jul 2020 09:36:14 -0700

You’re a rock star - your PR works for my real-life use case as well. 
Unfortunately, this squashes my hopes of making my first Arrow contribution 
today :)


Now it breaks on a combination of struct and list at read time, but that is 
clearly documented as not yet supported. Is there any timeline for that? (I 
can work around it for now, but it would be nice to have at some point) … 
maybe that can be my first contribution given enough time :).
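
For my own understanding of Micah's diagnosis quoted below (walking up the 
tree to find the def level of the first repeated value), here is a minimal 
sketch. Everything in it is an assumption for illustration: Node, Repetition, 
DefLevelAt and DefLevelOfNearestRepeatedAncestor are hypothetical names, not 
the parquet-cpp schema classes and not the actual fix in the PR. The schema 
root is left out, so top-level fields have a null parent.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified schema node (NOT the parquet-cpp classes).
enum class Repetition { kRequired, kOptional, kRepeated };

struct Node {
  const char* name;
  Repetition repetition;
  const Node* parent;  // nullptr for a top-level field (schema root omitted)
};

// Definition level at `node`: every optional or repeated node on the path
// from the top-level field down to `node` contributes one level.
int16_t DefLevelAt(const Node* node) {
  int16_t level = 0;
  for (const Node* n = node; n != nullptr; n = n->parent) {
    if (n->repetition != Repetition::kRequired) ++level;
  }
  return level;
}

// Walk up from the leaf to the nearest repeated ancestor and return the
// definition level at that ancestor, or 0 if nothing on the path is repeated.
int16_t DefLevelOfNearestRepeatedAncestor(const Node* leaf) {
  assert(leaf != nullptr);
  for (const Node* n = leaf; n != nullptr; n = n->parent) {
    if (n->repetition == Repetition::kRepeated) return DefLevelAt(n);
  }
  return 0;
}
```

For my repro schema (an optional struct holding an optional int64) this gives 
a maximum definition level of 2 at the leaf and 0 for the repeated-ancestor 
level, since nothing on the path is repeated. For an optional list of 
optional structs the walk stops at the repeated group, at level 2.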


> On Jul 30, 2020, at 9:26 AM, Radu Teodorescu <radukay...@yahoo.com.INVALID> 
> wrote:
> 
> 
> Thank you Micah!
> I spent a bit of time trying to get to the bottom of it (I know parquet 
> pretty well, but I'm not that familiar with the arrow parquet inner 
> workings), so if I manage to track down the issue I’ll circle back (I give 
> myself a 30% chance of success given the allotted time and expertise level).
> 
>> On Jul 30, 2020, at 12:31 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> 
>> I created https://issues.apache.org/jira/browse/ARROW-9598 to track.
>> 
>> On Wed, Jul 29, 2020 at 9:13 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> 
>>> So I think the problem is within WriteLevelSpaced [1]: specifically, how we
>>> calculate "min_spaced_def_level" seems incorrect (I think this only worked
>>> for singly nested lists).  This value probably needs to be calculated by
>>> walking up the tree to find the def level of the first repeated value.
>>> 
>>> [1]
>>> https://github.com/apache/arrow/blob/3586292d62c8c348e9fb85676eb524cde53179cf/cpp/src/parquet/column_writer.cc#L1141
>>> 
>>> On Wed, Jul 29, 2020 at 8:01 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Radu,
>>>> This appears to be a bug. Would you mind filing an issue in JIRA?
>>>> 
>>>> I'm looking into it to see if I can figure out what is going on.
>>>> 
>>>> Thanks,
>>>> Micah
>>>> 
>>>> On Wed, Jul 29, 2020 at 1:07 PM Radu Teodorescu
>>>> <radukay...@yahoo.com.invalid> wrote:
>>>> 
>>>>> Is the current version supposed to allow struct columns with null values
>>>>> to be written to parquet?
>>>>> 
>>>>> I narrowed it down to a two-row table with one column, and the resulting
>>>>> parquet file is broken according to both parquet-tools and our own reader
>>>>> (it looks like a buffer is not written in full, but I haven’t dug much
>>>>> deeper).
>>>>> 
>>>>> This is the table:
>>>>> 
>>>>> struct: struct<int: int64>
>>>>> child 0, int: int64
>>>>> ----
>>>>> struct:
>>>>> [
>>>>>   -- is_valid:
>>>>>         [
>>>>>       false,
>>>>>       true
>>>>>     ]
>>>>>   -- child 0 type: int64
>>>>>     [
>>>>>       null,
>>>>>       2
>>>>>     ]
>>>>> ]
>>>>> 
>>>>> and this is my repro table generation:
>>>>> 
>>>>> std::shared_ptr<arrow::Table> generate_table2() {
>>>>>   auto i64builder = std::make_shared<arrow::Int64Builder>();
>>>>>   const std::shared_ptr<arrow::DataType> structType =
>>>>>       arrow::struct_({arrow::field("int", arrow::int64())});
>>>>>   arrow::StructBuilder structBuilder(
>>>>>       structType, arrow::default_memory_pool(),
>>>>>       {std::static_pointer_cast<arrow::ArrayBuilder>(i64builder)});
>>>>>   PARQUET_THROW_NOT_OK(structBuilder.AppendNull());
>>>>>   PARQUET_THROW_NOT_OK(structBuilder.Append());
>>>>>   PARQUET_THROW_NOT_OK(i64builder->Append(2));
>>>>>   std::shared_ptr<arrow::Array> structArray;
>>>>>   PARQUET_THROW_NOT_OK(structBuilder.Finish(&structArray));
>>>>>   std::shared_ptr<arrow::Schema> schema =
>>>>>       arrow::schema({arrow::field("struct", structType)});
>>>>>   return arrow::Table::Make(schema, {structArray});
>>>>> }
>>>>> 
>>>>> Is this a bug, a known limitation, or am I doing something dumb?
>>>>> 
>>>>> Thank you
>>>>> Radu
>>>>> 
>>>>> 
>
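
For reference, a minimal sketch of the definition levels I would expect the 
writer to emit for the repro table above, assuming both the struct and its 
int64 child are optional (so the maximum definition level is 2). StructColumn, 
LeafData and Shred are hypothetical names made up for this illustration. This 
is not the Arrow or parquet-cpp code path, only the level assignment as I 
understand it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flattened view of one struct<int: int64> column.
struct StructColumn {
  std::vector<bool> struct_valid;     // is the struct itself non-null?
  std::vector<bool> child_valid;      // is the int64 child non-null?
  std::vector<int64_t> child_values;  // placeholder value where child is null
};

// What ends up in the leaf column: one def level per row, and values only
// for rows where the leaf is fully defined.
struct LeafData {
  std::vector<int16_t> def_levels;
  std::vector<int64_t> values;
};

LeafData Shred(const StructColumn& col) {
  assert(col.struct_valid.size() == col.child_valid.size());
  assert(col.struct_valid.size() == col.child_values.size());
  LeafData out;
  for (std::size_t i = 0; i < col.struct_valid.size(); ++i) {
    if (!col.struct_valid[i]) {
      out.def_levels.push_back(0);  // struct is null
    } else if (!col.child_valid[i]) {
      out.def_levels.push_back(1);  // struct present, int is null
    } else {
      out.def_levels.push_back(2);  // struct and int both present
      out.values.push_back(col.child_values[i]);
    }
  }
  return out;
}
```

With the two rows from the repro (a null struct, then a struct holding 2), 
this yields definition levels 0 and 2 and a single stored value, 2.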
