You’re a rock star - your PR works for my real-life use case as well. Unfortunately, this squashes my hopes of making my first Arrow contribution today :)
Now it breaks on a combination of struct and list at read time, but that is clearly documented as not yet supported. Is there a timeline for that? (I can work around it for now, but it would be nice to have at some point) … maybe that can be my first contribution given enough time :)

> On Jul 30, 2020, at 9:26 AM, Radu Teodorescu <radukay...@yahoo.com.INVALID>
> wrote:
>
>
> Thank you Micah!
> I spent a bit of time trying to get to the bottom of it (I know parquet
> pretty well, but I am not that familiar with arrow parquet inner workings) so
> if I manage to track down the issue I’ll circle back (I give myself a 30%
> chance of success given the allotted time and expertise level)
>
>> On Jul 30, 2020, at 12:31 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> I created https://issues.apache.org/jira/browse/ARROW-9598 to track.
>>
>> On Wed, Jul 29, 2020 at 9:13 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> So I think the problem is within WriteLevelSpaced [1], specifically how we
>>> calculate "min_spaced_def_level", seems incorrect (I think this only worked
>>> for single nested lists). This value probably needs to be calculated by
>>> walking up the tree to find the def level of the first repeated value.
>>>
>>> [1]
>>> https://github.com/apache/arrow/blob/3586292d62c8c348e9fb85676eb524cde53179cf/cpp/src/parquet/column_writer.cc#L1141
>>>
>>> On Wed, Jul 29, 2020 at 8:01 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Radu,
>>>> This appears to be a bug, would you mind filing a bug in JIRA?
>>>>
>>>> I'm looking into it to see if I can figure out what is going on.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Wed, Jul 29, 2020 at 1:07 PM Radu Teodorescu
>>>> <radukay...@yahoo.com.invalid> wrote:
>>>>
>>>>> Is the current version supposed to allow struct columns with null values
>>>>> to be written to parquet?
>>>>>
>>>>> I narrowed it down to a table with one column and two rows, and the
>>>>> resulting parquet file is broken both according to parquet-tools as
>>>>> well as our own reader (it looks like a buffer is not written in full, but
>>>>> I haven’t dug much deeper)
>>>>>
>>>>> This is the table:
>>>>>
>>>>> struct: struct<int: int64>
>>>>>   child 0, int: int64
>>>>> ----
>>>>> struct:
>>>>>   [
>>>>>     -- is_valid:
>>>>>       [
>>>>>         false,
>>>>>         true
>>>>>       ]
>>>>>     -- child 0 type: int64
>>>>>       [
>>>>>         null,
>>>>>         2
>>>>>       ]
>>>>>   ]
>>>>>
>>>>> and this is my repro table generation:
>>>>>
>>>>> std::shared_ptr<arrow::Table> generate_table2() {
>>>>>   auto i64builder = std::make_shared<arrow::Int64Builder>();
>>>>>   const std::shared_ptr<arrow::DataType> structType =
>>>>>       arrow::struct_({arrow::field("int", arrow::int64())});
>>>>>   arrow::StructBuilder structBuilder(structType,
>>>>>       arrow::default_memory_pool(), {
>>>>>       std::static_pointer_cast<arrow::ArrayBuilder>(i64builder)});
>>>>>   PARQUET_THROW_NOT_OK(structBuilder.AppendNull());
>>>>>   PARQUET_THROW_NOT_OK(structBuilder.Append());
>>>>>   PARQUET_THROW_NOT_OK(i64builder->Append(2));
>>>>>   std::shared_ptr<arrow::Array> structArray;
>>>>>   PARQUET_THROW_NOT_OK(structBuilder.Finish(&structArray));
>>>>>   std::shared_ptr<arrow::Schema> schema =
>>>>>       arrow::schema({arrow::field("struct",structType)});
>>>>>   return arrow::Table::Make(schema, {structArray});
>>>>> }
>>>>> Is this a bug, known limitation or am I doing something dumb?
>>>>>
>>>>> Thank you
>>>>> Radu
>>>>>
>>>>>
>
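
For anyone finding this thread later: a rough sketch of the definition levels a correct writer would be expected to emit for the two-row table quoted above. This is plain C++ and deliberately does not touch the Arrow or Parquet APIs; Row, DefinitionLevels and LeafValues are hypothetical names made up for illustration, and it assumes both the struct and its int child are nullable (optional in Parquet terms), which gives a maximum definition level of 2.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one row of a nullable struct<int: int64> column.
struct Row {
  bool struct_valid;             // false: the struct itself is null
  std::optional<int64_t> value;  // empty: the int child is null
};

// Definition level per row, assuming both the struct and the child are optional:
//   0 = struct is null, 1 = struct present but child null, 2 = child present.
std::vector<int16_t> DefinitionLevels(const std::vector<Row>& rows) {
  std::vector<int16_t> levels;
  levels.reserve(rows.size());
  for (const Row& row : rows) {
    // A null struct cannot carry a child value.
    assert(row.struct_valid || !row.value.has_value());
    if (!row.struct_valid) {
      levels.push_back(0);
    } else if (!row.value.has_value()) {
      levels.push_back(1);
    } else {
      levels.push_back(2);
    }
  }
  return levels;
}

// Only rows at the maximum definition level contribute a leaf value.
std::vector<int64_t> LeafValues(const std::vector<Row>& rows) {
  std::vector<int64_t> values;
  for (const Row& row : rows) {
    if (row.struct_valid && row.value.has_value()) {
      values.push_back(*row.value);
    }
  }
  return values;
}
```

Feeding it the repro rows (a null struct, then a struct with int = 2) gives levels 0 and 2 and a single stored leaf value, which is a handy thing to compare against what parquet-tools reports for the written file.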