The C++ project is located at https://github.com/apache/parquet-cpp; in particular, look at the Arrow API at https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow
Uwe

On Sat, Dec 9, 2017, at 11:25 PM, Renato Marroquín Mogrovejo wrote:
> And by the way, do you have a link for the C++ Parquet API by any chance?
> I have been going over this https://github.com/apache/parquet-mr but only
> java code so far.
>
> 2017-12-09 23:21 GMT+01:00 Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com>:
>
> > yeah, I don't mind looking at the code, but the problem is finding the
> > right code ;)
> > I haven't found any test cases for Impala to read/write specific data
> > formats, maybe I will ping the mailing list.
> > Regarding the parquet::arrow API, do you have a link from GitHub I could chase?
> > I wouldn't mind writing some documentation/examples for the project and
> > making it more approachable for more people :)
> > Many thanks again Wes!
> >
> > 2017-12-09 22:25 GMT+01:00 Wes McKinney <wesmck...@gmail.com>:
> >
> >> I found this comment in Apache Impala helpful, I'm not sure what
> >> better resources are out there outside reading Parquet
> >> implementations:
> >>
> >> https://github.com/apache/impala/blob/master/be/src/exec/hdfs-parquet-scanner.h#L80
> >>
> >> For the parquet::arrow API, you will want to read the header files.
> >> There's some overhead to using the Arrow-based writer API, but I
> >> suspect the overhead is small relative to the other parts of producing
> >> Parquet files.
> >>
> >> - Wes
> >>
> >> On Sat, Dec 9, 2017 at 3:15 PM, Renato Marroquín Mogrovejo
> >> <renatoj.marroq...@gmail.com> wrote:
> >> > Hi Wes,
> >> >
> >> > Thanks a lot for your help! I have been looking at that blog the last
> >> > couple of days but I haven't been able to achieve what I want :(
> >> > Do you know if there is any actual documentation, test cases, or some
> >> > code I can look at?
> >> > Anyway, this is what I have so far:
> >> >
> >> > parquet::Int32Writer* int32_writer1 =
> >> >     static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
> >> > int32_t value = 1;
> >> > value = 1000;
> >> > int16_t definition_level = 2;
> >> > int16_t repetition_level = 0;
> >> > int32_writer1->WriteBatch(1, &definition_level, &repetition_level, &value);
> >> >
> >> > int16_t rpl = 1;
> >> > int32_writer1->WriteBatch(1, &definition_level, &rpl, &value);
> >> >
> >> > This works better (reading the file back with the parquet reader no
> >> > longer yields NULL values), but I still can't read the resulting
> >> > parquet file from Presto/Athena.
> >> > I would like to have as the final result when queried from Presto/Athena:
> >> >
> >> > id  my_array
> >> > 1   array[1000, 1000]
> >> >
> >> > What I currently get is
> >> >
> >> > id  my_array
> >> > 1
> >> >
> >> > Regarding the parquet::arrow API, are there any docs I can look at to
> >> > get me started? Also, are there any performance penalties from using
> >> > parquet::arrow instead of the lower-level Parquet API?
> >> >
> >> > 2017-12-09 1:13 GMT+01:00 Wes McKinney <wesmck...@gmail.com>:
> >> >
> >> >> Didn't realize this question was on the Arrow mailing list instead of
> >> >> the Parquet mailing list!
> >> >>
> >> >> You can make things much easier on yourself by putting your data in
> >> >> Arrow arrays and using the parquet::arrow APIs.
> >> >>
> >> >> If you want to write the data using the lower-level Parquet column
> >> >> writer API, you will have to be careful with the repetition/definition
> >> >> levels. In your case, I believe the values you write need to have
> >> >> definition level 2 (the repeated node and optional node both increment
> >> >> the definition level by 1).
> >> >>
> >> >> I find this blog post helpful for this:
> >> >> https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
> >> >> There is also the Google Dremel paper.
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Fri, Dec 8, 2017 at 6:19 PM, Renato Marroquín Mogrovejo
> >> >> <renatoj.marroq...@gmail.com> wrote:
> >> >> > Thanks Wes! So I create it this way, but I still don't know how to
> >> >> > populate it:
> >> >> >
> >> >> > auto element = PrimitiveNode::Make("element", Repetition::OPTIONAL,
> >> >> >     Type::INT32);
> >> >> > auto list = GroupNode::Make("list", Repetition::REPEATED, {element});
> >> >> > auto my_array = GroupNode::Make("my_array", Repetition::REQUIRED,
> >> >> >     {list}, LogicalType::LIST);
> >> >> > fields.push_back(PrimitiveNode::Make("id", Repetition::REQUIRED,
> >> >> >     Type::INT32, LogicalType::NONE));
> >> >> > fields.push_back(my_array);
> >> >> > auto my_schema = GroupNode::Make("schema", Repetition::REQUIRED, fields);
> >> >> >
> >> >> > I tried populating it this way:
> >> >> >
> >> >> > parquet::Int32Writer* int32_writer1 =
> >> >> >     static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
> >> >> > for (int i = 0; i < NROWS_GROUP; i++) {
> >> >> >   int32_t value = i;
> >> >> >   int16_t definition_level = 1;
> >> >> >   int16_t repetition_level = 0;
> >> >> >   if ((i+1)%2 == 0) {
> >> >> >     repetition_level = 1; // start of a new record
> >> >> >   }
> >> >> >   int32_writer1->WriteBatch(1, &definition_level, &repetition_level,
> >> >> >       &value);
> >> >> > }
> >> >> >
> >> >> > That seems to work, but I can't use the generated file on Athena, and
> >> >> > using the parquet_reader from parquet_cpp returns NULLs on the
> >> >> > elements. Is it that I have to get a handle to the list element?
> >> >> > Thanks again for the help!
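To make the thread's conclusion concrete, here is a consolidated sketch of the low-level approach Wes describes. For the schema built above (a required group my_array (LIST) containing a repeated group list with an optional int32 element), the maximum definition level is 2 (the repeated and optional nodes each add 1) and the maximum repetition level is 1. Each present value is therefore written with definition level 2, with repetition level 0 for the first element of a row and 1 for every further element of the same row. The sketch assumes the 2017-era parquet-cpp low-level writer API as used in the project's reader-writer example; the file name, the single-row, single-row-group setup, and the exact I/O signatures are illustrative and differ across releases.

    #include <memory>

    #include <arrow/io/file.h>
    #include <parquet/api/writer.h>
    #include <parquet/exception.h>

    using parquet::LogicalType;
    using parquet::Repetition;
    using parquet::Type;
    using parquet::schema::GroupNode;
    using parquet::schema::PrimitiveNode;

    int main() {
      // Schema: required int32 id;
      //         required group my_array (LIST) {
      //           repeated group list { optional int32 element; } }
      parquet::schema::NodeVector fields;
      fields.push_back(PrimitiveNode::Make("id", Repetition::REQUIRED, Type::INT32));
      auto element = PrimitiveNode::Make("element", Repetition::OPTIONAL, Type::INT32);
      auto list = GroupNode::Make("list", Repetition::REPEATED, {element});
      fields.push_back(GroupNode::Make("my_array", Repetition::REQUIRED, {list},
                                       LogicalType::LIST));
      auto schema = std::static_pointer_cast<GroupNode>(
          GroupNode::Make("schema", Repetition::REQUIRED, fields));

      // One row group holding a single row: id = 1, my_array = [1000, 1000].
      std::shared_ptr<arrow::io::FileOutputStream> out_file;
      PARQUET_THROW_NOT_OK(
          arrow::io::FileOutputStream::Open("list_example.parquet", &out_file));
      auto file_writer = parquet::ParquetFileWriter::Open(out_file, schema);
      parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(1);

      // Column "id" is REQUIRED at the top level, so no def/rep levels are written.
      auto* id_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
      int32_t id = 1;
      id_writer->WriteBatch(1, nullptr, nullptr, &id);

      // Column "my_array.list.element": max definition level 2 (repeated + optional),
      // max repetition level 1.
      auto* arr_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
      int32_t value = 1000;
      int16_t def_present = 2;   // the element itself is present
      int16_t rep_new_row = 0;   // repetition level 0 starts a new row (new list)
      int16_t rep_same_row = 1;  // repetition level 1 appends to the current list
      arr_writer->WriteBatch(1, &def_present, &rep_new_row, &value);   // my_array[0]
      arr_writer->WriteBatch(1, &def_present, &rep_same_row, &value);  // my_array[1]

      file_writer->Close();
      return 0;
    }

With this schema, a definition level of 1 says the repeated list entry exists but the element is null (which is why the first attempt, which used definition_level = 1, read back as NULLs), and a definition level of 0 says the row's list is empty.

The thread also asks about documentation for parquet::arrow; as Wes notes, the headers are the main reference. For comparison, a rough sketch of the higher-level route, building an arrow::Table and letting the library derive the levels, could look like the following. It is written against a more recent Arrow/parquet-cpp API than the 2017 one discussed above (the builder, Result, and WriteTable signatures have evolved since), and the function name and file name are illustrative.

    #include <iostream>
    #include <memory>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    arrow::Status WriteListExample() {
      // Build id = [1] and my_array = [[1000, 1000]] as Arrow arrays.
      arrow::Int32Builder id_builder;
      ARROW_RETURN_NOT_OK(id_builder.Append(1));

      auto value_builder = std::make_shared<arrow::Int32Builder>();
      arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);
      ARROW_RETURN_NOT_OK(list_builder.Append());          // open the list for row 0
      ARROW_RETURN_NOT_OK(value_builder->Append(1000));
      ARROW_RETURN_NOT_OK(value_builder->Append(1000));

      std::shared_ptr<arrow::Array> id_array, list_array;
      ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
      ARROW_RETURN_NOT_OK(list_builder.Finish(&list_array));

      auto schema = arrow::schema(
          {arrow::field("id", arrow::int32(), /*nullable=*/false),
           arrow::field("my_array", arrow::list(arrow::int32()))});
      auto table = arrow::Table::Make(schema, {id_array, list_array});

      // WriteTable maps the Arrow list type onto the Parquet LIST structure and
      // computes the definition/repetition levels internally.
      ARROW_ASSIGN_OR_RAISE(auto out_file,
                            arrow::io::FileOutputStream::Open("list_example.parquet"));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), out_file,
                                        /*chunk_size=*/1024);
    }

    int main() {
      arrow::Status st = WriteListExample();
      if (!st.ok()) {
        std::cerr << st.ToString() << std::endl;
        return 1;
      }
      return 0;
    }

The higher-level path avoids hand-computing the levels entirely, which was the source of the NULLs in the first attempt; the trade-off Wes mentions is the extra overhead of materializing the data as Arrow arrays first.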