I found this comment in Apache Impala helpful; I'm not sure what better
resources are out there beyond reading the Parquet implementations:
https://github.com/apache/impala/blob/master/be/src/exec/hdfs-parquet-scanner.h#L80

For the parquet::arrow API, you will want to read the header files. There's
some overhead to using the Arrow-based writer API, but I suspect the overhead
is small relative to the other parts of producing Parquet files.

- Wes
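A minimal sketch of that Arrow-based path for the id/my_array example discussed
further down the thread: build the list column with arrow::ListBuilder, wrap the
arrays in a Table, and hand it to parquet::arrow::WriteTable, which derives the
repetition/definition levels itself. This assumes a reasonably recent Arrow build
(exact signatures, such as the Result-returning FileOutputStream::Open, differ
across versions), and example.parquet is just a placeholder file name.

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteListExample() {
  // id column: a single row with value 1.
  arrow::Int32Builder id_builder;
  ARROW_RETURN_NOT_OK(id_builder.Append(1));
  std::shared_ptr<arrow::Array> id_array;
  ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));

  // my_array column: one row holding the list [1000, 1000].
  auto value_builder = std::make_shared<arrow::Int32Builder>();
  arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);
  ARROW_RETURN_NOT_OK(list_builder.Append());        // open a new list slot
  ARROW_RETURN_NOT_OK(value_builder->Append(1000));
  ARROW_RETURN_NOT_OK(value_builder->Append(1000));
  std::shared_ptr<arrow::Array> list_array;
  ARROW_RETURN_NOT_OK(list_builder.Finish(&list_array));

  auto schema = arrow::schema(
      {arrow::field("id", arrow::int32(), /*nullable=*/false),
       arrow::field("my_array", arrow::list(arrow::int32()))});
  auto table = arrow::Table::Make(schema, {id_array, list_array});

  // The LIST repetition/definition levels are computed by the writer.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile,
                        arrow::io::FileOutputStream::Open("example.parquet"));
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/1024);
}

Because the level bookkeeping happens inside the writer, this route sidesteps
the definition/repetition-level questions discussed in the rest of the thread.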
On Sat, Dec 9, 2017 at 3:15 PM, Renato Marroquín Mogrovejo
<renatoj.marroq...@gmail.com> wrote:
> Hi Wes,
>
> Thanks a lot for your help! I have been looking at that blog for the last
> couple of days, but I haven't been able to achieve what I want :(
> Do you know if there is any actual documentation, test cases, or some
> code I can look at?
> Anyway, this is what I have so far:
>
> parquet::Int32Writer* int32_writer1 =
>     static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
> int32_t value = 1000;
> int16_t definition_level = 2;
> int16_t repetition_level = 0;
> int32_writer1->WriteBatch(1, &definition_level, &repetition_level, &value);
>
> int16_t rpl = 1;
> int32_writer1->WriteBatch(1, &definition_level, &rpl, &value);
>
> This works better (the parquet reader no longer yields NULL values), but I
> still can't read the resulting Parquet file from Presto/Athena.
> The final result I would like to see when querying from Presto/Athena is:
>
> id   my_array
> 1    array[1000, 1000]
>
> What I currently get is:
>
> id   my_array
> 1
>
> Regarding the parquet::arrow API, are there any docs I can look at to get
> started? Also, is there any performance penalty from using parquet::arrow
> instead of the lower-level Parquet API?
>
> 2017-12-09 1:13 GMT+01:00 Wes McKinney <wesmck...@gmail.com>:
>
>> Didn't realize this question was on the Arrow mailing list instead of
>> the Parquet mailing list!
>>
>> You can make things much easier on yourself by putting your data in
>> Arrow arrays and using the parquet::arrow APIs.
>>
>> If you want to write the data using the lower-level Parquet column
>> writer API, you will have to be careful with the repetition/definition
>> levels. In your case, I believe the values you write need to have
>> definition level 2 (the repeated node and the optional node both
>> increment the definition level by 1).
>>
>> I find this blog post helpful for this:
>> https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
>> There is also the Google Dremel paper.
>>
>> - Wes
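To make the level arithmetic Wes describes concrete for this schema (a required
group my_array tagged LIST, holding a repeated group "list" with an optional
int32 "element"), here is a small worked sketch. The level/value pairs below are
derived from the Dremel rules rather than taken from the thread, and
int32_writer1 is the column writer variable from the code above.

// Levels for this schema: max definition level 2, max repetition level 1.
// A (definition, repetition) pair is written for every list slot, and a row
// with an empty list still gets one pair; only the defined (non-null)
// elements appear in the values array.
//
//   my_array = [1000, 1000]  ->  (2, 0), (2, 1)   two values
//   my_array = [1000, NULL]  ->  (2, 0), (1, 1)   one value
//   my_array = []            ->  (0, 0)           no value
//
// Writing those three rows in a single call:
int32_t values[3] = {1000, 1000, 1000};      // defined elements only
int16_t def_levels[5] = {2, 2, 2, 1, 0};
int16_t rep_levels[5] = {0, 1, 0, 1, 0};
int32_writer1->WriteBatch(/*num_values=*/5, def_levels, rep_levels, values);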
>> On Fri, Dec 8, 2017 at 6:19 PM, Renato Marroquín Mogrovejo
>> <renatoj.marroq...@gmail.com> wrote:
>> > Thanks Wes! So I created it this way, but I still don't know how to
>> > populate it:
>> >
>> > auto element = PrimitiveNode::Make("element", Repetition::OPTIONAL,
>> >     Type::INT32);
>> > auto list = GroupNode::Make("list", Repetition::REPEATED, {element});
>> > auto my_array = GroupNode::Make("my_array", Repetition::REQUIRED, {list},
>> >     LogicalType::LIST);
>> > fields.push_back(PrimitiveNode::Make("id", Repetition::REQUIRED,
>> >     Type::INT32, LogicalType::NONE));
>> > fields.push_back(my_array);
>> > auto my_schema = GroupNode::Make("schema", Repetition::REQUIRED, fields);
>> >
>> > I tried populating it this way:
>> >
>> > parquet::Int32Writer* int32_writer1 =
>> >     static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
>> > for (int i = 0; i < NROWS_GROUP; i++) {
>> >   int32_t value = i;
>> >   int16_t definition_level = 1;
>> >   int16_t repetition_level = 0;
>> >   if ((i+1) % 2 == 0) {
>> >     repetition_level = 1; // start of a new record
>> >   }
>> >   int32_writer1->WriteBatch(1, &definition_level, &repetition_level,
>> >       &value);
>> > }
>> >
>> > That seems to work, but I can't use the generated file on Athena, and
>> > the parquet_reader from parquet-cpp returns NULLs for the elements. Is
>> > it that I have to get a handle to the list element? Thanks again for
>> > the help!
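Pulling the pieces of the thread together, a minimal sketch of the lower-level
writer path might look like the following. It keeps the schema from the message
above and applies the advice about levels (definition level 2 for present
elements; repetition level 0 for a new row, 1 for a continuation), written
against the pre-1.0 parquet-cpp style API used in the thread (LogicalType::LIST,
parquet::Int32Writer). MakeExampleSchema and WriteOneRow are just illustrative
names, and rg_writer is assumed to come from a ParquetFileWriter opened with
this schema.

#include <parquet/api/writer.h>

using parquet::LogicalType;
using parquet::Repetition;
using parquet::Type;
using parquet::schema::GroupNode;
using parquet::schema::NodePtr;
using parquet::schema::NodeVector;
using parquet::schema::PrimitiveNode;

// required int32 id, plus my_array as a LIST of optional int32 elements.
// The element has max definition level 2 and max repetition level 1.
NodePtr MakeExampleSchema() {
  auto element = PrimitiveNode::Make("element", Repetition::OPTIONAL, Type::INT32);
  auto list = GroupNode::Make("list", Repetition::REPEATED, {element});
  auto my_array =
      GroupNode::Make("my_array", Repetition::REQUIRED, {list}, LogicalType::LIST);
  NodeVector fields;
  fields.push_back(
      PrimitiveNode::Make("id", Repetition::REQUIRED, Type::INT32, LogicalType::NONE));
  fields.push_back(my_array);
  return GroupNode::Make("schema", Repetition::REQUIRED, fields);
}

// Writes the single row (id = 1, my_array = [1000, 1000]) into an open row group.
void WriteOneRow(parquet::RowGroupWriter* rg_writer) {
  // Column 0: id. A required top-level column has max definition/repetition
  // level 0, so no levels need to be passed.
  auto* id_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
  int32_t id = 1;
  id_writer->WriteBatch(1, nullptr, nullptr, &id);

  // Column 1: my_array. Every present element carries the maximum definition
  // level (2); repetition level 0 opens a new row, 1 stays in the same list.
  auto* list_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
  int32_t values[2] = {1000, 1000};
  int16_t def_levels[2] = {2, 2};
  int16_t rep_levels[2] = {0, 1};
  list_writer->WriteBatch(2, def_levels, rep_levels, values);
}

The row group itself would come from a parquet::ParquetFileWriter opened against
std::static_pointer_cast<GroupNode>(MakeExampleSchema()); the file-opening calls
differ between parquet-cpp versions, so they are left out of the sketch.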