The C++ project is located at https://github.com/apache/parquet-cpp; in particular, look at the Arrow API at https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow
Uwe

On Sat, Dec 9, 2017, at 11:25 PM, Renato Marroquín Mogrovejo wrote:
> And by the way, do you have a link for the C++ Parquet API by any chance?
> I have been going over this https://github.com/apache/parquet-mr but only
> java code so far.
>
> 2017-12-09 23:21 GMT+01:00 Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com>:
>
> > yeah, I don't mind looking at the code, but the problem is finding the
> > right code ;)
> > I haven't found any test cases for Impala to read/write specific data
> > formats, maybe I will ping the mailing list.
> > Regarding the parquet::arrow API, do you have a link from GitHub I could chase?
> > I wouldn't mind writing some documentation/examples for the project and
> > making it more approachable for more people :)
> > Many thanks again Wes!
> >
> > 2017-12-09 22:25 GMT+01:00 Wes McKinney <wesmck...@gmail.com>:
> >
> >> I found this comment in Apache Impala helpful, I'm not sure what
> >> better resources are out there outside reading Parquet
> >> implementations:
> >>
> >> https://github.com/apache/impala/blob/master/be/src/exec/hdfs-parquet-scanner.h#L80
> >>
> >> For the parquet::arrow API, you will want to read the header files.
> >> There's some overhead to using the Arrow-based writer API, but I
> >> suspect the overhead is small relative to the other parts of producing
> >> Parquet files.
> >>
> >> - Wes
> >>
> >> On Sat, Dec 9, 2017 at 3:15 PM, Renato Marroquín Mogrovejo
> >> <renatoj.marroq...@gmail.com> wrote:
> >> > Hi Wes,
> >> >
> >> > Thanks a lot for your help! I have been looking at that blog the last
> >> > couple of days but I haven't been able to achieve what I want :(
> >> > Do you know if there is any actual documentation, test cases, or some
> >> > code I can look at?
> >> > Anyway, this is what I have so far:
> >> >
> >> > parquet::Int32Writer* int32_writer1 =
> >> >     static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
> >> > int32_t value = 1;
> >> > value = 1000;
> >> > int16_t definition_level = 2;
> >> > int16_t repetition_level = 0;
> >> > int32_writer1->WriteBatch(1, &definition_level, &repetition_level, &value);
> >> >
> >> > int16_t rpl = 1;
> >> > int32_writer1->WriteBatch(1, &definition_level, &rpl, &value);
> >> >
> >> > This works better (reading the file back with the parquet reader no
> >> > longer yields NULL values), but I still can't read the resulting
> >> > parquet file from Presto/Athena.
> >> > I would like to have as the final result when queried from Presto/Athena:
> >> >
> >> > id  my_array
> >> > 1   array[1000, 1000]
> >> >
> >> > What I currently get is
> >> >
> >> > id  my_array
> >> > 1
> >> >
> >> > Regarding the parquet::arrow API, are there any docs I can look at to
> >> > get me started? Also, are there any performance penalties from using
> >> > parquet::arrow instead of the lower-level Parquet API?
> >> >
> >> > 2017-12-09 1:13 GMT+01:00 Wes McKinney <wesmck...@gmail.com>:
> >> >
> >> >> Didn't realize this question was on the Arrow mailing list instead of
> >> >> the Parquet mailing list!
> >> >>
> >> >> You can make things much easier on yourself by putting your data in
> >> >> Arrow arrays and using the parquet::arrow APIs.
> >> >>
> >> >> If you want to write the data using the lower-level Parquet column
> >> >> writer API, you will have to be careful with the repetition/definition
> >> >> levels. In your case, I believe the values you write need to have
> >> >> definition level 2 (the repeated node and optional node both increment
> >> >> the definition level by 1).
> >> >>
> >> >> I find this blog post helpful for this:
> >> >> https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
> >> >> There is also the Google Dremel paper.
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Fri, Dec 8, 2017 at 6:19 PM, Renato Marroquín Mogrovejo
> >> >> <renatoj.marroq...@gmail.com> wrote:
> >> >> > Thanks Wes! So I create it this way, but I still don't know how to
> >> >> > populate it:
> >> >> >
> >> >> > auto element = PrimitiveNode::Make("element", Repetition::OPTIONAL,
> >> >> >     Type::INT32);
> >> >> > auto list = GroupNode::Make("list", Repetition::REPEATED, {element});
> >> >> > auto my_array = GroupNode::Make("my_array", Repetition::REQUIRED,
> >> >> >     {list}, LogicalType::LIST);
> >> >> > fields.push_back(PrimitiveNode::Make("id", Repetition::REQUIRED,
> >> >> >     Type::INT32, LogicalType::NONE));
> >> >> > fields.push_back(my_array);
> >> >> > auto my_schema = GroupNode::Make("schema", Repetition::REQUIRED, fields);
> >> >> >
> >> >> > I tried populating it this way:
> >> >> >
> >> >> > parquet::Int32Writer* int32_writer1 =
> >> >> >     static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
> >> >> > for (int i = 0; i < NROWS_GROUP; i++) {
> >> >> >   int32_t value = i;
> >> >> >   int16_t definition_level = 1;
> >> >> >   int16_t repetition_level = 0;
> >> >> >   if ((i+1)%2 == 0) {
> >> >> >     repetition_level = 1; // start of a new record
> >> >> >   }
> >> >> >   int32_writer1->WriteBatch(1, &definition_level, &repetition_level,
> >> >> >       &value);
> >> >> > }
> >> >> >
> >> >> > That seems to work, but I can't use the generated file on Athena, and
> >> >> > using the parquet_reader from parquet_cpp returns NULLs on the
> >> >> > elements. Is it that I have to get a handle to the list element?
> >> >> > Thanks again for the help!
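To make the thread's conclusion concrete, here is a consolidated sketch of the low-level approach Wes describes. For the schema built above (a required group my_array (LIST) containing a repeated group list with an optional int32 element), the maximum definition level is 2 (the repeated and optional nodes each add 1) and the maximum repetition level is 1. Each present value is therefore written with definition level 2, with repetition level 0 for the first element of a row and 1 for every further element of the same row. The sketch assumes the 2017-era parquet-cpp low-level writer API as used in the project's reader-writer example; the file name, the single-row, single-row-group setup, and the exact I/O signatures are illustrative and differ across releases.

    #include <memory>

    #include <arrow/io/file.h>
    #include <parquet/api/writer.h>
    #include <parquet/exception.h>

    using parquet::LogicalType;
    using parquet::Repetition;
    using parquet::Type;
    using parquet::schema::GroupNode;
    using parquet::schema::PrimitiveNode;

    int main() {
      // Schema: required int32 id;
      //         required group my_array (LIST) {
      //           repeated group list { optional int32 element; } }
      parquet::schema::NodeVector fields;
      fields.push_back(PrimitiveNode::Make("id", Repetition::REQUIRED, Type::INT32));
      auto element = PrimitiveNode::Make("element", Repetition::OPTIONAL, Type::INT32);
      auto list = GroupNode::Make("list", Repetition::REPEATED, {element});
      fields.push_back(GroupNode::Make("my_array", Repetition::REQUIRED, {list},
                                       LogicalType::LIST));
      auto schema = std::static_pointer_cast<GroupNode>(
          GroupNode::Make("schema", Repetition::REQUIRED, fields));

      // One row group holding a single row: id = 1, my_array = [1000, 1000].
      std::shared_ptr<arrow::io::FileOutputStream> out_file;
      PARQUET_THROW_NOT_OK(
          arrow::io::FileOutputStream::Open("list_example.parquet", &out_file));
      auto file_writer = parquet::ParquetFileWriter::Open(out_file, schema);
      parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(1);

      // Column "id" is REQUIRED at the top level, so no def/rep levels are written.
      auto* id_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
      int32_t id = 1;
      id_writer->WriteBatch(1, nullptr, nullptr, &id);

      // Column "my_array.list.element": max definition level 2 (repeated + optional),
      // max repetition level 1.
      auto* arr_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
      int32_t value = 1000;
      int16_t def_present = 2;   // the element itself is present
      int16_t rep_new_row = 0;   // repetition level 0 starts a new row (new list)
      int16_t rep_same_row = 1;  // repetition level 1 appends to the current list
      arr_writer->WriteBatch(1, &def_present, &rep_new_row, &value);   // my_array[0]
      arr_writer->WriteBatch(1, &def_present, &rep_same_row, &value);  // my_array[1]

      file_writer->Close();
      return 0;
    }

With this schema, a definition level of 1 says the repeated list entry exists but the element is null (which is why the first attempt, which used definition_level = 1, read back as NULLs), and a definition level of 0 says the row's list is empty.

The thread also asks about documentation for parquet::arrow; as Wes notes, the headers are the main reference. For comparison, a rough sketch of the higher-level route, building an arrow::Table and letting the library derive the levels, could look like the following. It is written against a more recent Arrow/parquet-cpp API than the 2017 one discussed above (the builder, Result, and WriteTable signatures have evolved since), and the function name and file name are illustrative.

    #include <iostream>
    #include <memory>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    arrow::Status WriteListExample() {
      // Build id = [1] and my_array = [[1000, 1000]] as Arrow arrays.
      arrow::Int32Builder id_builder;
      ARROW_RETURN_NOT_OK(id_builder.Append(1));

      auto value_builder = std::make_shared<arrow::Int32Builder>();
      arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);
      ARROW_RETURN_NOT_OK(list_builder.Append());          // open the list for row 0
      ARROW_RETURN_NOT_OK(value_builder->Append(1000));
      ARROW_RETURN_NOT_OK(value_builder->Append(1000));

      std::shared_ptr<arrow::Array> id_array, list_array;
      ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));
      ARROW_RETURN_NOT_OK(list_builder.Finish(&list_array));

      auto schema = arrow::schema(
          {arrow::field("id", arrow::int32(), /*nullable=*/false),
           arrow::field("my_array", arrow::list(arrow::int32()))});
      auto table = arrow::Table::Make(schema, {id_array, list_array});

      // WriteTable maps the Arrow list type onto the Parquet LIST structure and
      // computes the definition/repetition levels internally.
      ARROW_ASSIGN_OR_RAISE(auto out_file,
                            arrow::io::FileOutputStream::Open("list_example.parquet"));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), out_file,
                                        /*chunk_size=*/1024);
    }

    int main() {
      arrow::Status st = WriteListExample();
      if (!st.ok()) {
        std::cerr << st.ToString() << std::endl;
        return 1;
      }
      return 0;
    }

The higher-level path avoids hand-computing the levels entirely, which was the source of the NULLs in the first attempt; the trade-off Wes mentions is the extra overhead of materializing the data as Arrow arrays first.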