alamb commented on issue #75:
URL: https://github.com/apache/parquet-testing/issues/75#issuecomment-2800115829

   So my plan is to create a directory of variant values, and for each value, 
provide 3 files:
   1.  `name.metadata` -- the binary contents of the metadata field
   2. `name.value` -- the binary contents of the value field
   3. `name.json` -- the equivalent JSON (as much as possible)
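
A minimal sketch of how a test harness might consume this layout, assuming all three files for a value sit side by side in one directory. The `VariantCase` type and `load_variant_cases` helper are hypothetical names used for illustration; they are not part of any existing library.

```python
import json
from pathlib import Path
from typing import Any, NamedTuple


class VariantCase(NamedTuple):
    name: str
    metadata: bytes  # raw contents of <name>.metadata
    value: bytes     # raw contents of <name>.value
    expected: Any    # parsed contents of <name>.json


def load_variant_cases(directory: Path) -> list[VariantCase]:
    """Collect every <name>.metadata / <name>.value / <name>.json triple."""
    cases = []
    for metadata_path in sorted(directory.glob("*.metadata")):
        # The sibling files share the same stem and differ only in suffix.
        value_path = metadata_path.with_suffix(".value")
        json_path = metadata_path.with_suffix(".json")
        cases.append(
            VariantCase(
                name=metadata_path.stem,
                metadata=metadata_path.read_bytes(),
                value=value_path.read_bytes(),
                expected=json.loads(json_path.read_text(encoding="utf-8")),
            )
        )
    return cases
```

A reader implementation could then decode `metadata` and `value` and compare the result against `expected`, keeping in mind that the JSON is only an approximation for types JSON cannot represent exactly.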
   
   Example values, covering the values in 
[VariantEncoding.md](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md):
   
   There should be at least 6 sets:
   1. `primitive_types/<type>` -- examples of each of the 21 types in 
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-primitive-type-basic_type0
   2. `short_string` -- example of a short string (less than 2^6 = 64 bytes) 
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-header-for-short-string-basic_type1
   3. `object_primitive` -- an object with 21 fields, one for each primitive 
type
   4. `array_primitive` -- an array of 10 strings (maybe also strings 
   5. `object_nested` -- an object with several primitive type fields, and a 
field with a nested object (with primitives and another object) and an array of 
primitives 
   6. `array_nested` -- an array of objects
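
For the `short_string` set, here is a rough sketch of what the `.value` bytes would look like, based on my reading of VariantEncoding.md: the low 2 bits of the first byte hold `basic_type` (1 for a short string) and the upper 6 bits hold the byte length, followed by the UTF-8 bytes. The function names are hypothetical and this is not a reference implementation.

```python
SHORT_STRING_BASIC_TYPE = 1  # basic_type value for short strings
MAX_SHORT_STRING_LEN = 63    # largest length that fits in the 6-bit header


def encode_short_string(text: str) -> bytes:
    """Encode text as a short-string value: one header byte, then UTF-8 data."""
    data = text.encode("utf-8")
    if len(data) > MAX_SHORT_STRING_LEN:
        raise ValueError("short strings hold at most 63 bytes")
    header = (len(data) << 2) | SHORT_STRING_BASIC_TYPE
    return bytes([header]) + data


def decode_short_string(value: bytes) -> str:
    """Decode a short-string value produced by encode_short_string."""
    header = value[0]
    if header & 0b11 != SHORT_STRING_BASIC_TYPE:
        raise ValueError("not a short string value")
    length = header >> 2
    return value[1 : 1 + length].decode("utf-8")
```

Useful boundary cases for the test files would be the empty string and a string of exactly 63 bytes.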
   
   
   
   Other examples that are probably needed:
   1. object with 300 fields (more than 2^8 = 256 fields), which requires a 
num_elements wider than 1 byte
   2. object with 66,000 fields (more than 2^16 = 65,536 fields), which requires 
a 3 byte field id / offset
   3. object with 17M fields (more than 2^24 = 16,777,216 fields), which requires 
a 4 byte field id / offset
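
The thresholds above can be checked with a small helper. This is a sketch under the assumption that an object with `n` distinct fields uses field ids `0..n-1`, so the largest id is `n - 1`; the offset width actually depends on the total byte size of the values, which this does not model. Both function names are hypothetical.

```python
def min_unsigned_width(max_value: int) -> int:
    """Smallest number of bytes (1 to 4) that can hold max_value unsigned."""
    if max_value < 0:
        raise ValueError("max_value must be non-negative")
    if max_value > 0xFFFFFFFF:
        raise ValueError("max_value does not fit in 4 bytes")
    # bit_length() is 0 for zero, so clamp the result to at least 1 byte.
    return max(1, (max_value.bit_length() + 7) // 8)


def needs_wide_num_elements(num_fields: int) -> bool:
    """True when the field count no longer fits in a 1 byte num_elements."""
    return num_fields > 0xFF


for num_fields in (300, 66_000, 17_000_000):
    id_width = min_unsigned_width(num_fields - 1)
    print(num_fields, id_width, needs_wide_num_elements(num_fields))
```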
   
   
   I am a little worried about examples 2 and 3, as the objects will be 
non-trivial in size -- they may need to be gzipped or something to put them into the object
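
One way to get a feel for the size concern is to measure the JSON companion file for a generated object before and after gzip. This is only a sketch: the `field_00000000`-style names and integer values are made up for illustration, and the binary `.metadata` and `.value` files may compress differently.

```python
import gzip
import json


def json_sizes(num_fields: int) -> tuple[int, int]:
    """Return (raw_bytes, gzipped_bytes) for a generated JSON object."""
    # Hypothetical field names; real test data may use different ones.
    obj = {f"field_{i:08d}": i for i in range(num_fields)}
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


raw_size, gz_size = json_sizes(66_000)
print(f"raw={raw_size} bytes, gzipped={gz_size} bytes")
```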

