[jira] [Commented] (ARROW-15151) write_dataset() never increments {i} in partitions part-{i}

Carl Boettiger (Jira) Mon, 20 Dec 2021 09:21:07 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462733#comment-17462733
 ]


Carl Boettiger commented on ARROW-15151:
----------------------------------------

> The new dataset writer in 6.0.x has options to limit the max file size (which 
> should get you multiple files)

 

Yay! That's fantastic! I'll just keep an eye on 
https://issues.apache.org/jira/browse/ARROW-13703 then :)

>  write_dataset() never increments {i} in partitions part-{i}
> ------------------------------------------------------------
>
>                 Key: ARROW-15151
>                 URL: https://issues.apache.org/jira/browse/ARROW-15151
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: Ubuntu 21.04
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Introducing partitioning in write_dataset() creates sub-folders just fine, 
> but the lowest-level subfolder only ever contains a single part-0.parquet.  
> I don't see how to get write_dataset() to generate output with multiple 
> part files in a single directory, like part-0.parquet, part-1.parquet, 
> etc.  For example, the documentation for open_dataset() implies we should 
> get three `Z`-level parts:
> {code:java}
> # You can also partition by the values in multiple columns
> # (here: "cyl" and "gear").
> # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
> two_levels_tree <- tempfile()
> write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
> list.files(two_levels_tree, recursive = TRUE)
> # In the two previous examples we would have:
> # X = {4,6,8}, the number of cylinders.
> # Y = {3,4,5}, the number of forward gears.
> # Z = {0,1,2}, the number of saved parts, starting from 0.
> {code}
> But I get the expected directory structure with only part-0.parquet files.
>  
>  
> Context: I frequently need to partition large files that lack any natural 
> grouping variable; I merely want a bunch of small parts of equal size.  It 
> would be great if there were an automatic way of doing this; currently I can 
> hack this by creating a partition column with integers 1...n, where n is my 
> desired number of partitions, and partitioning on that.  I'd then like to 
> write these to a flat structure with part-0.parquet, part-1.parquet, etc., 
> not a nested folder structure, if possible. 
> (Or better yet, it would be amazing if write_dataset() just let us set a 
> maximum partition file size and could automate the sharding into parts while 
> preserving the existing behavior for actually semantically meaningful groups. 
> Maybe that is already the intent, but I cannot see how to activate it!)
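
For reference, the integer-column workaround described in the quoted report could look roughly like the sketch below. The helper name assign_chunks and the column name chunk are hypothetical (neither is part of arrow), and the chunks are numbered from 0 here rather than 1...n. It only calls write_dataset() with the partitioning argument already shown in the report, so it still produces one nested chunk=K/ directory per part rather than the flat layout asked for.

```r
# Hypothetical helper (not part of arrow): assign each of `n_rows` rows to
# one of `n_parts` contiguous, roughly equal-sized chunks, numbered from 0.
assign_chunks <- function(n_rows, n_parts) {
  as.integer(floor((seq_len(n_rows) - 1) * n_parts / n_rows))
}

df <- mtcars
df$chunk <- assign_chunks(nrow(df), 4)  # "chunk" is an illustrative column name

# Only attempt the write if the arrow package is installed.
if (requireNamespace("arrow", quietly = TRUE)) {
  out_dir <- tempfile()
  # Partition on the synthetic column, as in the report's workaround.
  arrow::write_dataset(df, out_dir, partitioning = "chunk")
  print(list.files(out_dir, recursive = TRUE))
}
```

Contiguous chunks keep the original row order within each part; a round-robin assignment would balance sizes equally well if order does not matter.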



