[ https://issues.apache.org/jira/browse/ARROW-15151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462733#comment-17462733 ]
Carl Boettiger commented on ARROW-15151:
----------------------------------------

> The new dataset writer in 6.0.x has options to limit the max file size (which
> should get you multiple files)

Yay! That's fantastic! I'll just keep an eye on https://issues.apache.org/jira/browse/ARROW-13703 then :)

> write_dataset() never increments {i} in partitions part-{i}
> ------------------------------------------------------------
>
>                 Key: ARROW-15151
>                 URL: https://issues.apache.org/jira/browse/ARROW-15151
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: Ubuntu 21.04
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Introducing partitioning in write_dataset() creates sub-folders just fine,
> but the lowest-level subfolder only ever contains a part-0.parquet. I don't
> see how to get write_dataset() to ever generate output with multiple
> part-filenames in a single directory, like part-0.parquet, part-1.parquet,
> etc. For example, the documentation for open_dataset() implies we should get
> three `Z`-level parts:
>
> {code:java}
> # You can also partition by the values in multiple columns
> # (here: "cyl" and "gear").
> # This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
> two_levels_tree <- tempfile()
> write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
> list.files(two_levels_tree, recursive = TRUE)
> # In the two previous examples we would have:
> # X = {4,6,8}, the number of cylinders.
> # Y = {3,4,5}, the number of forward gears.
> # Z = {0,1,2}, the number of saved parts, starting from 0. {code}
>
> But I only get the expected structure with part-0.parquet files.
>
> Context: I frequently need to partition large files that lack any natural
> grouping variable; I merely want a bunch of small parts of equal size.
> It would be great if there were an automatic way of doing this; currently I
> can hack this by creating a partition column with integers 1...n, where n is
> my desired number of partitions, and partitioning on that. I'd then like to
> write these to a flat structure with part-0.parquet, part-1.parquet, etc.,
> not a nested folder structure, if possible.
>
> (Or better yet, it would be amazing if write_dataset() just let us set a
> maximum partition file size and could automate the sharding into parts while
> preserving the existing behavior for actually semantically meaningful groups.
> Maybe that is already the intent, but I cannot see how to activate it!)

--
This message was sent by Atlassian Jira
(v8.20.1#820001)