Copilot commented on code in PR #210:
URL: https://github.com/apache/sedona-db/pull/210#discussion_r2423032922
##########
r/sedonadb/R/dataframe.R:
##########
@@ -182,6 +182,82 @@ sd_preview <- function(.data, n = NULL, ascii = NULL,
width = NULL) {
invisible(.data)
}
+#' Write DataFrame to (Geo)Parquet files
+#'
+#' Write this DataFrame to one or more (Geo)Parquet files. For input that contains
+#' geometry columns, GeoParquet metadata is written such that suitable readers can
+#' recreate Geometry/Geography types when reading the output and potentially read
+#' fewer row groups when only a subset of the file is needed for a given query.
+#'
+#' @inheritParams sd_count
+#' @param path A filename or directory to which parquet file(s) should be written
+#' @param partition_by A character vector of column names to partition by. If non-empty,
+#' applies hive-style partitioning to the output
+#' @param sort_by A character vector of column names to sort by. Currently only
+#' ascending sort is supported
+#' @param single_file_output Use TRUE or FALSE to force writing a single Parquet
+#' file vs. writing one file per partition to a directory. By default,
+#' a single file is written if `partition_by` is unspecified and
+#' `path` ends with `.parquet`
+#' @param geoparquet_version GeoParquet metadata version to write if output contains
+#' one or more geometry columns. The default ("1.0") is the most widely
+#' supported and will result in geometry columns being recognized in many
+#' readers; however, it only includes statistics at the file level.
+#' Use "1.1" to compute an additional bounding box column
+#' for every geometry column in the output: some readers can use these columns
+#' to prune row groups when files contain an effective spatial ordering.
+#' The extra columns will appear just before their geometry column and
+#' will be named "[geom_col_name]_bbox" for all geometry columns except
+#' "geometry", whose bounding box column name is just "bbox"
+#' @param overwrite_bbox_columns Use TRUE to overwrite any bounding box columns
+#' that already exist in the input. This is useful in a read -> modify
+#' -> write scenario to ensure these columns are up-to-date. If FALSE
+#' (the default), an error will be raised if a bbox column already exists
+#'
+#' @returns The input, invisibly
+#' @export
+#'
+#' @examples
+#' tmp_parquet <- tempfile(fileext = ".parquet")
+#'
+#' sd_sql("SELECT ST_SetSRID(ST_Point(1, 2), 4326) as geom") |>
+#' sd_write_parquet(tmp_parquet)
+#'
+#' sd_read_parquet(tmp_parquet)
Review Comment:
The example should include cleanup of the temporary file using `unlink()` or
`on.exit()` to follow best practices for temporary file management in
documentation examples.
```suggestion
#' sd_read_parquet(tmp_parquet)
#' unlink(tmp_parquet)
```
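For reference, the `on.exit()` alternative mentioned in the comment requires a function scope, since `on.exit()` registers cleanup to run when the enclosing function returns. This is a hedged sketch, not the suggested fix; for a top-level roxygen example the plain `unlink()` call in the suggestion above is the simpler choice:

```r
# Sketch only: wrapping the roundtrip in a helper so on.exit() applies;
# cleanup runs even if the write or read fails partway through.
demo_roundtrip <- function() {
  tmp_parquet <- tempfile(fileext = ".parquet")
  on.exit(unlink(tmp_parquet), add = TRUE)

  sd_sql("SELECT ST_SetSRID(ST_Point(1, 2), 4326) as geom") |>
    sd_write_parquet(tmp_parquet)
  sd_read_parquet(tmp_parquet)
}
```

The `add = TRUE` argument keeps any previously registered cleanup handlers instead of replacing them.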
##########
r/sedonadb/src/init.c:
##########
@@ -16,9 +16,9 @@
// under the License.
#include <Rinternals.h>
+#include <stdint.h>
#include <R_ext/Parse.h>
-#include <stdint.h>
Review Comment:
The include order has been changed unnecessarily. Standard practice is to
include system headers before local headers. Move `#include <stdint.h>` back to
after `#include <R_ext/Parse.h>`.
##########
r/sedonadb/man/sd_write_parquet.Rd:
##########
@@ -0,0 +1,70 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataframe.R
+\name{sd_write_parquet}
+\alias{sd_write_parquet}
+\title{Write DataFrame to (Geo)Parquet files}
+\usage{
+sd_write_parquet(
+ .data,
+ path,
+ partition_by = character(0),
+ sort_by = character(0),
+ single_file_output = NULL,
+ geoparquet_version = "1.0",
+ overwrite_bbox_columns = FALSE
+)
+}
+\arguments{
+\item{.data}{A sedonadb_dataframe}
+
+\item{path}{A filename or directory to which parquet file(s) should be written}
+
\item{partition_by}{A character vector of column names to partition by. If non-empty,
applies hive-style partitioning to the output}
+
+\item{sort_by}{A character vector of column names to sort by. Currently only
+ascending sort is supported}
+
+\item{single_file_output}{Use TRUE or FALSE to force writing a single Parquet
+file vs. writing one file per partition to a directory. By default,
+a single file is written if \code{partition_by} is unspecified and
+\code{path} ends with \code{.parquet}}
+
\item{geoparquet_version}{GeoParquet metadata version to write if output contains
one or more geometry columns. The default ("1.0") is the most widely
supported and will result in geometry columns being recognized in many
readers; however, it only includes statistics at the file level.
Use "1.1" to compute an additional bounding box column
for every geometry column in the output: some readers can use these columns
to prune row groups when files contain an effective spatial ordering.
The extra columns will appear just before their geometry column and
will be named "\link{geom_col_name}_bbox" for all geometry columns except
"geometry", whose bounding box column name is just "bbox"}
+
+\item{overwrite_bbox_columns}{Use TRUE to overwrite any bounding box columns
+that already exist in the input. This is useful in a read -> modify
+-> write scenario to ensure these columns are up-to-date. If FALSE
+(the default), an error will be raised if a bbox column already exists}
+}
+\value{
+The input, invisibly
+}
+\description{
+Write this DataFrame to one or more (Geo)Parquet files. For input that contains
+geometry columns, GeoParquet metadata is written such that suitable readers can
+recreate Geometry/Geography types when reading the output and potentially read
+fewer row groups when only a subset of the file is needed for a given query.
+}
+\examples{
+\dontrun{
+# Write to a single parquet file
+df <- sd_read_parquet("input.parquet")
+sd_write_parquet(df, "output.parquet")
+
+# Write partitioned output
+sd_write_parquet(df, "output_dir", partition_by = "column_name")
+
+# Write with sorting
+sd_write_parquet(df, "output.parquet", sort_by = c("col1", "col2"))
Review Comment:
The documentation examples should include cleanup of created files (e.g.,
`unlink()`) and use `tempfile()` for output paths to avoid creating files in
the user's working directory.
```suggestion
out_file <- tempfile(fileext = ".parquet")
sd_write_parquet(df, out_file)
# Clean up
unlink(out_file)
# Write partitioned output
out_dir <- tempfile()
sd_write_parquet(df, out_dir, partition_by = "column_name")
# Clean up
unlink(out_dir, recursive = TRUE)
# Write with sorting
out_file2 <- tempfile(fileext = ".parquet")
sd_write_parquet(df, out_file2, sort_by = c("col1", "col2"))
# Clean up
unlink(out_file2)
```
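Beyond the cleanup fix, a hedged sketch of the GeoParquet-specific parameters documented above ("input.parquet" is a hypothetical input file; the parameter names come from the quoted `\usage` block):

```r
df <- sd_read_parquet("input.parquet")  # hypothetical input file

# Write GeoParquet 1.1 metadata: one bounding box column is added per
# geometry column ("bbox" for a column named "geometry", "<name>_bbox"
# otherwise), which spatially-aware readers can use to prune row groups.
out_file <- tempfile(fileext = ".parquet")
sd_write_parquet(
  df,
  out_file,
  geoparquet_version = "1.1",
  overwrite_bbox_columns = TRUE  # refresh stale bbox columns on a read -> modify -> write roundtrip
)
unlink(out_file)
```

With the default `overwrite_bbox_columns = FALSE`, re-writing input that already carries a bbox column raises an error rather than silently keeping stale extents.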
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]