This is an automated email from the ASF dual-hosted git repository.
npr pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 6a3f9eb ARROW-9473: [Doc] Polishing for 1.0
6a3f9eb is described below
commit 6a3f9ebb0eb118b9289694c5085f6d9ab2aa575d
Author: Neal Richardson <[email protected]>
AuthorDate: Tue Jul 14 16:06:17 2020 -0700
ARROW-9473: [Doc] Polishing for 1.0
Closes #7766 from nealrichardson/docs-1.0
Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
---
docs/source/format/Versioning.rst           | 12 +++++-----
r/NEWS.md                                   | 35 ++++++++++++++++++++++++-----
r/R/ipc_stream.R                            |  3 +--
r/R/record-batch.R                          |  5 -----
r/R/schema.R                                |  2 +-
r/R/table.R                                 |  5 -----
r/README.md                                 |  1 -
r/_pkgdown.yml                              |  2 +-
r/man/RecordBatch.Rd                        |  5 -----
r/man/Table.Rd                              |  5 -----
r/man/read_ipc_stream.Rd                    |  3 +--
r/man/unify_schemas.Rd                      |  2 +-
r/tests/testthat/test-read-record-batch.R   |  2 +-
r/tests/testthat/test-record-batch-reader.R |  7 +++++-
r/vignettes/install.Rmd                     |  4 ++--
15 files changed, 50 insertions(+), 43 deletions(-)
diff --git a/docs/source/format/Versioning.rst b/docs/source/format/Versioning.rst
index 2ed6670..b706569 100644
--- a/docs/source/format/Versioning.rst
+++ b/docs/source/format/Versioning.rst
@@ -18,8 +18,8 @@
Format Versioning and Stability
===============================
-Starting with version 1.0.0 (not yet released), Apache Arrow utilizes
-**two versions** to describe each release of the project. These are
+Starting with version 1.0.0, Apache Arrow utilizes
+**two versions** to describe each release of the project:
the **Format Version** and the **Library Version**. Each Library
Version has a corresponding Format Version, and multiple versions of
the library may have the same format version. For example, library
@@ -56,15 +56,15 @@ Long-Term Stability
A change in the format major version (e.g. from 1.0.0 to 2.0.0)
indicates a disruption to these compatibility guarantees in some way.
-We **do not expect** this to be a frequent occurrence starting with
-the 1.0.0 library and format release. This would be an exceptional
+We **do not expect** this to be a frequent occurrence.
+This would be an exceptional
event and, should this come to pass, we would exercise caution in
ensuring that production applications are not harmed.
Pre-1.0.0 Versions
------------------
-We have made no forward or backward compatibility guarantees for
-versions prior to 1.0.0. However, we are making every effort to ensure
+We made no forward or backward compatibility guarantees for
+versions prior to 1.0.0. However, we made every effort to ensure
that new clients can read serialized data produced by library version
0.8.0 and onward.
diff --git a/r/NEWS.md b/r/NEWS.md
index 1679e9a..c810487 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -19,22 +19,47 @@
# arrow 0.17.1.9000
+## Arrow format conversion
+
+* `vignette("arrow", package = "arrow")` includes tables that explain how R
+types are converted to Arrow types and vice versa.
+* Support added for converting to/from more Arrow types: `uint64`, `binary`,
+`fixed_size_binary`, `large_binary`, `large_utf8`, `large_list`, `list` of
+`structs`.
+* `character` vectors that exceed 2GB are converted to Arrow `large_utf8` type
+* `POSIXlt` objects can now be converted to Arrow (`struct`)
+* R `attributes()` are preserved in Arrow metadata when converting to Arrow
+RecordBatch and Table and are restored when converting from Arrow. This means
+that custom subclasses, such as `haven::labelled`, are preserved in round trip
+through Arrow.
+* Schema metadata is now exposed as a named list, and it can be modified by
+assignment like `batch$metadata$new_key <- "new value"`
+* Arrow types `int64`, `uint32`, and `uint64` are now converted to R `integer`
+if all values fit in bounds
+* Arrow `date32` is now converted to R `Date` with `double` underlying
+storage. Even though the data values themselves are integers, this provides
+stricter round-trip fidelity
+* When converting to R `factor`, `dictionary` ChunkedArrays that do not have
+identical dictionaries are properly unified
+* In the 1.0 release, the Arrow IPC metadata version is increased from V4 to
+V5. By default, `RecordBatch{File,Stream}Writer` will write V5, but you can
+specify an alternate `metadata_version`. For convenience, if you know the
+consumer you're writing to cannot read V5, you can set the environment variable
+`ARROW_PRE_1_0_METADATA_VERSION=1` to write V4 without changing any other code.
+
## Datasets
* CSV and other text-delimited datasets are now supported
-* Read datasets directly on S3 by passing a URL like `ds <-
-open_dataset("s3://...")`. Note that this currently requires a special C++
-library build with additional dependencies; that is, this is not yet available
-in CRAN releases or in nightly packages.
+* With a custom C++ build, it is possible to read datasets directly on S3 by
+passing a URL like `ds <- open_dataset("s3://...")`. Note that this currently
+requires a special C++ library build with additional dependencies--this is not
+yet available in CRAN releases or in nightly packages.
* When reading individual CSV and JSON files, compression is automatically
detected from the file extension
-## Other
+## Other enhancements
* Initial support for C++ aggregation methods: `sum()` and `mean()` are
implemented for `Array` and `ChunkedArray`
-* Schema metadata is now exposed as a named list, and it can be modified by
-assignment like `batch$metadata$new_key <- "new value"`
* Tables and RecordBatches have additional data.frame-like methods, including
`dimnames()` and `as.list()`
-* Linux installation: some tweaks to OS detection for binaries, some updates
-to known installation issues in the vignette.
-* Various streamlining efforts to reduce library size and compile time.
+* Tables and ChunkedArrays can now be moved to/from Python via `reticulate`
+
+## Bug fixes and deprecations
+
+* Non-UTF-8 strings (common on Windows) are correctly coerced to UTF-8 when
+passing to Arrow memory and appropriately re-localized when converting to R
+* The `coerce_timestamps` option to `write_parquet()` is now correctly
+implemented.
+* Creating a Dictionary array respects the `type` definition if provided by
+the user
* `read_arrow` and `write_arrow` are now deprecated; use the
`read/write_feather()` and `read/write_ipc_stream()` functions depending on
whether you're working with the Arrow IPC file or stream format, respectively.
* Previously deprecated `FileStats`, `read_record_batch`, and `read_table`
have been removed.
+## Installation and packaging
+
+* For improved performance in memory allocation, macOS and Linux binaries now
+have `jemalloc` included, and Windows packages use `mimalloc`
+* Linux installation: some tweaks to OS detection for binaries, some updates
+to known installation issues in the vignette
+* The bundled libarrow is built with the same `CC` and `CXX` values that R uses
+* Failure to build the bundled libarrow yields a clear message
+* Various streamlining efforts to reduce library size and compile time
+
# arrow 0.17.1
* Updates for compatibility with `dplyr` 1.0
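Two of the new NEWS items above are easiest to grasp from a short sketch: the named-list metadata interface and the V4 fallback for consumers that cannot yet read IPC metadata V5. A minimal R example, assuming a build of the package from this branch (the file name tmp.arrows is illustrative):

    library(arrow)

    batch <- record_batch(x = 1:3)
    # Schema metadata is a named list; modify it by assignment
    batch$metadata$new_key <- "new value"
    batch$metadata$new_key

    # Fall back to IPC metadata V4 for older consumers,
    # without changing any other code
    Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = 1)
    write_ipc_stream(batch, "tmp.arrows")
    Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = "")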
diff --git a/r/R/ipc_stream.R b/r/R/ipc_stream.R
index ebc5b77..0c728b2 100644
--- a/r/R/ipc_stream.R
+++ b/r/R/ipc_stream.R
@@ -82,8 +82,7 @@ write_to_raw <- function(x, format = c("stream", "file")) {
#' `read_arrow()`, a wrapper around `read_ipc_stream()` and `read_feather()`,
#' is deprecated. You should explicitly choose
#' the function that will read the desired IPC format (stream or file) since
-#' a file or `InputStream` may contain either. `read_table()`, a wrapper around
-#' `read_arrow()`, is also deprecated
+#' a file or `InputStream` may contain either.
#'
#' @param file A character file name, `raw` vector, or an Arrow input stream.
#' If a file name, a memory-mapped Arrow [InputStream] will be opened and
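Since the deprecated read_arrow() no longer picks the format for you, the doc change above asks callers to match the reader to the IPC format on disk. A short sketch of the replacement calls (file names illustrative):

    library(arrow)

    df <- data.frame(x = 1:3)

    # Arrow IPC file (random access) format
    write_feather(df, "data.feather")
    tab <- read_feather("data.feather")

    # Arrow IPC stream format
    write_ipc_stream(df, "data.arrows")
    tab2 <- read_ipc_stream("data.arrows")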
diff --git a/r/R/record-batch.R b/r/R/record-batch.R
index 6e4705f..cc68348 100644
--- a/r/R/record-batch.R
+++ b/r/R/record-batch.R
@@ -38,11 +38,6 @@
#' "Slice" method function even if there were a column in the table called
#' "Slice".
#'
-#' A caveat about the `[` method for row operations: only "slicing" is
-#' currently supported. That is, you can select a continuous range of rows
-#' from the table, but you can't filter with a `logical` vector or take an
-#' arbitrary selection of rows by integer indices.
-#'
#' @section R6 Methods:
#' In addition to the more R-friendly S3 methods, a `RecordBatch` object has
#' the following R6 methods that map onto the underlying C++ methods:
diff --git a/r/R/schema.R b/r/R/schema.R
index 839326a..963e5f4 100644
--- a/r/R/schema.R
+++ b/r/R/schema.R
@@ -189,7 +189,7 @@ read_schema <- function(stream, ...) {
#' \dontrun{
#' a <- schema(b = double(), c = bool())
#' z <- schema(b = double(), k = utf8())
-#' unify_schemas(a, z),
+#' unify_schemas(a, z)
#' }
unify_schemas <- function(..., schemas = list(...)) {
shared_ptr(Schema, arrow__UnifySchemas(schemas))
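With the stray comma gone, the example runs as written; unifying the two schemas yields the union of their fields:

    library(arrow)

    a <- schema(b = double(), c = bool())
    z <- schema(b = double(), k = utf8())
    unify_schemas(a, z)  # schema with fields b, c, and k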
diff --git a/r/R/table.R b/r/R/table.R
index 64095f8..1391eee 100644
--- a/r/R/table.R
+++ b/r/R/table.R
@@ -47,11 +47,6 @@
#' "Slice" method function even if there were a column in the table called
#' "Slice".
#'
-#' A caveat about the `[` method for row operations: only "slicing" is
-#' currently supported. That is, you can select a continuous range of rows
-#' from the table, but you can't filter with a `logical` vector or take an
-#' arbitrary selection of rows by integer indices.
-#'
#' @section R6 Methods:
#' In addition to the more R-friendly S3 methods, a `Table` object has
#' the following R6 methods that map onto the underlying C++ methods:
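The `$` caveat retained in both help pages is easy to demonstrate; `[[` is the unambiguous way to reach a column whose name collides with an R6 method (the column name Slice below is contrived to force the collision):

    library(arrow)

    tab <- Table$create(Slice = 1:3)
    tab$Slice        # returns the R6 "Slice" method, not the column
    tab[["Slice"]]   # returns the column named "Slice"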
diff --git a/r/README.md b/r/README.md
index e8972a0..a0e2034 100644
--- a/r/README.md
+++ b/r/README.md
@@ -3,7 +3,6 @@
[](https://cran.r-project.org/package=arrow)
[](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
[](https://anaconda.org/conda-forge/r-arrow)
-[](https://codecov.io/gh/ursa-labs/arrow-r-nightly)
[Apache Arrow](https://arrow.apache.org/) is a cross-language
development platform for in-memory data. It specifies a standardized
diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
index c68d153..ff48eef 100644
--- a/r/_pkgdown.yml
+++ b/r/_pkgdown.yml
@@ -101,6 +101,7 @@ reference:
- record_batch
- RecordBatch
- Table
+ - Scalar
- read_message
- title: Arrow data types and schema
contents:
@@ -130,7 +131,6 @@ reference:
- default_memory_pool
- FileSystem
- FileInfo
- - FileStats
- FileSelector
- title: Configuration
contents:
diff --git a/r/man/RecordBatch.Rd b/r/man/RecordBatch.Rd
index a57bb0c..40c3496 100644
--- a/r/man/RecordBatch.Rd
+++ b/r/man/RecordBatch.Rd
@@ -35,11 +35,6 @@ A caveat about the \code{$} method: because \code{RecordBatch} is an \code{R6} o
precedence over the table's columns. So, \code{batch$Slice} would return the
"Slice" method function even if there were a column in the table called
"Slice".
-
-A caveat about the \code{[} method for row operations: only "slicing" is
-currently supported. That is, you can select a continuous range of rows
-from the table, but you can't filter with a \code{logical} vector or take an
-arbitrary selection of rows by integer indices.
}
\section{R6 Methods}{
diff --git a/r/man/Table.Rd b/r/man/Table.Rd
index aebb2b3..2014a30 100644
--- a/r/man/Table.Rd
+++ b/r/man/Table.Rd
@@ -35,11 +35,6 @@ A caveat about the \code{$} method: because \code{Table} is an \code{R6} object,
precedence over the table's columns. So, \code{tab$Slice} would return the
"Slice" method function even if there were a column in the table called
"Slice".
-
-A caveat about the \code{[} method for row operations: only "slicing" is
-currently supported. That is, you can select a continuous range of rows
-from the table, but you can't filter with a \code{logical} vector or take an
-arbitrary selection of rows by integer indices.
}
\section{R6 Methods}{
diff --git a/r/man/read_ipc_stream.Rd b/r/man/read_ipc_stream.Rd
index 0ea54f6..1cc969b 100644
--- a/r/man/read_ipc_stream.Rd
+++ b/r/man/read_ipc_stream.Rd
@@ -33,8 +33,7 @@ and \code{\link[=read_feather]{read_feather()}} read those formats, respectively
\code{read_arrow()}, a wrapper around \code{read_ipc_stream()} and \code{read_feather()},
is deprecated. You should explicitly choose
the function that will read the desired IPC format (stream or file) since
-a file or \code{InputStream} may contain either. \code{read_table()}, a wrapper around
-\code{read_arrow()}, is also deprecated
+a file or \code{InputStream} may contain either.
}
\seealso{
\code{\link[=read_feather]{read_feather()}} for reading IPC files. \link{RecordBatchReader} for a
diff --git a/r/man/unify_schemas.Rd b/r/man/unify_schemas.Rd
index f7d01a1..a6b7ec0 100644
--- a/r/man/unify_schemas.Rd
+++ b/r/man/unify_schemas.Rd
@@ -21,6 +21,6 @@ Combine and harmonize schemas
\dontrun{
a <- schema(b = double(), c = bool())
z <- schema(b = double(), k = utf8())
-unify_schemas(a, z),
+unify_schemas(a, z)
}
}
diff --git a/r/tests/testthat/test-read-record-batch.R b/r/tests/testthat/test-read-record-batch.R
index 2412743..8eb196a 100644
--- a/r/tests/testthat/test-read-record-batch.R
+++ b/r/tests/testthat/test-read-record-batch.R
@@ -15,7 +15,7 @@
# specific language governing permissions and limitations
# under the License.
-context("read_record_batch()")
+context("reading RecordBatches")
test_that("RecordBatchFileWriter / RecordBatchFileReader roundtrips", {
tab <- Table$create(
diff --git a/r/tests/testthat/test-record-batch-reader.R b/r/tests/testthat/test-record-batch-reader.R
index 2b621ed..e03664e 100644
--- a/r/tests/testthat/test-record-batch-reader.R
+++ b/r/tests/testthat/test-record-batch-reader.R
@@ -80,8 +80,13 @@ test_that("MetadataFormat", {
expect_identical(get_ipc_metadata_version("V4"), 3L)
expect_identical(get_ipc_metadata_version(NULL), 4L)
Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = 1)
- on.exit(Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = ""))
expect_identical(get_ipc_metadata_version(NULL), 3L)
+ Sys.setenv(ARROW_PRE_0_15_IPC_FORMAT = "")
+
+ expect_identical(get_ipc_metadata_version(NULL), 4L)
+ Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = 1)
+ expect_identical(get_ipc_metadata_version(NULL), 3L)
+ Sys.setenv(ARROW_PRE_1_0_METADATA_VERSION = "")
expect_error(
get_ipc_metadata_version(99),
diff --git a/r/vignettes/install.Rmd b/r/vignettes/install.Rmd
index 2dad01e..d7b4156 100644
--- a/r/vignettes/install.Rmd
+++ b/r/vignettes/install.Rmd
@@ -264,8 +264,8 @@ See discussion
[here](https://issues.apache.org/jira/browse/ARROW-8586).
* If you have multiple versions of `zstd` installed on your system,
installation by building the C++ from source may fail with an undefined symbols
-error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary;
-(2) setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
+error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
+setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
the conflicting `zstd`.
See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556).
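A sketch of workarounds (1) and (2) from the paragraph above, set in the R session before installing; the values shown are illustrative, and these variables are read when the package builds its bundled C++:

    # (1) prefer a prebuilt C++ binary rather than compiling from source
    Sys.setenv(LIBARROW_BINARY = "true")
    install.packages("arrow")

    # (2) or build the bundled C++ without zstd
    # Sys.setenv(ARROW_WITH_ZSTD = "OFF")
    # install.packages("arrow")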