[GitHub] [arrow-site] ianmcook commented on a change in pull request #127: Blog post for 5.0.0 release

GitBox Tue, 03 Aug 2021 13:09:32 -0700


ianmcook commented on a change in pull request #127:
URL: https://github.com/apache/arrow-site/pull/127#discussion_r682066764




##########
File path: _posts/2021-07-20-5.0.0-release.md
##########
@@ -0,0 +1,267 @@
+---
+layout: post
+title: "Apache Arrow 5.0.0 Release"
+date: "2020-07-29 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 5.0.0 release. This covers
+3 months of development work and includes **684 commits** from
+[**99 distinct contributors**][1] in 2 repositories. See the Install Page to
+learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the complete changelogs for the [`apache/arrow`][2] and
+[`apache/arrow-rs`][3] repositories.
+
+## Community
+
+Since the 4.0.0 release, Daniël Heres, Kazuaki Ishizaki, Dominik Moritz, and Weston Pace
+have been invited as committers to Arrow,
+and Benjamin Kietzman and David Li have joined the Project Management Committee
+(PMC). Thank you for all of your contributions!
+
+## Columnar Format Notes
+
+Official IANA Media types (MIME types) have been registered for Apache
+Arrow IPC protocol data, both [stream]({{ site.baseurl }}/docs/format/Columnar.html#ipc-streaming-format)
+and [file]({{ site.baseurl }}/docs/format/Columnar.html#ipc-file-format) variants:
+
+* https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream
+* https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file
+
+We recommend `.arrow` as the IPC file format file extension and `.arrows` for
+the IPC streaming format file extension.
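
As a rough illustration of the extension recommendation above, the following Python sketch maps a file name to the matching registered media type. Only the two extensions and the two media type strings come from the quoted text. The helper name `arrow_media_type` and the example file names are hypothetical.

```python
from pathlib import PurePath
from typing import Optional

# Extensions recommended in the quoted release notes, paired with the
# IANA-registered media types for the IPC file and streaming formats.
ARROW_MEDIA_TYPES = {
    ".arrow": "application/vnd.apache.arrow.file",
    ".arrows": "application/vnd.apache.arrow.stream",
}


def arrow_media_type(path: str) -> Optional[str]:
    """Return the Arrow media type for `path`, or None for other extensions.

    Hypothetical helper for illustration; not part of any Arrow library.
    """
    suffix = PurePath(path).suffix.lower()
    return ARROW_MEDIA_TYPES.get(suffix)


if __name__ == "__main__":
    for name in ("trades.arrow", "trades.arrows", "trades.csv"):
        print(name, "->", arrow_media_type(name))
```

A web server could use such a lookup to set the `Content-Type` header when serving Arrow IPC data, though how that is wired up depends on the framework.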
+
+## Arrow Flight RPC notes
+
+The Go implementation now supports custom metadata and middleware, and has
+been added to integration testing.
+
+In Python, some operations can now be interrupted via Control-C.
+
+## C++ notes
+
+`MakeArrayFromScalar` now works for fixed-size binary types (ARROW-13321).
+
+### Compute layer
+
+The following [compute functions]({{site.baseurl}}/docs/cpp/compute.html)
+were added:
+
+* aggregations: `index`
+
+* scalar arithmetic and math functions: `abs`, `abs_checked`, `acos`,
+  `acos_checked`, `asin`, `asin_checked`, `atan`, `atan2`, `ceil`, `cos`,
+  `cos_checked`, `floor`, `ln`, `ln_checked`, `log10`, `log10_checked`,
+  `log1p`, `log1p_checked`, `log2`, `log2_checked`, `negate`, `negate_checked`,
+  `sign`, `sin`, `sin_checked`, `tan`, `tan_checked`, `trunc`
+
+* scalar bitwise functions: `bit_wise_and`, `bit_wise_not`, `bit_wise_or`,
+  `bit_wise_xor`, `shift_left`, `shift_left_checked`, `shift_right`,
+  `shift_right_checked`
+
+* scalar string functions: `ascii_center`, `ascii_lpad`, `ascii_reverse`,
+  `ascii_rpad`, `binary_join`, `binary_join_element_wise`,
+  `binary_replace_slice`, `count_substring`, `count_substring_regex`,
+  `ends_with`, `find_substring`, `find_substring_regex`, `match_like`,
+  `split_pattern_regex`, `starts_with`, `utf8_center`, `utf8_lpad`,
+  `utf8_replace_slice`, `utf8_rpad`, `utf8_reverse`, `utf8_slice_codepoints`
+
+* scalar temporal functions: `day`, `day_of_week`, `day_of_year`,
+  `iso_calendar`, `iso_week`, `iso_year`, `hour`, `microsecond`, `millisecond`,
+  `minute`, `month`, `nanosecond`, `quarter`, `second`, `subsecond`, `year`
+
+* other scalar functions: `case_when`, `coalesce`, `if_else`, `is_finite`,
+  `is_inf`, `is_nan`, `max_element_wise`, `min_element_wise`, `make_struct`
+
+* vector functions: `replace_with_mask`
+
+Duplicates are now allowed in `SetLookupOptions::value_set` (ARROW-12554).
+
+Decimal types are now supported by some basic arithmetic functions (ARROW-12074).
+
+The `take` function now supports dense unions (ARROW-13005).
+
+It is now possible to cast between dictionary types with different index
+types (ARROW-11673).
+
+Sorting is now implemented for boolean input (ARROW-12016).
+
+### CSV
+
+The streaming CSV reader can now take some advantage of multiple threads (ARROW-11889).
+
+The CSV reader tries to make its errors more informative by adding the
+row number when it is known, i.e. when parallel reading is disabled (ARROW-12675).
+
+A new option `ReaderOptions::skip_rows_after_names` allows skipping a number
+of rows _after_ reading the column names (as opposed to
+`ReaderOptions::skip_rows`).
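
The difference between the two skip options can be sketched in plain Python. This is only an illustration of the semantics described above, built on the standard `csv` module. It is not the Arrow reader, and the function name `read_csv_rows` and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical sample: one junk line before the header, one units row after it.
SAMPLE = "# exported by tool\nid,name\n(units),(none)\n1,a\n2,b\n"


def read_csv_rows(text, skip_rows=0, skip_rows_after_names=0):
    """Mimic the two options on an in-memory CSV string.

    skip_rows drops rows *before* the column names are read;
    skip_rows_after_names drops rows *after* the column names.
    """
    rows = list(csv.reader(io.StringIO(text)))
    rows = rows[skip_rows:]              # discard leading rows first
    names, data = rows[0], rows[1:]      # the next row holds the column names
    data = data[skip_rows_after_names:]  # then discard rows following the names
    return names, data


if __name__ == "__main__":
    names, data = read_csv_rows(SAMPLE, skip_rows=1, skip_rows_after_names=1)
    print(names)
    print(data)
```

With both options set to 1 on the sample, the comment line is dropped before the header is read and the units row is dropped afterwards, leaving only the two data rows.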
+
+Quoted strings can now be treated as always non-null (ARROW-10115).
+
+### Dataset layer
+
+The asynchronous scanner introduced in 4.0.0 has been improved with truly 
+asynchronous readers implemented for CSV, Parquet, and IPC file formats and 
+file-level parallelism added.  This mode is controlled by a flag `use_async` that
+can be passed into methods which scan a dataset.  Setting this flag to True
+will have significant improvements on filesystems with high latency or parallel
+reads (e.g. S3).
+
+A CountRows method has been added to count rows matching a predicate; where
+possible, this will use metadata in files instead of reading the data itself.
+
+CSV datasets can now be written, and when reading a CSV dataset, explicit types can
+now be specified for a subset of columns while allowing the rest to still be inferred.
+
+### IO and Filesystem layer
+
+The I/O thread pool size can now be adjusted at runtime (ARROW-12760).
+The default size remains 8 threads.
+
+Streams now can have auxiliary metadata, depending on the backend.  This
+has been implemented for the S3 filesystems, where a couple metadata
+keys are supported such as `Content-Type` and `ACL` (ARROW-11161, ARROW-12719).
+
+The HadoopFileSystem implementation now implements the FileSystem abstraction
+more faithfully (ARROW-12790).
+
+### Parquet
+
+The new LZ4_RAW compression scheme was implemented (PARQUET-1998).
+Unlike the legacy LZ4 compression scheme, it is defined unambiguously
+and should provide better portability once other Parquet implementations
+catch up.
+
+## Go notes
+
+
+### Flight 
+
+* Flight Client and Server now support Custom Metadata through the functions `flight.NewClientWithMiddleware` and `flight.NewServerWithMiddleware`. Functions
+`flight.NewFlightClient`, `flight.NewFlightServer`, `flight.CreateServerBearerTokenAuthInterceptors` have been deprecated in favor of using the new middleware. [#10633](https://github.com/apache/arrow/pull/10633)
+* Flight Client `AuthHandler` no longer overwrites outgoing metadata, correctly appending new metadata without overwriting existing metadata [#10297](https://github.com/apache/arrow/pull/10297)
+* Flight AppMetadata field is now exposed both for Reading and Writing via `flight.Reader#LatestAppMetadata()` and `flight.Writer#WriteWithAppMetadata` functions [#10142](https://github.com/apache/arrow/pull/10142)
+
+### Other enhancements 
+
+* [Map](https://github.com/apache/arrow/pull/10106) and [Extension](https://github.com/apache/arrow/pull/10203) Datatypes are now implemented for Arrow Arrays
+* [Schema package](https://github.com/apache/arrow/10071) and first part of [Encoding package](https://github.com/apache/arrow/10379) added for Golang Parquet Implementation
+
+## Java notes
+
+Highlighted improvements and fixes:
+
+* Improved support for extension types using a complex storage type, e.g. struct, map or union. These can now extend
+the `ExtensionTypeVector` base class.
+* Union vectors now extend `AbstractContainerVector` to be consistent with other vectors.
+* Guava dependency updated to 30.1.1
+* Memory leak fixed if an exception occurs when reading IPC messages from a channel. [#10423](https://github.com/apache/arrow/pull/10423)
+* Flight error metadata is now propagated to the client. [#10370](https://github.com/apache/arrow/pull/10370)
+* JDBC adapter now preserves nullability. [#10285](https://github.com/apache/arrow/pull/10285)
+* The memory rounding policy is respected when allocating vector buffers. This helps saving memory space. [#10576](https://github.com/apache/arrow/pull/10576)
+
+API compatibility changes:
+
+* Complex vectors now return covariant types from `getObject(int)`. [#9964](https://github.com/apache/arrow/pull/9964)
+
+## JavaScript notes
+
+* Tables do not extend DataFrames anymore. This enables smaller bundles. [#10277](https://github.com/apache/arrow/pull/10277)
+* Arrow uses closure compiler for all UMD bundles, making them smaller. [#10281](https://github.com/apache/arrow/pull/10281)
+* The npm package now comes with declaration maps for better navigation from types to source code. [#10673](https://github.com/apache/arrow/pull/10673)
+* Updated dependencies and improvements to the code.
+
+## Python notes
+
+* Datasets can now scan files asynchronously when the `use_async=True` option is provided to `Dataset.scanner`, `Dataset.to_table`, or `Dataset.to_batches` methods. This should provide better performance in environments where I/O can be slow, such as with remote sources.
+* Arrow now provides builtin support for writing CSV files through `pyarrow.csv.write_csv`
+* Wheels for Apple M1 Macs are now provided.
+* Many new `pyarrow.compute` functions are available (see the C++ notes above
+  for more details), and introspection of the functions was improved so that
+  they look more like standard Python functions.
+* It is now possible to access ORC file metadata from Python `ORCFile` objects
+* Building a `StructArray` now accepts a `mask` like other arrays
+* Many updates and fixes for the documentation
+
+## R notes
+
+In this release, we've more than doubled the number of functions you can call on Arrow Datasets inside `dplyr::filter()` and `mutate()`, including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We've also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.

Review comment:
       ```suggestion
   In this release, we've more than doubled the number of functions you can call on Arrow Datasets inside `dplyr::filter()`, `mutate()`, and `arrange()`, including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We've also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.
   ```




