alamb commented on code in PR #65:
URL: https://github.com/apache/datafusion-site/pull/65#discussion_r2030236857


##########
content/blog/2025-03-30-datafusion-python-46.0.0.md:
##########
@@ -0,0 +1,300 @@
+---
+layout: post
+title: Apache DataFusion Python 46.0.0 Released
+date: 2025-03-30
+author: timsaucer
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+We are happy to announce that [datafusion-python 46.0.0] has been released. This release
+brings in all of the new features of the core [DataFusion 46.0.0] library. Since the last
+blog post for [datafusion-python 43.1.0], a large number of improvements have been made
+that can be found in the [changelogs].
+
+We highly recommend reviewing the upstream [DataFusion 46.0.0] announcement.
+
+[DataFusion 46.0.0]: https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0
+[datafusion-python 43.1.0]: https://datafusion.apache.org/blog/2024/12/14/datafusion-python-43.1.0/
+[datafusion-python 46.0.0]: https://pypi.org/project/datafusion/46.0.0/
+[changelogs]: https://github.com/apache/datafusion-python/tree/main/dev/changelog
+
+## Easier file reading
+
+In these releases we have introduced two new ways to more easily read files into
+DataFrames.
+
+PR [#982] introduced a series of easier read functions for Parquet, JSON, CSV, and
+AVRO files. This introduces a concept of a global context that is available by
+default when using these methods. Now instead of creating a default Session
+Context and then calling the read methods, you can simply import these alternative
+read methods and begin working with your DataFrames. Below is an example of
+how easy this new approach is to use.
+
+```python
+from datafusion.io import read_parquet
+df = read_parquet(path="./examples/tpch/data/customer.parquet")
+```
+
+PR [#980] adds a method for setting up a session context to use URL tables. With
+this enabled, you can use a path to a local file as a table name. An example
+of how to use this is demonstrated in the following snippet.
+
+```python
+import datafusion
+ctx = datafusion.SessionContext().enable_url_table()
+df = ctx.table("./examples/tpch/data/customer.parquet")
+```
+
+[#982]: https://github.com/apache/datafusion-python/pull/982
+[#980]: https://github.com/apache/datafusion-python/pull/980
+
+## Registering Table Views
+
+DataFusion supports registering a logical plan as a view with a session context. This
+allows for work flows to create views in one part of the work flow and pass the session
+context around to other places where that logical plan can be reused. This is a useful
+feature for building up complex workflows and for code clarity. PR [#1016] enables this
+feature in `datafusion-python`.

Review Comment:
   Here is a minor suggestion on wording:

   ```suggestion
   DataFusion supports registering a logical plan as a view with a session context. This
   allows creating views in one part of your work flow and passing the session
   context to other places where that logical plan can be reused. This is a useful
   feature for building up complex workflows and for code clarity. PR [#1016] enables this
   feature in `datafusion-python`.
   ```


##########
content/blog/2025-03-30-datafusion-python-46.0.0.md:
##########
@@ -0,0 +1,300 @@
+
+For example, supposing you have a DataFrame called `df1`, you could use this code snippet
+to register the view and then use it in another place:
+
+```python
+ctx.register_view("view1", df1)
+```
+
+And then in another portion of your code which has access to the same session context
+you can retrieve the DataFrame with:
+
+```python
+df2 = ctx.table("view1")
+```
+
+[#1016]: https://github.com/apache/datafusion-python/pull/1016
+
+## Asynchronous Iteration of Record Batches
+
+Retrieving a `RecordBatch` from a `RecordBatchStream` was a synchronous call, which would
+require the end user's code to wait for the data retrieval. This is described in
+[Issue 974]. We continue to support this as a synchronous iterator, but we have also added
+the ability to retrieve the `RecordBatch` using the Python asynchronous `anext`
+function.
+
+[Issue 974]: https://github.com/apache/datafusion-python/issues/974
+
+## Default Compression for Parquet files
+
+With PR [#981], we change the saving of Parquet files to use zstd compression by default.
+Previously the default was uncompressed, causing excessive disk usage. Zstd is an
+excellent compression scheme that balances speed and compression ratio. Users can still
+save their Parquet files uncompressed by passing in the appropriate value to the
+`compression` argument when calling `DataFrame.write_parquet`.
+
+[#981]: https://github.com/apache/datafusion-python/pull/981
+
+## UDF Decorators
+
+In PRs [#1040] and [#1061] we add methods to make creating user defined functions
+easier and take advantage of Python decorators. With these PRs you no longer need to
+define a function and then separately define a udf of that function. Instead you can
+simply add the appropriate `udf` decorator. Similar methods exist for aggregate
+and window user defined functions.
+
+```python
+import pyarrow as pa
+from datafusion import udf
+
+@udf([pa.int64(), pa.int64()], pa.bool_(), "stable")
+def my_custom_function(
+    age: pa.Array,
+    favorite_number: pa.Array,
+) -> pa.Array:
+    pass
+```
+
+[#1040]: https://github.com/apache/datafusion-python/pull/1040
+[#1061]: https://github.com/apache/datafusion-python/pull/1061
+
+
+## `uv` package management
+
+[uv] is an extremely fast Python package manager, written in Rust. In the previous version
+of `datafusion-python` we had a combination of settings for PyPI and Conda. Instead, we
+have switched to using [uv] as our primary method for dependency management.
+
+For most users of DataFusion, this change will be transparent. You can still install
+via `pip` or `conda`. For developers, the instructions in the repository have been updated.
+
+[uv]: https://github.com/astral-sh/uv
+
+## Code cleanup
+
+In an effort to improve our code cleanliness and ensure we are following Python best
+practices, we use [ruff] to perform Python linting. Until now we enabled only a portion
+of the available linters. In PRs [#1055] and [#1062], we enabled many more
+of these linters and made code improvements to ensure we are following these
+recommendations.
+
+[ruff]: https://docs.astral.sh/ruff/
+[#1055]: https://github.com/apache/datafusion-python/pull/1055
+[#1062]: https://github.com/apache/datafusion-python/pull/1062
+
+## Improved Jupyter Notebook rendering
+
+Since PR [#839] in DataFusion 41.0.0 we have been able to render DataFrames using html in
+[jupyter] notebooks. This is a big improvement over the `show` command when we have the
+ability to render tables. In PR [#1036] we went a step further and added in a variety
+of features.
+
+- Now html tables are scrollable, vertically and horizontally.
+- When data are truncated, we report this to the user.
+- Instead of showing a small number of rows, we collect up to 2 megabytes of data to
+display. Since we have scrollable tables, we are able to make more data available
+to the user without sacrificing notebook usability.
+- We report explicitly when the DataFrame is empty. Previously we would not output
+anything for an empty table. This indicator is helpful to users to ensure their plans
+are written correctly. Sometimes a non-output can be overlooked.
+- For long output of data, we generate a collapsed view of the data with an option
+for the user to click on it to expand the data.
+
+In the view below you can see an example of some of these features such as the
+expandable text and scroll bars.
+
+<figure style="text-align: center;">
+  <img
+    src="/blog/images/python-datafusion-46.0.0/html_rendering.png"
+    width="100%"
+    class="img-responsive"
+    alt="Fig 1: Example html rendering in a jupyter notebook."
+  >
+  <figcaption>
+    <b>Figure 1</b>: With the html rendering enhancements, tables are more easily
+    viewable in jupyter notebooks.
+</figcaption>
+</figure>
+
+[jupyter]: https://jupyter.org/
+[#839]: https://github.com/apache/datafusion-python/pull/839
+[#1036]: https://github.com/apache/datafusion-python/pull/1036
+
+## Extension Documentation
+
+We have recently added [Extension Documentation] to the DataFusion in Python website. We
+have received many requests about how to better understand how to integrate DataFusion
+in Python with other Rust libraries. To address these questions we wrote an article about
+some of the difficulties that we encounter when using Rust libraries in Python and our
+approach to addressing them.
+
+[Extension Documentation]: https://datafusion.apache.org/python/contributor-guide/ffi.html
+
+## Migration Guide
+
+During the upgrade from [DataFusion 43.0.0] to [DataFusion 44.0.0] as our upstream core
+dependency, we discovered a few changes were necessary within our repository and our
+unit tests. These notes serve to help guide users who may encounter similar issues when
+upgrading.
+
+- `RuntimeConfig` is now deprecated in favor of `RuntimeEnvBuilder`. The migration is
+fairly straightforward, and the corresponding classes have been marked as deprecated. For
+end users it should be simply a matter of changing the class name.
+- If you perform a `concat` of a `string_view` and `string`, it will now return a
+`string_view` instead of a `string`. This likely only impacts unit tests that are validating
+return types. In general, it is recommended to switch to using `string_view` whenever
+possible. You can see the blog articles [String View Pt 1] and [Pt 2] for more information
+on these performance improvements.
+- The function `date_part` now returns an `int32` instead of a `float64`. This is likely
+only impactful to unit tests.
+- We have upgraded the Python minimum version to 3.9 since 3.8 is no longer officially
+supported.
+
+[DataFusion 43.0.0]: https://github.com/apache/datafusion/blob/main/dev/changelog/43.0.0.md
+[DataFusion 44.0.0]: https://github.com/apache/datafusion/blob/main/dev/changelog/44.0.0.md
+[String View Pt 1]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
+[Pt 2]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/
+
+## Coming Soon
+
+There is a lot of excitement around the upcoming work. This list is not comprehensive, but
+a glimpse into some of the upcoming work includes:
+
+- Reusable DataFusion UDFs: The way user defined functions are currently written in
+`datafusion-python` is slightly different from those written for the upstream Rust
+`datafusion`. The core ideas are usually the same, but it means it takes effort for users
+to re-implement functions already written for Rust projects to be usable in Python. Issue
+[#1017] addresses this topic. Work is well underway to make it easier to expose these
+user functions through the FFI boundary. This means that the work that already exists in
+repositories such as those found in the [datafusion-contrib] project can be easily
+re-used in Python. This will provide a low effort way to expose significant functionality
+to the DataFusion in Python community.
+- Additional table providers: We have work well underway to provide a host of table providers
+to `datafusion-python` including: sqlite, duckdb, postgres, odbc, and mysql! In
+[datafusion-contrib #279] we track the progress of this excellent work. Once complete, users
+will be able to `pip install` this library and get easy access to all of these table
+providers. This is another way we are leveraging the FFI work to greatly expand the usability
+of `datafusion-python` with relatively low effort.
+- External catalog and schema providers: For users who wish to go beyond table providers
+and have an entire custom catalog with schema, Issue [#1091] tracks the progress of exposing
+this in Python. With this work, if you have already written a Rust based table catalog you
+will be able to interface it in Python similar to the work described for the table
+providers above.
+
+This is only a sample of the great work that is being done. If there are features you would
+love to see, we encourage you to open an issue and join us as we build something wonderful.
+
+[#1017]: https://github.com/apache/datafusion-python/issues/1017
+[datafusion-contrib #279]: https://github.com/datafusion-contrib/datafusion-table-providers/issues/279
+[#1091]: https://github.com/apache/datafusion-python/issues/1091
+[datafusion-contrib]: https://github.com/datafusion-contrib
+
+## Appreciation
+
+We would like to thank everyone who has helped with these releases through their helpful
+conversations, code review, issue descriptions, and code authoring.
+We would especially
+like to thank the following authors of PRs who made these releases possible, listed in
+alphabetical order by username: [@chenkovsky], [@CrystalZhou0529], [@ion-elgreco],

Review Comment:
   FYI @chenkovsky, @CrystalZhou0529, @ion-elgreco, @jsai28, @kevinjqliu, @kylebarron, @kosiew, @nirnayroy, and @Spaarsh


##########
content/blog/2025-03-30-datafusion-python-46.0.0.md:
##########
@@ -0,0 +1,300 @@
+import datafusion
+ctx = datafusion.SessionContext().enable_url_table()

Review Comment:
   FYI @goldmedal (this is exposing your great work via datafusion-python 🐱 🎉 )


##########
content/blog/2025-03-30-datafusion-python-46.0.0.md:
##########
@@ -0,0 +1,300 @@
+## Improved Jupyter Notebook rendering

Review Comment:
   this is really cool


##########
content/blog/2025-03-30-datafusion-python-46.0.0.md:
##########
@@ -0,0 +1,300 @@
+## Default Compression for Parquet files

Review Comment:
   ```suggestion
   ## Default ZSTD Compression for Parquet files
   ```


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org