westonpace opened a new pull request, #13582:
URL: https://github.com/apache/datafusion/pull/13582

   ## Which issue does this PR close?
   
   Closes #10339 
   
   ## Rationale for this change
   
   Many catalogs are remote (and/or disk based) and offer only asynchronous 
APIs.  For example, [Polaris](https://github.com/apache/polaris), 
[Unity](https://github.com/unitycatalog/unitycatalog), and 
[Hive](https://hive.apache.org/).  Integrating with this catalogs is impossible 
since something like `ctx.sql("SELECT * FROM db.schm.tbl")` first enters an 
async context (`sql`) then a synchronous context (calling the catalog provider 
to resolve `db`) and then we need to go into an asynchronous context to 
interact with the catalog and this async -> sync -> async path is generally 
forbidden.
   
   ## What changes are included in this PR?
   
   The heart of the change is making all non-trivial methods async in 
`CatalogProvider`, `SchemaProvider`, and `CatalogProviderList`.
   
   These changes had rather far-reaching ramifications discussed more below.
   
   ## Are these changes tested?
   
   Yes, in the sense that these traits were all tested previously.  I did not 
add any new testing.
   
   ## Are there any user-facing changes?
   
   Yes, there are significant user-facing changes beyond the obvious change 
that these public traits are now async.
   
   ### Notable but expected
   
   The following methods are now async and were previously sync
   
   ```
   SessionContext::register_catalog
   SessionContext::catalog_names
   SessionContext::catalog
   SessionContext::register_table
   SessionContext::deregister_table
   SessionContext::table_exist
   ```
   
   ### Perhaps surprising
   
   The `SessionStateBuilder::build` method and 
`SessionStateBuilder::new_from_existing` methods are now async and the `From` 
impls to go between `SessionState` and `SessionStateBuilder` were removed.
   
   The `new_from_existing` change is because the method does a (now async) 
lookup into the existing catalog list (which may be remote) to determine if a 
default catalog exists so that it knows if it needs to create a new one or not.
   
   The `build` change is because, if no default catalog exists, then a default 
catalog is created and registered with the (potentially remote) catalog list.
   
   The `SessionContext::register_batch` and `SessionContext::register_table` 
methods are used frequently and it may make sense to think of them as 
synchronous since in-memory tables and batches are not typically thought of as 
something a "catalog" provides.  It would be possible for `SessionContext` to 
have both "a catalog" (which is async) and "a collection of in-memory 
session-specific tables" (which is sync) and thus keep these methods 
synchronous.  However, that felt like more complexity than justified by this 
change.
   
   ### Caveats
   
   In most cases I simply propagated the async.  In some benchmarks I am using 
`now_or_never().expect()` to avoid the need to create a tokio runtime.
   
   In a few of the `SessionContext` factory functions, where a custom catalog 
cannot possibly be provided (e.g. `SessionContext::default`) I use 
`now_or_never` since it is safe to assume the default in-memory catalog is 
synchronous.
   
   ### Details for review
   
   The only non-trivial implementation changes were in `SessionState`.  There 
is a `RwLock` there and it required some fiddling (and a few Arc clones of the 
catalog list) to avoid holding the lock across an await boundary.  This could 
be potentially simplified by changing the `RwLock` to an async lock but I'm not 
sure that's justified yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to