alamb commented on code in PR #11035: URL: https://github.com/apache/datafusion/pull/11035#discussion_r1720948344
########## datafusion/catalog/src/dynamic_file/catalog.rs: ########## @@ -0,0 +1,250 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! dynamic_file contains [`DynamicFileCatalog`] that creates tables from file paths +//! if the wrapped [`CatalogProviderList`] doesn't have the table provider. + +use crate::{CatalogProvider, CatalogProviderList, SchemaProvider, TableProvider}; +use async_trait::async_trait; +#[cfg(feature = "home_dir")] +use dirs::home_dir; +use std::any::Any; +use std::sync::Arc; + +/// Wrap another catalog provider list +pub struct DynamicFileCatalog { Review Comment: I wonder if it would make more sense to call this DynamicFileCatalog with the idea that eventually it would support arbitrary urls, not just local files? ########## datafusion/sqllogictest/src/test_context.rs: ########## @@ -91,13 +91,18 @@ impl TestContext { { info!("Registering avro tables"); register_avro_tables(&mut test_ctx).await; + test_ctx.ctx = test_ctx.ctx.enable_url_table(); } #[cfg(not(feature = "avro"))] { info!("Skipping {file_name} because avro feature is not enabled"); return None; } } + "describe.slt" | "arrow_files.slt" | "csv_files.slt" | "json.slt" Review Comment: I think it would help to point out some rationale ```suggestion // For (only) these files, enable the dynamic file catalog "describe.slt" | "arrow_files.slt" | "csv_files.slt" | "json.slt" ``` ########## README.md: ########## @@ -91,6 +91,7 @@ Optional features: - `backtrace`: include backtrace information in error messages - `pyarrow`: conversions between PyArrow and DataFusion types - `serde`: enable arrow-schema's `serde` feature +- `home_dir` : enable support for substituting the tilde character in the file path with the user home directory for the URL table Review Comment: I think an optional feature makes sense Another alternative would be to avoid the dependency entirely. by leaving tilde expansion in datafusion-cli and an API in the core to call it ########## datafusion-cli/src/catalog.rs: ########## @@ -115,13 +114,14 @@ impl CatalogProvider for DynamicFileCatalogProvider { } } -/// Wraps another schema provider -struct DynamicFileSchemaProvider { +/// Wraps another schema provider. [DynamicObjectStoreSchemaProvider] is responsible for registering the required Review Comment: I am trying to understand why datafusion-cli can't simply use `DynamicFileCatalog` -- why does it need another layer of wrapping? Is it the need to dynamically create `ObjectStore`s? I have found that dynamic creation to be one of the more complicated things about datafusion-cli (and what users of DataFusion would have to figure out to recreate it). Maybe we can figure out some simpler API for that too (as another PR) ########## datafusion/core/src/datasource/dynamic_file.rs: ########## @@ -0,0 +1,79 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! dynamic_file_schema contains an [`UrlTableFactory`] implementation that +//! can create a [`ListingTable`] from the given url. + +use std::sync::Arc; + +use async_trait::async_trait; +use datafusion_catalog::{SessionStore, UrlTableFactory}; +use datafusion_common::plan_datafusion_err; + +use crate::datasource::listing::{ListingTable, ListingTableConfig, ListingTableUrl}; +use crate::datasource::TableProvider; +use crate::error::Result; +use crate::execution::context::SessionState; + +/// [DynamicListTableFactory] is a factory that can create a [ListingTable] from the given url. +#[derive(Default)] +pub struct DynamicListTableFactory { + /// The session store that contains the current session. + session_store: Arc<SessionStore>, +} + +impl DynamicListTableFactory { + /// Create a new [DynamicListTableFactory] with the given state store. + pub fn new(session_store: Arc<SessionStore>) -> Self { + Self { session_store } + } + + fn session_store(&self) -> Arc<SessionStore> { + Arc::clone(&self.session_store) + } +} + +#[async_trait] +impl UrlTableFactory for DynamicListTableFactory { + async fn try_new(&self, url: &str) -> Result<Option<Arc<dyn TableProvider>>> { + let Ok(table_url) = ListingTableUrl::parse(url) else { + return Ok(None); + }; + + let state = &self + .session_store() + .get_session() + .upgrade() + .and_then(|session| { + session + .read() + .as_any() + .downcast_ref::<SessionState>() Review Comment: Yeah I agree another PR might make more sense. Perhaps since SessionStore may not be needed longer term, we could simply have the field directly hold the session ```rust session: Arc<Mutex<Option<Weak<RwLock<dyn Session>>>>>; ``` ########## datafusion/sqllogictest/test_files/arrow_files.slt: ########## @@ -43,6 +43,14 @@ SELECT * FROM arrow_simple 3 baz false 4 NULL true +query ITB Review Comment: I recommend creating a new `.slt` file explicitly for dynamic file catalog tests rather than extending the existing tests. This would: 1. Help show that the behavior isn't changed except when dynamic catalog is enabled 2. make it easier to evaluate how much dynamic file catalog testing there was Also, is there sufficient negative testing? Specifically that the dynamic catalog provider isn't enabled by default (which would be a security hole?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
