goldmedal commented on issue #12394: URL: https://github.com/apache/datafusion/issues/12394#issuecomment-2366246636
> I wonder if we could change ListingTable::infer and anything else that uses `SessionState` directly to take a trait instance `&dyn Session`: https://docs.rs/datafusion/latest/datafusion/catalog/trait.Session.html > > That might then permit avoiding the need for an enture SessionContext 🤔 After some research, I found the main usages of SessionState by `ListTable::infer` are `runtime_env` and `config_options`. If we want to use `dyn Session`, we must move the API `SessionState::get_file_format_factory` to `dyn Session`. It may involve `FileFormat` and `FileFormatFactory`, which this API requires. ```RUst /// Retrieves a [FileFormatFactory] based on file extension which has been registered /// via SessionContext::register_file_format. Extensions are not case sensitive. pub fn get_file_format_factory( &self, ext: &str, ) -> Option<Arc<dyn FileFormatFactory>> { self.file_formats.get(&ext.to_lowercase()).cloned() } ``` The `format.infer_schema` API also needs `SessionState` but I found it only be used by parquet and csv format actually, and then they require config options only. I guess that it can be replaced by `dyn Session` easily. ```rust /// Infer the common schema of the provided objects. The objects will usually /// be analysed up to a given number of records or files (as specified in the /// format config) then give the estimated common schema. This might fail if /// the files have schemas that cannot be merged. async fn infer_schema( &self, state: &SessionState, store: &Arc<dyn ObjectStore>, objects: &[ObjectMeta], ) -> Result<SchemaRef>; ``` ## Runtime_env It's used by `ListTableUrl::list_all_files`: ```rust let list = match self.is_collection() { true => match ctx.runtime_env().cache_manager.get_list_files_cache() { None => store.list(Some(&self.prefix)), Some(cache) => { if let Some(res) = cache.get(&self.prefix) { debug!("Hit list all files cache"); futures::stream::iter(res.as_ref().clone().into_iter().map(Ok)) .boxed() } else { let list_res = store.list(Some(&self.prefix)); let vec = list_res.try_collect::<Vec<ObjectMeta>>().await?; cache.put(&self.prefix, Arc::new(vec.clone())); futures::stream::iter(vec.into_iter().map(Ok)).boxed() } } }, false => futures::stream::once(store.head(&self.prefix)).boxed(), }; ``` # Some Conclusions If we can use `dyn Session` instead, we only need to take the following elements for the implementation: - file_formats - config_option - runtime_env The building of a dynamic catalog would be: ```rust let runtime = Arc::new(RuntimeEnv::default()); // DynamicSession is an implementation of `Session`. let factory = Arc::new(DynamicListTableFactory::new(DynamicSession::new(file_formats, config_options, runtime))); let catalog_list = Arc::new(DynamicFileCatalog::new( Arc::clone(state_ref.catalog_list()), Arc::clone(&factory) as Arc<dyn UrlTableFactory>, )); ``` However, the file_formats and configs will be static. We can't register additional formats or change configs at the runtime. (if we want to change them at the runtime, we need to make them be something like share reference 🤔 ) Then, we can remove `SessionStore` to avoid the issue of registering the current state after building SessionState. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
