goldmedal commented on issue #12394:
URL: https://github.com/apache/datafusion/issues/12394#issuecomment-2366246636

   > I wonder if we could change ListingTable::infer and anything else that 
uses `SessionState` directly to take a trait instance `&dyn Session`: 
https://docs.rs/datafusion/latest/datafusion/catalog/trait.Session.html
   > 
   > That might then permit avoiding the need for an enture SessionContext 🤔
   
   After some research, I found the main usages of SessionState by 
`ListTable::infer` are `runtime_env` and `config_options`. 
   
   If we want to use `dyn Session`, we must move the API 
`SessionState::get_file_format_factory` to `dyn Session`. It may involve 
`FileFormat` and `FileFormatFactory`, which this API requires. 
   
   ```RUst
       /// Retrieves a [FileFormatFactory] based on file extension which has 
been registered
       /// via SessionContext::register_file_format. Extensions are not case 
sensitive.
       pub fn get_file_format_factory(
           &self,
           ext: &str,
       ) -> Option<Arc<dyn FileFormatFactory>> {
           self.file_formats.get(&ext.to_lowercase()).cloned()
       }
   ```
   
   
   The `format.infer_schema` API also needs `SessionState` but I found it only 
be used by parquet and csv format actually, and then they require config 
options only. I guess that it can be replaced by `dyn Session` easily. 
   
   ```rust
       /// Infer the common schema of the provided objects. The objects will 
usually
       /// be analysed up to a given number of records or files (as specified 
in the
       /// format config) then give the estimated common schema. This might 
fail if
       /// the files have schemas that cannot be merged.
       async fn infer_schema(
           &self,
           state: &SessionState,
           store: &Arc<dyn ObjectStore>,
           objects: &[ObjectMeta],
       ) -> Result<SchemaRef>;
   ```
   
   ## Runtime_env
   It's used by `ListTableUrl::list_all_files`:
   ```rust
           let list = match self.is_collection() {
               true => match 
ctx.runtime_env().cache_manager.get_list_files_cache() {
                   None => store.list(Some(&self.prefix)),
                   Some(cache) => {
                       if let Some(res) = cache.get(&self.prefix) {
                           debug!("Hit list all files cache");
                           
futures::stream::iter(res.as_ref().clone().into_iter().map(Ok))
                               .boxed()
                       } else {
                           let list_res = store.list(Some(&self.prefix));
                           let vec = 
list_res.try_collect::<Vec<ObjectMeta>>().await?;
                           cache.put(&self.prefix, Arc::new(vec.clone()));
                           
futures::stream::iter(vec.into_iter().map(Ok)).boxed()
                       }
                   }
               },
               false => futures::stream::once(store.head(&self.prefix)).boxed(),
           };
   ```
   
   # Some Conclusions
   If we can use `dyn Session` instead, we only need to take the following 
elements for the implementation:
   - file_formats
   - config_option
   - runtime_env
   
   The building of a dynamic catalog would be:
   ```rust
           let runtime = Arc::new(RuntimeEnv::default());
           // DynamicSession is an implementation of `Session`.
           let factory = 
Arc::new(DynamicListTableFactory::new(DynamicSession::new(file_formats, 
config_options, runtime)));
           let catalog_list = Arc::new(DynamicFileCatalog::new(
               Arc::clone(state_ref.catalog_list()),
               Arc::clone(&factory) as Arc<dyn UrlTableFactory>,
           ));
   ```
   However, the file_formats and configs will be static. We can't register 
additional formats or change configs at the runtime. (if we want to change them 
at the runtime, we need to make them be something like share reference 🤔 ) 
Then, we can remove `SessionStore` to avoid the issue of registering the 
current state after building SessionState. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to