linhr opened a new pull request, #15250:
URL: https://github.com/apache/datafusion/pull/15250

   ## Which issue does this PR close?
   
   N/A
   
   ## Rationale for this change
   
   I spent some time looking into #7393. It seems the simple cases can be 
supported in a few lines of code (e.g. parsing `s3://bucket/key/*.parquet` into 
a base URL `s3://bucket/key/` and a glob pattern `*.parquet`). But I soon 
realized that it is unclear how the broader cases should be handled:
   1. For the `http` and `https` schemes, glob patterns in the URL do not make 
sense, since the HTTP object store cannot list files under a given directory. 
Also, the `?` character that introduces the query string should not be treated 
as a glob pattern.
   2. Special characters (`*`, `?`, etc.) in the authority (host name, 
username, or password) probably should not be treated as glob patterns. 
However, if we want to support glob patterns in URL paths only, we run into the 
problem that the parsed URL does not provide access to the raw path 
(<https://github.com/servo/rust-url/issues/602>). So when using the `url` 
crate, we cannot recover the glob pattern as it appeared in the original URL 
string, before percent-encoding.
   3. Some applications use a different glob syntax. For example, Spark uses 
`^` instead of `!` for negated character classes, and it also supports 
alternation (e.g. `{a,b}`), which is not supported by the `glob` crate. I'm not 
sure if we want these capabilities in DataFusion, since the expected behavior 
for URL glob patterns may differ across downstream projects.
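
   To illustrate point 3, here is a minimal sketch of how a downstream 
application might translate a Spark-style pattern into patterns the `glob` 
crate accepts. The function names are hypothetical and nothing here is part of 
DataFusion. It assumes braces and `[^` never appear as literals inside a 
character class, which a real implementation would need to handle.

```rust
/// Rewrites Spark-style negated classes (`[^...]`) to the `[!...]` form.
/// Simplification: any literal "[^" sequence is rewritten as well.
fn negation_to_bang(pattern: &str) -> String {
    pattern.replace("[^", "[!")
}

/// Expands `{a,b}` alternations (nested ones included) into one pattern per
/// alternative. A pattern with an unbalanced `{` is returned unchanged.
fn expand_braces(pattern: &str) -> Vec<String> {
    let Some(open) = pattern.find('{') else {
        return vec![pattern.to_string()];
    };
    // Find the matching `}` and the commas that sit directly inside this group.
    let mut depth = 0usize;
    let mut close = None;
    let mut commas = Vec::new();
    for (i, c) in pattern[open..].char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    close = Some(open + i);
                    break;
                }
            }
            ',' if depth == 1 => commas.push(open + i),
            _ => {}
        }
    }
    let Some(close) = close else {
        return vec![pattern.to_string()];
    };
    let prefix = &pattern[..open];
    let suffix = &pattern[close + 1..];
    // Alternatives lie between consecutive boundaries: `{`, each comma, `}`.
    let mut bounds = vec![open];
    bounds.extend(commas);
    bounds.push(close);
    let mut out = Vec::new();
    for w in bounds.windows(2) {
        let alt = &pattern[w[0] + 1..w[1]];
        // Recurse so that nested or later groups are expanded too.
        out.extend(expand_braces(&format!("{prefix}{alt}{suffix}")));
    }
    out
}

/// Hypothetical entry point: one Spark-style pattern in, `glob`-style patterns out.
fn spark_to_glob_patterns(pattern: &str) -> Vec<String> {
    expand_braces(&negation_to_bang(pattern))
}
```

   Since one Spark pattern can expand to several `glob` patterns, an 
application taking this route would need either several listing URLs or its 
own matcher, which is part of why the expected behavior is hard to settle in 
DataFusion itself.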
   
   Given the analysis above, I feel the best workaround for now is to allow 
applications to construct `ListingTableUrl` directly from a base URL (with no 
interpretation of glob) and an optional glob pattern. How the base URL and the 
glob pattern are parsed is determined by the application before constructing 
`ListingTableUrl`.
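
   As a rough sketch of what such application-side parsing could look like, 
the function below splits a raw URL string into a base URL and an optional 
glob pattern using only the standard library. The function name and the policy 
it encodes (no globs for `http`/`https`, metacharacters in the authority 
ignored, split at the last `/` before the first metacharacter) are assumptions 
for illustration, not DataFusion behavior.

```rust
/// Characters treated as glob metacharacters in this sketch.
const GLOB_CHARS: [char; 3] = ['*', '?', '['];

/// Splits `input` into a base URL and an optional glob pattern.
/// Hypothetical helper: the policy below is one possible choice, not DataFusion's.
fn split_base_and_glob(input: &str) -> (String, Option<String>) {
    let no_glob = (input.to_string(), None);

    // Separate the scheme from the rest; give up on anything without "://".
    let Some((scheme, rest)) = input.split_once("://") else {
        return no_glob;
    };
    // HTTP stores cannot list directories, and `?` starts the query string
    // there, so these URLs are never treated as globs.
    if scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https") {
        return no_glob;
    }
    // The path starts at the first `/` after the authority. Metacharacters in
    // the authority are therefore never inspected.
    let Some(slash) = rest.find('/') else {
        return no_glob;
    };
    let path_start = scheme.len() + "://".len() + slash;
    let path = &input[path_start..];

    match path.find(|c: char| GLOB_CHARS.contains(&c)) {
        None => no_glob,
        Some(first_meta) => {
            // Cut right after the last `/` that precedes the first metacharacter.
            // The path begins with `/`, so such a `/` always exists.
            let cut = path[..first_meta].rfind('/').map(|j| j + 1).unwrap_or(0);
            let split = path_start + cut;
            (input[..split].to_string(), Some(input[split..].to_string()))
        }
    }
}
```

   With this policy, `s3://bucket/key/*.parquet` yields the base 
`s3://bucket/key/` and the glob `*.parquet`. The application could then parse 
the two parts with `url::Url::parse` and `glob::Pattern::new` and hand the 
results to `ListingTableUrl::try_new`. Because the split happens on the raw 
string, the glob never passes through the URL parser's percent-encoding.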
   
   ## What changes are included in this PR?
   
   `ListingTableUrl::try_new` is now public. I also updated its documentation 
to explain when this method could be useful.
   
   ## Are these changes tested?
   
   N/A
   
   ## Are there any user-facing changes?
   
   `ListingTableUrl::try_new` is now public. This change is backward compatible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org