linhr opened a new pull request, #15250: URL: https://github.com/apache/datafusion/pull/15250
## Which issue does this PR close?

N/A

## Rationale for this change

I spent some time looking into #7393. It seems the simple cases can be supported in a few lines of code (e.g. parsing `s3://bucket/key/*.parquet` into a base URL `s3://bucket/key/` and a glob pattern `*.parquet`). But I soon realized there is no clear answer for how the broader cases should be handled.

1. For the `http` or `https` schemes, glob patterns in the URL do not make sense, since the HTTP object store cannot list files under a given directory. Also, the `?` character before the query should not be treated as a glob pattern.
2. Special characters (`*`, `?`, etc.) in the authority (host name, username, or password) probably should not be treated as glob patterns. However, if we want to support glob patterns in URL paths only, we run into the problem that the parsed URL does not provide access to the raw path (<https://github.com/servo/rust-url/issues/602>). So when using the `url` crate, we cannot recover the glob pattern as it appeared in the original URL string, before percent-encoding.
3. Some applications use a different glob syntax. For example, Spark uses `^` instead of `!` for a negated character class, and it also supports alternation (e.g. `{a,b}`), which is not supported by the `glob` crate. I'm not sure we want this kind of capability in DataFusion, since the expected behavior for URL glob patterns may differ between downstream projects.

Given the analysis above, I feel the best workaround for now is to allow applications to construct `ListingTableUrl` directly from a base URL (with no interpretation of glob) and an optional glob pattern. How the base URL and the glob pattern are parsed is determined by the application before constructing `ListingTableUrl`.

## What changes are included in this PR?

`ListingTableUrl::try_new` is now public. I also updated its documentation to explain when this method could be useful.

## Are these changes tested?

N/A

## Are there any user-facing changes?
`ListingTableUrl::try_new` is now public. This change is backward compatible.
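
To make the rationale concrete, here is a minimal sketch of how an application might split a URL string into a base URL and a glob pattern on its own, before handing both to `ListingTableUrl`. The helper `split_glob` is hypothetical and not part of DataFusion. It works on the raw string rather than the `url` crate, so the pattern is taken as written, before any percent-encoding. It assumes only `*`, `?`, and `[` start a glob, skips `http`/`https` URLs entirely, and never looks for glob characters in the authority. A downstream project with Spark-style syntax would need different rules.

```rust
/// Hypothetical helper (not a DataFusion API): split a URL string into a
/// base URL and an optional glob pattern, working on the raw string.
fn split_glob(url: &str) -> (&str, Option<&str>) {
    // Assumption: HTTP stores cannot list directories, and `?` there starts
    // the query string, so such URLs are never treated as globs.
    if url.starts_with("http://") || url.starts_with("https://") {
        return (url, None);
    }

    // Skip "scheme://authority" so that special characters in the host name,
    // username, or password are not mistaken for glob characters.
    let path_start = match url.find("://") {
        Some(i) => {
            let after_scheme = i + 3;
            match url[after_scheme..].find('/') {
                Some(j) => after_scheme + j,
                None => return (url, None), // no path component at all
            }
        }
        None => 0, // no scheme: treat the whole string as a path
    };

    let path = &url[path_start..];
    // Assumption: only these characters introduce a glob pattern.
    match path.find(|c: char| matches!(c, '*' | '?' | '[')) {
        None => (url, None),
        Some(k) => {
            // Cut right after the last '/' that precedes the first glob character.
            let cut = match path[..k].rfind('/') {
                Some(s) => path_start + s + 1,
                None => path_start,
            };
            (&url[..cut], Some(&url[cut..]))
        }
    }
}
```

For `s3://bucket/key/*.parquet` this yields the base `s3://bucket/key/` and the pattern `*.parquet`, matching the simple case described above. A URL with no glob characters in its path comes back unchanged with no pattern.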