Hi Rahil and Jack, thanks for raising this.
My question is: pagination and sharding will make the REST server stateful, i.e. a sequence of calls is required to go to the same server. In that case, how do we ensure the scalability of the REST server?

On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Rahil and Jack,
>
> Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg, and it would have helped a lot with the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST catalog is taking off, I think it still makes a lot of sense to get it in.
>
> I do share the concern raised by Ryan around the concepts of shards and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live in a single process today, the concept of shards doesn't add value. I see your concern with long-running jobs, but for the non-distributed cases it will add additional complexity.
>
> Some suggestions that come to mind:
>
> - Stream the tasks directly back using a chunked response, reducing the latency to the first task. This would also solve the pagination question. The only downside I can think of is delete files, where you first need to make sure there are deletes relevant to the task; this might increase latency to the first task.
> - Make the sharding optional. If you want to shard, you call CreateScan first and then call GetScanTask with the IDs. If you don't want to shard, you omit the shard parameter and fetch the tasks directly (here we also need to replace the scan string with the full column/expression/snapshot-id, etc.).
>
> Looking forward to discussing this tomorrow in the community sync <https://iceberg.apache.org/community/#iceberg-community-events>!
>
> Kind regards,
> Fokko
>
> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>:
>
>> Hi Ryan, thanks for the feedback!
>>
>> I was a part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have the GetScanTasks API and there is no shard specified. It might take tens of seconds, or even minutes, to read through the manifest list and all the manifests before being able to return anything. This means the HTTP connection has to remain open during that period, which is not really a good practice in general (consider connection failures, load balancer and proxy load, etc.). And when we shift the API to asynchronous, it basically becomes something like the proposal, where a stateful ID is generated so the service can return to the client immediately, and the client gets results by referencing the ID. So in our current prototype implementation we are actually keeping this ID, and the whole REST service is stateful.
>>
>> We had some thoughts about the possibility of defining a "shard ID generator" protocol: basically, the client agrees with the service on a way to deterministically generate shard IDs, and the service uses it to create shards. That sounds like what you are suggesting here, and it pushes the responsibility to the client side to determine the parallelism. But in some bad cases (e.g. there are many delete files and we need to read all of those in each shard to apply filters), it seems like there might still be the long open connection issue above. What is your thought on that?
>>
>> -Jack
>>
>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before, like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests.
>>>
>>> My primary feedback is that I think it's a little too complicated (with both sharding and pagination) and could be modified slightly so that the service doesn't need to be stateful. If the service isn't necessarily stateful, then it should be easier to build implementations.
>>>
>>> To make it possible for the service to be stateless, I'm proposing that rather than creating shard IDs that are tracked by the service, the information for a shard can be sent to the client. My assumption here is that most implementations would create shards by reading the manifest list, filtering on partition ranges, and creating a shard for some reasonable size of manifest content. For example, if a table has 100MB of metadata in 25 manifests that are about 4 MB each, then it might create 9 shards with 1-4 manifests each. The service could send those shards to the client as a list of manifests to read, and the client could send the shard information back to the service to get the data files in each shard (along with the original filter).
>>>
>>> There's a slight trade-off that the protocol needs to define how to break the work into shards. I'm interested in hearing if that would work with how you were planning on building the service on your end. Another option is to let the service send back arbitrary JSON that would get returned for each shard. Either way, I like that this would make it so the service doesn't need to persist anything. We could also make it so that small tables don't require multiple requests. For example, a client could call the route to get file tasks with just a filter.
>>>
>>> What do you think?
>>>
>>> Ryan
>>>
>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil <rcher...@amazon.com.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> My name is Rahil Chertara, and I'm part of the Iceberg team at Amazon EMR and Athena. I'm reaching out to share a proposal for a new Scan API that will be utilized by the RESTCatalog. The process of table scan planning is currently done within client engines such as Apache Spark. By moving scan functionality to the RESTCatalog, we can integrate Iceberg table scans with external services, which can lead to several benefits.
>>>>
>>>> For example, we can leverage caching and indexes on the server side to improve planning performance. Furthermore, by moving this scan logic to the RESTCatalog, non-JVM engines can integrate more easily. This can all be found in the detailed proposal below. Feel free to comment and add your suggestions.
>>>>
>>>> Detailed proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>
>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>
>>>> Regards,
>>>>
>>>> Rahil Chertara
>>>> Amazon EMR & Athena
>>>> rcher...@amazon.com
>>>
>>> --
>>> Ryan Blue
>>> Tabular
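
To make the trade-off in this thread concrete, below is a minimal sketch of the asynchronous, stateful flow Jack describes, assuming a hypothetical CreateScan route that immediately returns a server-generated scan ID plus shard IDs, and a paginated task route keyed by that ID. The base URL, route names, and payload fields are illustrative assumptions, not part of the actual proposal or the Iceberg REST spec.

    import requests

    BASE = "https://rest-catalog.example.com/v1/namespaces/db/tables/events"

    # Hypothetical asynchronous flow: CreateScan returns right away with a
    # server-generated scan ID and shard IDs, so the HTTP call stays short
    # while the service plans shards in the background.
    create = requests.post(
        f"{BASE}/scans",
        json={"snapshot-id": 123456789, "filter": "event_date >= '2023-12-01'"},
    )
    create.raise_for_status()
    scan = create.json()
    scan_id = scan["scan-id"]        # opaque server-side state
    shard_ids = scan["shard-ids"]    # e.g. ["shard-0", "shard-1", ...]

    # Tasks for each shard are fetched with page tokens; every call must reach
    # a server that can resolve scan_id.
    tasks = []
    for shard_id in shard_ids:
        page_token = None
        while True:
            params = {"shard-id": shard_id}
            if page_token:
                params["page-token"] = page_token
            resp = requests.get(f"{BASE}/scans/{scan_id}/tasks", params=params)
            resp.raise_for_status()
            page = resp.json()
            tasks.extend(page["file-scan-tasks"])
            page_token = page.get("next-page-token")
            if not page_token:
                break

Because the scan ID only means something to the server that created it, every follow-up call has to be routed to a server that can resolve it, which is exactly the scalability question raised at the top of this thread.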
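
By contrast, here is a sketch of the stateless variant Ryan outlines, where the shard descriptor (here, just a list of manifests) is handed to the client and echoed back together with the original filter, so nothing needs to persist on the service between calls. Again, the routes and field names are assumptions for illustration only.

    import requests

    BASE = "https://rest-catalog.example.com/v1/namespaces/db/tables/events"
    FILTER = "event_date >= '2023-12-01'"

    # One planning call: the service reads the manifest list, groups manifests
    # into shards of a reasonable size, and returns the shard contents.
    plan = requests.post(f"{BASE}/plan", json={"filter": FILTER})
    plan.raise_for_status()
    shards = plan.json()["shards"]   # e.g. [{"manifests": ["m1.avro", "m2.avro"]}, ...]

    # The client echoes each shard descriptor back with the original filter to
    # get the file tasks for that shard; the service keeps no state between
    # calls, so any replica behind a load balancer can answer any request.
    tasks = []
    for shard in shards:
        resp = requests.post(
            f"{BASE}/tasks",
            json={"filter": FILTER, "manifests": shard["manifests"]},
        )
        resp.raise_for_status()
        tasks.extend(resp.json()["file-scan-tasks"])

A small table could skip the planning call entirely and hit the tasks route with just a filter, as Ryan suggests.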
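
Finally, a sketch of Fokko's first suggestion, streaming tasks back over a chunked response. The newline-delimited JSON encoding and the route are again assumed purely for the sake of the example.

    import json
    import requests

    BASE = "https://rest-catalog.example.com/v1/namespaces/db/tables/events"

    # Hypothetical streaming variant: the service emits file scan tasks as
    # newline-delimited JSON while it is still reading manifests, so the client
    # sees the first task quickly and no shard or pagination state is needed.
    with requests.post(
        f"{BASE}/tasks",
        json={"filter": "event_date >= '2023-12-01'"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                task = json.loads(line)
                print(task["data-file"]["file-path"])

The open question with this variant, as Fokko notes, is that relevant delete files may need to be resolved before the first task can be emitted, which pushes latency back up.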