Hey Rahil and Jack,

Thanks for bringing this up. Ryan and I also discussed this briefly in the
early days of PyIceberg and it would have helped a lot in the speed of
development. We went for the traditional approach because that would also
support all the other catalogs, but now that the REST catalog is taking
off, I think it still makes a lot of sense to get it in.

I do share the concern raised by Ryan around the concepts of shards and
pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live
in a single process today, the concept of shards doesn't add value. I
see your concern with long-running jobs, but for the non-distributed cases,
it will only add complexity.

Some suggestions that come to mind:

   - Stream the tasks directly back using a chunked response, reducing the
   latency to the first task. This would also address the pagination
   concern. The only downside I can think of is delete files: you first
   need to check whether there are deletes relevant to the task, which
   might increase the latency to the first task.
   - Make the sharding optional. If you want to shard, you call the
   CreateScan first and then call the GetScanTask with the IDs. If you don't
   want to shard, you omit the shard parameter and fetch the tasks directly
   (here we also need to replace the scan string with the full
   column/expression/snapshot-id, etc.).
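
As a rough sketch of how the two suggestions could fit together (every
name below is hypothetical and not taken from the proposal, the POC, or
PyIceberg; a Python generator stands in for the chunked HTTP response,
and the manifests are toy in-memory data):

```python
import json
from typing import Dict, Iterator, List, Optional


class FakeScanService:
    """In-memory stand-in for the REST service (hypothetical, for illustration)."""

    def __init__(self, manifests: Dict[str, List[str]], shard_size: int = 2):
        self._manifests = manifests  # manifest name -> data files it lists
        self._shard_size = shard_size
        self._shards: Dict[str, List[str]] = {}  # state is only needed when sharding

    def create_scan(self) -> List[str]:
        """Optional step: split the manifests into shards and return the shard IDs."""
        self._shards = {}
        names = sorted(self._manifests)
        for i in range(0, len(names), self._shard_size):
            self._shards[f"shard-{len(self._shards)}"] = names[i:i + self._shard_size]
        return list(self._shards)

    def get_scan_tasks(self, shard_id: Optional[str] = None) -> Iterator[str]:
        """Yield one JSON line per task; omit shard_id to plan the whole table.

        The generator stands in for a chunked response: the first line is
        available before the remaining manifests have been read.
        """
        names = sorted(self._manifests) if shard_id is None else self._shards[shard_id]
        for name in names:
            for data_file in self._manifests[name]:
                yield json.dumps({"manifest": name, "data-file": data_file})


def data_files(lines: Iterator[str]) -> List[str]:
    """Client side: decode the streamed lines into data file paths."""
    return [json.loads(line)["data-file"] for line in lines]


service = FakeScanService({
    "m1.avro": ["a.parquet", "b.parquet"],
    "m2.avro": ["c.parquet"],
    "m3.avro": ["d.parquet", "e.parquet"],
})

# Single-process client: no CreateScan call, no shard parameter.
direct = data_files(service.get_scan_tasks())

# Distributed client: CreateScan first, then one request per shard ID.
sharded = [f for shard_id in service.create_scan()
           for f in data_files(service.get_scan_tasks(shard_id))]
```

Both paths end up with the same tasks; the single-process client just
skips the extra round trip and never has to know about shards.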

Looking forward to discussing this tomorrow in the community sync
<https://iceberg.apache.org/community/#iceberg-community-events>!

Kind regards,
Fokko



On Mon, Dec 11, 2023 at 7:05 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi Ryan, thanks for the feedback!
>
> I was a part of this design discussion internally and can provide more
> details. One reason for separating the CreateScan operation was to make the
> API asynchronous and thus keep HTTP communications short. Consider the case
> where we only have the GetScanTasks API, and there is no shard specified. It
> might take tens of seconds, or even minutes to read through all the
> manifest list and manifests before being able to return anything. This
> means the HTTP connection has to remain open during that period, which is
> not really a good practice in general (consider connection failure, load
> balancer and proxy load, etc.). And when we shift the API to asynchronous,
> it basically becomes something like the proposal, where a stateful ID is
> generated so that the service can return to the client immediately, and the
> client gets results by referencing the ID. So in our current prototype
> implementation we are actually keeping this ID and the whole REST service
> is stateful.
>
> We had some thoughts about the possibility of defining a "shard
> ID generator" protocol: basically the client agrees with the service on a way
> to deterministically generate shard IDs, and the service uses it to create
> shards. That sounds like what you are suggesting here, and it pushes the
> responsibility to the client side to determine the parallelism. But in some
> bad cases (e.g. there are many delete files and we need to read all those
> in each shard to apply filters), it seems like there might still be the
> long open connection issue above. What is your thought on that?
>
> -Jack
>
> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Rahil, thanks for working on this. It has some really good ideas that we
>> hadn't considered before like a way for the service to plan how to break up
>> the work of scan planning. I really like that idea because it makes it much
>> easier for the service to keep memory consumption low across requests.
>>
>> My primary feedback is that I think it's a little too complicated (with
>> both sharding and pagination) and could be modified slightly so that the
>> service doesn't need to be stateful. If the service isn't necessarily
>> stateful then it should be easier to build implementations.
>>
>> To make it possible for the service to be stateless, I'm proposing that
>> rather than creating shard IDs that are tracked by the service, the
>> information for a shard can be sent to the client. My assumption here is
>> that most implementations would create shards by reading the manifest list,
>> filtering on partition ranges, and creating a shard for some reasonable
>> size of manifest content. For example, if a table has 100 MB of metadata in
>> 25 manifests that are about 4 MB each, then it might create 9 shards with
>> 1-4 manifests each. The service could send those shards to the client as a
>> list of manifests to read and the client could send the shard information
>> back to the service to get the data files in each shard (along with the
>> original filter).
>>
>> There's a slight trade-off that the protocol needs to define how to break
>> the work into shards. I'm interested in hearing if that would work with how
>> you were planning on building the service on your end. Another option is to
>> let the service send back arbitrary JSON that would get returned for each
>> shard. Either way, I like that this would make it so the service doesn't
>> need to persist anything. We could also make it so that small tables don't
>> require multiple requests. For example, a client could call the route to
>> get file tasks with just a filter.
>>
>> What do you think?
>>
>> Ryan
>>
>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil
>> <rcher...@amazon.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at Amazon
>>> EMR and Athena. I’m reaching out to share a proposal for a new Scan API
>>> that will be utilized by the RESTCatalog. The process for table scan
>>> planning is currently done within client engines such as Apache Spark. By
>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg
>>> table scans with external services, which can lead to several benefits.
>>>
>>> For example, we can leverage caching and indexes on the server side to
>>> improve planning performance. Furthermore, by moving this scan logic to the
>>> RESTCatalog, non-JVM engines can integrate more easily. All of this can be
>>> found in the detailed proposal below. Feel free to comment and add your
>>> suggestions.
>>>
>>> Detailed proposal:
>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>
>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>
>>> Regards,
>>>
>>> Rahil Chertara
>>> Amazon EMR & Athena
>>> rcher...@amazon.com
>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
