On Wed, Dec 6, 2023 at 7:45 PM Ian Cook <ianmc...@apache.org> wrote:
> I am interested to hear more perspectives on this. My perspective is
> that we should recommend using HTTP conventions to keep clean
> separation between the Arrow-formatted binary data payloads and the
> various application-specific fields. This can be achieved by encoding
> application-specific fields in URI paths, query parameters, headers,
> or separate parts of multipart/form-data messages.

Submitting big binary data in POST messages via multipart/form-data is usually not very performant: in theory the boundary of the message has to be chosen by verifying that it does not collide with the content of the data itself, which for huge files means traversing the whole file in search of bytes matching the boundary. Many implementations are optimistic, relying on the fact that a long enough randomly generated boundary is very unlikely to be contained in the message, but this is not guaranteed to be true, and I would refrain from suggesting an approach that, even if the chance is remote, can be slow or simply not work (see the boundary-scan sketch at the end of this message).

Also, most HTTP servers implement a maximum request time to reduce the risk of exhausting the available connections with broken (or malicious) clients that keep a connection open for too long, so uploading a 1 GB file in a single POST is at serious risk of failing in most deployments. There is also the issue that multipart/form-data effectively imposes a maximum transferred data size: the content of uploaded files is frequently saved to a temporary file by the HTTP server before it gets forwarded to the server-side application, opening the system up to an out-of-disk error if a client uploads too much data and no limit is configured.

So I would suggest that any recommended approach to submit Arrow data via HTTP rely on Content-Range and chunked uploads to transmit the data, thus reducing the risk of timeouts or size limits, and allowing a single chunk to be simply resent when one of those occurs (a sketch of that pattern is also below).
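
To make the boundary issue concrete, here is a minimal Python sketch (names are mine, purely illustrative) of what a strictly conforming multipart encoder would have to do. Every candidate boundary requires a full scan of the payload; the optimistic implementations I mentioned simply skip that scan:

    import secrets

    def choose_boundary(payload: bytes) -> bytes:
        # A strictly correct multipart encoder must verify that the
        # chosen boundary never occurs inside the body. Optimistic
        # implementations skip this and trust that a long random
        # boundary won't collide with the data.
        while True:
            candidate = b"----arrow-" + secrets.token_hex(16).encode("ascii")
            if candidate not in payload:  # O(n) scan over the whole payload
                return candidate

For a multi-gigabyte body, that membership test is the cost I'm worried about; skipping it is what introduces the (remote) chance of a corrupted message.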
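And here is a rough sketch of the chunked-upload pattern I have in mind, using Content-Range on PUT requests in the style of resumable-upload protocols. The endpoint, chunk size, retry policy, and the Arrow stream media type are all assumptions for illustration, not a concrete proposal:

    import os
    import requests  # third-party HTTP client: pip install requests

    CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per request; an arbitrary choice
    MAX_RETRIES = 3

    def upload_in_chunks(url: str, path: str) -> None:
        total = os.path.getsize(path)
        with open(path, "rb") as f:
            offset = 0
            while offset < total:
                chunk = f.read(CHUNK_SIZE)
                end = offset + len(chunk) - 1
                headers = {
                    # Hypothetical media type for an Arrow IPC stream.
                    "Content-Type": "application/vnd.apache.arrow.stream",
                    # Tell the server which byte range this request carries.
                    "Content-Range": f"bytes {offset}-{end}/{total}",
                }
                for _ in range(MAX_RETRIES):
                    resp = requests.put(url, data=chunk, headers=headers)
                    if resp.ok:
                        break  # only this chunk is retried, never the whole file
                else:
                    raise RuntimeError(f"chunk at offset {offset} failed after retries")
                offset = end + 1

The point is that each request stays small and short-lived, so per-request timeouts and body-size limits stop being a function of the total dataset size, and a failure costs one chunk rather than the whole transfer.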