On Wed, Dec 6, 2023 at 7:45 PM Ian Cook <ianmc...@apache.org> wrote:
> I am interested to hear more perspectives on this. My perspective is
> that we should recommend using HTTP conventions to keep clean
> separation between the Arrow-formatted binary data payloads and the
> various application-specific fields. This can be achieved by encoding
> application-specific fields in URI paths, query parameters, headers,
> or separate parts of multipart/form-data messages.

Submitting big binary data in POST messages via multipart/form-data is usually not very performant: in theory the boundary of the message has to be chosen by verifying that it does not collide with the content of the data itself, which for huge files means traversing the whole file in search of bytes matching the boundary. Many implementations are optimistic, relying on the fact that a long enough randomly generated boundary is very unlikely to be contained in the message, but this is not guaranteed to be true, and I would refrain from suggesting an approach that, even if the chance is remote, can be slow or simply not work (see the boundary-scan sketch at the end of this message).

Also, most HTTP servers implement a maximum request time to reduce the risk of exhausting the available connections with broken (or malicious) clients that keep a connection open for too long, so uploading a 1 GB file in a single POST is at serious risk of failing in most deployments. There is also the issue that multipart/form-data effectively imposes a maximum transferred data size: the content of uploaded files is frequently saved to a temporary file by the HTTP server before it gets forwarded to the server-side application, opening the system up to an out-of-disk error if a client uploads too much data and no limit is configured.

So I would suggest that any recommended approach to submit Arrow data via HTTP rely on Content-Range and chunked uploads to transmit the data, thus reducing the risk of timeouts or size limits, and allowing a single chunk to be simply resent when one of those occurs (a sketch of that pattern is also below).
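
To make the boundary issue concrete, here is a minimal Python sketch (names are mine, purely illustrative) of what a strictly conforming multipart encoder would have to do. Every candidate boundary requires a full scan of the payload; the optimistic implementations I mentioned simply skip that scan:

    import secrets

    def choose_boundary(payload: bytes) -> bytes:
        # A strictly correct multipart encoder must verify that the
        # chosen boundary never occurs inside the body. Optimistic
        # implementations skip this and trust that a long random
        # boundary won't collide with the data.
        while True:
            candidate = b"----arrow-" + secrets.token_hex(16).encode("ascii")
            if candidate not in payload:  # O(n) scan over the whole payload
                return candidate

For a multi-gigabyte body, that membership test is the cost I'm worried about; skipping it is what introduces the (remote) chance of a corrupted message.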
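And here is a rough sketch of the chunked-upload pattern I have in mind, using Content-Range on PUT requests in the style of resumable-upload protocols. The endpoint, chunk size, retry policy, and the Arrow stream media type are all assumptions for illustration, not a concrete proposal:

    import os
    import requests  # third-party HTTP client: pip install requests

    CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per request; an arbitrary choice
    MAX_RETRIES = 3

    def upload_in_chunks(url: str, path: str) -> None:
        total = os.path.getsize(path)
        with open(path, "rb") as f:
            offset = 0
            while offset < total:
                chunk = f.read(CHUNK_SIZE)
                end = offset + len(chunk) - 1
                headers = {
                    # Hypothetical media type for an Arrow IPC stream.
                    "Content-Type": "application/vnd.apache.arrow.stream",
                    # Tell the server which byte range this request carries.
                    "Content-Range": f"bytes {offset}-{end}/{total}",
                }
                for _ in range(MAX_RETRIES):
                    resp = requests.put(url, data=chunk, headers=headers)
                    if resp.ok:
                        break  # only this chunk is retried, never the whole file
                else:
                    raise RuntimeError(f"chunk at offset {offset} failed after retries")
                offset = end + 1

The point is that each request stays small and short-lived, so per-request timeouts and body-size limits stop being a function of the total dataset size, and a failure costs one chunk rather than the whole transfer.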