Hello, For Arrow Datasets, I've been working to instrument the scanner to find bottlenecks. For example, here's a demo comparing the current async scanner, which doesn't truly read asynchronously, to one that does; it should be fairly evident where the bottleneck is: https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
I'd like to upstream this, but I'd like to run some questions by everyone first: - Does this look useful to developers working on other sub-projects? - This uses OpenTelemetry[1], which is still in alpha, so are we comfortable with adopting it? Is the overhead acceptable? - Is there anyone using Arrow to build services, that would find more general integration useful? How it works: OpenTelemetry[1] is used to annotate and record a "span" for operations like reading a single record batch. The data is saved as JSON, then rendered by some JavaScript. The branch is at [2]. As a quick summary, OpenTelemetry implements distributed tracing, in which a request is tracked as a directed acyclic graph of spans. A span is just metadata (name, ID, start/end time, parent span, ...) about an operation (function call, network request, ...). Typically, it's used in services. Spans can reference each other across machines, so you can track a request across multiple services (e.g. finding which service failed/is unusually slow in a chain of services that call each other). As opposed to a (sampling) profiler, this gives you application-level metadata, like filenames or S3 download rates, that you can use in analysis (as in the demo). It's also something you'd always keep turned on (at least when running a service). If integrated with Flight, OpenTelemetry would also give us a performance picture across multiple machines - speculatively, something like making a request to a Flight service and being able to trace all the requests it makes to S3. It does have some overhead; you wouldn't annotate every function in a codebase. This is rather anecdotal, but for the demo above, there was essentially zero impact on runtime. Of course, that demo records very little data overall, so it's not very representative. Alternatives: - Add a simple Span class of our own, and defer Flight until later. - Integrate OpenTelemetry in such a way that it gets compiled out if not enabled at build time. This would be messier but should alleviate any performance questions. - Use something like Perfetto[3] or LLVM XRay[4]. They have their own caveats (e.g. XRay is LLVM-specific) and aren't intended for the multi-machine use case, but would otherwise work. I haven't looked into these much, but could evaluate them, especially if they seem more fit for purpose for use in other Arrow subprojects. If people aren't super enthused, I'll most likely go with adding a custom Span class for Datasets, and defer the question of whether we should integrate Flight/Datasets with OpenTelemetry until another use case arises. But recently we have seen interest in this - so I see this as perhaps a chance to take care of two problems at once. Thanks, David [1]: https://opentelemetry.io/ [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry [3]: https://perfetto.dev/ [4]: https://llvm.org/docs/XRay.html
