Hi everyone,

The feedback was generally positive on the discussion thread [1], so I'd like to start a formal vote for merging HDFS-8707 (libhdfs++) into trunk. The vote will be open for 7 days and end at 6 PM EST on 3/15/18.
This branch includes a C++ implementation of an HDFS client for use in applications that don't run an in-process JVM. Right now the branch only supports reads and metadata calls.

Features (paraphrasing the list from the discussion thread):

-Avoiding the JVM means applications that use libhdfs++ can explicitly control resources (memory, FDs, threads). The driving goal for this project was to let C/C++ applications access HDFS while maintaining a single heap.

-Includes support for Kerberos authentication.

-Includes a libhdfs/libhdfs3-compatible C API as well as a C++ API that supports asynchronous operations. Applications that only do reads may be able to use this as a drop-in replacement for libhdfs (a minimal usage sketch is in the P.S. at the end of this mail).

-Asynchronous IO is built on top of boost::asio, which in turn uses select/epoll, so many sockets can be monitored from a single thread (or thread pool) rather than spawning a thread to sleep on each blocked socket.

-Includes a set of utilities written in C++ that mirror the CLI tools (e.g. ./hdfs dfs -ls). These have roughly three orders of magnitude lower startup time than the Java client, which is useful for scripts that need to work with many files.

-Support for cancelable reads that release their associated resources immediately. Useful for applications that need to be responsive to interactive users.

Other points:

-This is almost all new code in a new subdirectory. No Java source for the rest of Hadoop was changed, so there's no risk of regressions there. The only changes outside of that subdirectory were integrating the build in some of the pom files and adding a couple of dependencies to the Dockerfile.

-The library has had plenty of burn-in time. It's been used in production for well over a year and is indirectly being distributed as part of the Apache ORC project (in the form of a third-party dependency).

-There isn't much in the way of well-formatted documentation right now. The documentation for the libhdfs API is applicable to the libhdfs++ C API. Header files describe the various components, including details about threading and lifecycle expectations for important objects. Good places to start are hdfspp.h, filesystem.h, filehandle.h, rpc_connection.h and rpc_engine.h.

I'll start with my +1 (binding).

[1] http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201803.mbox/browser (second message in thread, can't figure out how to link directly to mine)

Thanks!
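P.S. For anyone who wants a feel for the "drop-in" claim above: the sketch below is just the standard libhdfs read pattern from hdfs.h, which is the shape of API the libhdfs++ compatibility layer is meant to accept unchanged. It assumes the compat layer exposes the usual hdfs.h-style header; the cluster address and file path are placeholders and error handling is minimal.

    #include <fcntl.h>
    #include <stdio.h>
    #include "hdfs.h"   /* same header/API shape as libhdfs */

    int main(void) {
        /* "default" picks up fs.defaultFS from the loaded configuration */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) { fprintf(stderr, "connect failed\n"); return 1; }

        /* placeholder path; open read-only */
        hdfsFile file = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
        if (!file) { fprintf(stderr, "open failed\n"); hdfsDisconnect(fs); return 1; }

        char buf[4096];
        tSize n = hdfsRead(fs, file, buf, sizeof(buf));
        printf("read %d bytes\n", (int)n);

        hdfsCloseFile(fs, file);
        hdfsDisconnect(fs);
        return 0;
    }

Code that only reads and stats files in this style should be able to relink against the libhdfs++ compat library without source changes; anything that writes still needs libhdfs for now.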