Hi Everyone,

The feedback was generally positive on the discussion thread [1] so I'd
like to start a formal vote for merging HDFS-8707 (libhdfs++) into trunk.
The vote will be open for 7 days and end at 6 PM EST on 3/15/18.

This branch includes a C++ implementation of an HDFS client for use in
applications that don't run an in-process JVM.  Right now the branch only
supports reads and metadata calls.

Features (paraphrasing the list from the discussion thread):
-Avoiding the JVM means applications that use libhdfs++ can explicitly
control resources (memory, FDs, threads).  The driving goal for this
project was to let C/C++ applications access HDFS while maintaining a
single heap.
-Includes support for Kerberos authentication.
-Includes a libhdfs/libhdfs3 compatible C API as well as a C++ API that
supports asynchronous operations.  Applications that only do reads may be
able to use this as a drop-in replacement for libhdfs (see the sketch after
this list).
-Asynchronous IO is built on top of boost::asio, which in turn uses
select/epoll, so many sockets can be monitored from a single thread (or
thread pool) rather than spawning a thread to sleep on each blocked socket.
-Includes a set of utilities written in C++ that mirror the CLI tools (e.g.
./hdfs dfs -ls).  These start up roughly three orders of magnitude faster
than the Java client, which is useful for scripts that need to work with
many files.
-Support for cancelable reads that release associated resources
immediately.  Useful for applications that need to be responsive to
interactive users.
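To give a feel for the drop-in compatibility mentioned above, here's a
minimal read-only sketch written against the standard libhdfs C API
(hdfs.h); the file path, connection parameters, and include path are just
illustrative assumptions, but the same source should build and run whether
it's linked against libhdfs or libhdfs++:

  #include <fcntl.h>     /* O_RDONLY */
  #include <stdio.h>
  #include "hdfs/hdfs.h" /* standard libhdfs C API header; exact include path depends on the install */

  int main(void) {
    /* "default" resolves the NameNode from fs.defaultFS in the loaded configuration. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Open read-only; the branch currently supports reads and metadata calls only. */
    hdfsFile file = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
    if (!file) { fprintf(stderr, "open failed\n"); hdfsDisconnect(fs); return 1; }

    char buf[4096];
    tSize n = hdfsRead(fs, file, buf, sizeof(buf));
    if (n >= 0) printf("read %d bytes\n", (int)n);

    hdfsCloseFile(fs, file);
    hdfsDisconnect(fs);
    return 0;
  }

Applications that also need writes would still have to link against the
JVM-based libhdfs for now, since the branch only supports reads and
metadata calls.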

Other points:
-This is almost all new code in a new subdirectory.  No Java source in the
rest of Hadoop was changed, so there's no risk of regressions there.  The
only changes outside of that subdirectory were integrating the build into
some of the pom files and adding a couple of dependencies to the Dockerfile.
-The library has had plenty of burn-in time.  It's been used in production
for well over a year and is indirectly being distributed as part of the
Apache ORC project (in the form of a third-party dependency).
-There isn't much in the way of well-formatted documentation right now.
The documentation for the libhdfs API is applicable to the libhdfs++ C API.
Header files describe the various components, including details about
threading and lifecycle expectations for important objects.  Good places to
start are hdfspp.h, filesystem.h, filehandle.h, rpc_connection.h and
rpc_engine.h.

I'll start with my +1 (binding).

[1]
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201803.mbox/browser
(second message in thread, can't figure out how to link directly to mine)

Thanks!
