This would definitely be a great addition. Kudos to everyone for their contributions.
I am not a C++ expert, so I cannot vote on the code.

---A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a drop-in replacement for clients that only need read support (until libhdfs++ also supports writes).

Wouldn't it be nice to have write support as well before the merge? If everyone feels it's okay to have read support alone for now, I am okay with that.

On 1 Mar 2018 11:35 pm, "Jim Clampffer" <james.clampf...@gmail.com> wrote:

> Thanks for the feedback Chris and Kai!
>
> Chris, do you mean potentially landing this in its current state and handling some of the rough edges after? I could see this working just because there's no impact on any existing code.
>
> With regards to your questions Kai:
> There isn't a good doc for the internal architecture yet; I just reassigned HDFS-9115 to myself to handle that. Are there any specific areas you'd like to know about so I can prioritize those?
> Here are some header files that include a lot of comments that should help out for now:
> -hdfspp.h - main header for the C++ API
> -filesystem.h and filehandle.h - describe some rules about object lifetimes and threading from the API point of view (most classes have comments describing any restrictions on threading, locking, and lifecycle).
> -rpc_engine.h and rpc_connection.h begin getting into the async RPC implementation.
>
> 1) Yes, it's a reimplementation of the entire client in C++. Using libhdfs3 as a reference helps a lot here but it's still a lot of work.
> 2) EC isn't supported now, though that'd be great to have, and I agree that it's going to take a lot of effort to implement. Right now if you tried to read an EC file I think you'd get some unhelpful error out of the block reader, but I don't have an EC enabled cluster set up to test. Adding an explicit "not supported" message would be straightforward.
> 3) libhdfs++ reuses all of the minidfscluster tests that libhdfs already had so we get consistency checks on the C API. There are a few new tests that also get run on both libhdfs and libhdfs++ and make sure the expected output is the same too.
> 4) I agree, I just haven't had a chance to look into the distribution build to see how to do it. HDFS-9465 is tracking this.
> 5) Not yet (HDFS-8765).
>
> Regards,
> James
>
> On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai...@alibaba-inc.com> wrote:
>
> > The work sounds solid and great! + to have this.
> >
> > Is there any quick doc to take a glance at? Some quick questions to get familiar with it:
> > 1. It seems the client is implemented entirely in C++ without any Java code (so no JVM overhead), which means a lot of work rewriting the HDFS client. Right?
> > 2. I guess the erasure coding feature isn't supported, as it'd involve significant development, right? If so, what will it say when reading an erasure coded file?
> > 3. Is there any building/testing mechanism to enforce consistency between the C++ part and the Java part?
> > 4. I thought the public header and lib should be exported when building the distribution package, otherwise it is hard to use the new C API.
> > 5. Is short-circuit read supported?
> >
> > Thanks.
> >
> > Regards,
> > Kai
> >
> > ------------------------------------------------------------------
> > From: Chris Douglas <cdoug...@apache.org>
> > Sent: Thursday, March 1, 2018 05:08
> > To: Jim Clampffer <james.clampf...@gmail.com>
> > Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
> > Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
> >
> > +1
> >
> > Let's get this done. We've had many false starts on a native HDFS client. This is a good base to build on. -C
> >
> > On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer <james.clampf...@gmail.com> wrote:
> > > Hi everyone,
> > >
> > > I'd like to start a thread to discuss merging the HDFS-8707 branch, aka libhdfs++, into trunk. I originally sent a similar email out last October but it sounds like it was buried by discussions about other feature merges that were going on at the time.
> > >
> > > libhdfs++ is an HDFS client written in C++ designed to be used in applications that are written in non-JVM based languages. In its current state it supports kerberos authenticated reads from HDFS and has been used in production clusters for over a year so it has had a significant amount of burn-in time. The HDFS-8707 branch has been around for about 2 years now so I'd like to know people's thoughts on what it would take to merge the current branch and handle writes and encrypted reads in a new one.
> > >
> > > Current notable features:
> > > -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a drop-in replacement for clients that only need read support (until libhdfs++ also supports writes).
> > > -An asynchronous C++ API with synchronous shims on top if the client application wants to do blocking operations. Internally a single thread (optionally more) uses select/epoll by way of boost::asio to watch thousands of sockets without the overhead of spawning threads to emulate async operation.
> > > -Kerberos/SASL authentication support
> > > -HA namenode support
> > > -A set of utility programs that mirror the HDFS CLI utilities e.g. "./hdfs dfs -chmod". The major benefit of these is that tool startup time is ~3 orders of magnitude faster (<1ms vs hundreds of ms) and they occupy a lot less memory since they aren't dealing with the JVM. This makes it possible to do things like write a simple bash script that stats a file, applies some rules to the result, and decides if it should move it in a way that scales to thousands of files without being penalized with O(N) JVM startups.
> > > -Cancelable reads. This has proven to be very useful in multiuser applications that (pre)fetch large blocks of data but need to remain responsive for interactive users. Rather than waiting for a large and/or slow read to finish, the call returns immediately and the associated resources (buffer, file descriptor) become available for the rest of the application to use.
> > >
> > > There are a couple of known issues: the doc build isn't integrated with the rest of hadoop and the public API headers aren't being exported when building a distribution. A short term solution for missing docs is to go through the libhdfs(3) compatible API and use the libhdfs docs. Other than a few modifications to the pom files to integrate the build, the changes are isolated to a new directory so the chance of causing any regressions in the rest of the code is minimal.
> > >
> > > Please share your thoughts, thanks!