Thanks for the feedback, Chris and Kai! Chris, do you mean potentially landing this in its current state and handling some of the rough edges afterwards? I could see this working, since there's no impact on any existing code.

With regard to your questions, Kai:

There isn't a good doc for the internal architecture yet; I just reassigned HDFS-9115 to myself to handle that. Are there any specific areas you'd like to know about so I can prioritize those? Here are some header files with a lot of comments that should help for now:
-hdfspp.h - main header for the C++ API
-filesystem.h and filehandle.h - describe some rules about object lifetimes and threading from the API point of view (most classes have comments describing any restrictions on threading, locking, and lifecycle).
-rpc_engine.h and rpc_connection.h - begin getting into the async RPC implementation.

1) Yes, it's a reimplementation of the entire client in C++. Using libhdfs3 as a reference helps a lot here but it's still a lot of work.
2) EC isn't supported now, though that'd be great to have, and I agree that it's going to take a lot of effort to implement. Right now if you tried to read an EC file I think you'd get some unhelpful error out of the block reader, but I don't have an EC-enabled cluster set up to test. Adding an explicit "not supported" message would be straightforward.
3) libhdfs++ reuses all of the minidfscluster tests that libhdfs already had, so we get consistency checks on the C API. There are a few new tests that also get run on both libhdfs and libhdfs++ and make sure the expected output is the same too.
4) I agree, I just haven't had a chance to look into the distribution build to see how to do it. HDFS-9465 is tracking this.
5) Not yet (HDFS-8765).

Regards,
James

On Thu, Mar 1, 2018 at 4:28 AM, 郑锴(铁杰) <zhengkai...@alibaba-inc.com> wrote:
> The work sounds solid and great! + to have this.
>
> Is there any quick doc to take a glance at? Some quick questions to be
> familiar with:
> 1. Seems the client is all implemented in c++ without any Java codes (so
> no JVM overhead), which means lots of work, rewriting HDFS client. Right?
> 2. Guess erasure coding feature isn't supported, as it'd involve
> significant development, right? If yes, what will it say when read erasure
> coded file?
> 3. Is there any building/testing mechanism to enforce the consistency
> between the c++ part and Java part?
> 4. I thought the public header and lib should be exported when building
> the distribution package, otherwise hard to use the new C api.
> 5. Is the short-circuit read supported?
>
> Thanks.
>
>
> Regards,
> Kai
>
> ------------------------------------------------------------------
> From: Chris Douglas <cdoug...@apache.org>
> Sent: Thursday, March 1, 2018 05:08
> To: Jim Clampffer <james.clampf...@gmail.com>
> Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
> Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
>
> +1
>
> Let's get this done. We've had many false starts on a native HDFS
> client. This is a good base to build on. -C
>
> On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
> <james.clampf...@gmail.com> wrote:
> > Hi everyone,
> >
> > I'd like to start a thread to discuss merging the HDFS-8707 aka libhdfs++
> > into trunk. I originally sent a similar email out last October but it
> > sounds like it was buried by discussions about other feature merges that
> > were going on at the time.
> >
> > libhdfs++ is an HDFS client written in C++ designed to be used in
> > applications that are written in non-JVM based languages. In its current
> > state it supports kerberos authenticated reads from HDFS and has been used
> > in production clusters for over a year so it has had a significant amount
> > of burn-in time. The HDFS-8707 branch has been around for about 2 years
> > now so I'd like to know people's thoughts on what it would take to merge
> > the current branch and handle writes and encrypted reads in a new one.
> >
> > Current notable features:
> > -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a
> > drop-in replacement for clients that only need read support (until
> > libhdfs++ also supports writes).
> > -An asynchronous C++ API with synchronous shims on top if the client
> > application wants to do blocking operations. Internally a single thread
> > (optionally more) uses select/epoll by way of boost::asio to watch
> > thousands of sockets without the overhead of spawning threads to emulate
> > async operation.
> > -Kerberos/SASL authentication support
> > -HA namenode support
> > -A set of utility programs that mirror the HDFS CLI utilities e.g.
> > "./hdfs dfs -chmod". The major benefit of these is the tool startup time
> > is ~3 orders of magnitude faster (<1ms vs hundreds of ms) and occupies a
> > lot less memory since it isn't dealing with the JVM. This makes it
> > possible to do things like write a simple bash script that stats a file,
> > applies some rules to the result, and decides if it should move it in a way
> > that scales to thousands of files without being penalized with O(N) JVM
> > startups.
> > -Cancelable reads. This has proven to be very useful in multiuser
> > applications that (pre)fetch large blocks of data but need to remain
> > responsive for interactive users. Rather than waiting for a large and/or
> > slow read to finish it will return immediately and the associated resources
> > (buffer, file descriptor) become available for the rest of the application
> > to use.
> >
> > There are a couple known issues: the doc build isn't integrated with the
> > rest of hadoop and the public API headers aren't being exported when
> > building a distribution. A short term solution for missing docs is to go
> > through the libhdfs(3) compatible API and use the libhdfs docs.
> > Other than a few modifications to the pom files to integrate the build,
> > the changes are isolated to a new directory so the chance of causing any
> > regressions in the rest of the code is minimal.
> >
> > Please share your thoughts, thanks!
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
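
As a rough illustration of the "asynchronous C++ API with synchronous shims on top" item in the feature list quoted above, here is a minimal single-threaded sketch of the pattern. Every name in it (EventLoop, async_read, sync_read, ReadResult) is made up for this example and is not the libhdfs++ API; the in-memory string stands in for a remote file, and the toy task queue stands in for the boost::asio loop the real client uses.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop: a plain FIFO queue of tasks.
class EventLoop {
public:
    // Queue a task to be run later by run_one().
    void post(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    // Run the oldest queued task; returns false if nothing was queued.
    bool run_one() {
        if (tasks_.empty()) return false;
        std::function<void()> task = std::move(tasks_.front());
        tasks_.pop_front();
        task();
        return true;
    }

private:
    std::deque<std::function<void()>> tasks_;
};

// Hypothetical result type: ok is false when the offset is past the end.
struct ReadResult {
    bool ok;
    std::string data;
};

using ReadCallback = std::function<void(ReadResult)>;

// Asynchronous "read": returns immediately; the callback fires only once
// the loop gets around to running the queued task.
inline void async_read(EventLoop& loop, const std::string& source,
                       std::size_t offset, std::size_t length,
                       ReadCallback done) {
    assert(static_cast<bool>(done) && "callback must be callable");
    loop.post([source, offset, length, done]() {
        if (offset > source.size()) {
            done(ReadResult{false, std::string()});
            return;
        }
        done(ReadResult{true, source.substr(offset, length)});
    });
}

// Synchronous shim: issue the async call, then drive the loop until the
// callback has fired, and hand the result back to the caller.
inline ReadResult sync_read(EventLoop& loop, const std::string& source,
                            std::size_t offset, std::size_t length) {
    bool finished = false;
    ReadResult result{false, std::string()};
    async_read(loop, source, offset, length,
               [&finished, &result](ReadResult r) {
                   result = std::move(r);
                   finished = true;
               });
    while (!finished && loop.run_one()) {
    }
    return result;
}
```

A caller that wants blocking behavior would call sync_read(loop, data, offset, length); a caller that wants to stay asynchronous would call async_read and keep driving loop.run_one() itself.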