On 11/29/2022 8:16 AM, Gautham Banasandra wrote:
> … However, I don't see anyone stopping you from working on removing winutils. I encourage you to put across a PR and I would be glad to review the same.
That's not how it works. This is an intense undertaking. If I spent six months with no income, just rewriting all the native `FileSystem` implementations, and you simply gave a thumbs up to the PR, then yay, Apache would integrate my changes into the codebase? I hardly think so. There has to be official buy-in across the group and authorization to make such extensive changes. It's naive to say, "oh, just go rewrite it, and I'll review it and then it will be done".
However, I am interested in who funds your work. Do you work on Apache for free in your spare time? Or does some corporation pay you? If the latter, I'll be happy to submit my resume to them, so that they can fund me as well and I can start the work immediately. But as I've mentioned a couple of times already, I cannot financially justify sitting here rewriting Hadoop file systems without any income. If you find a creative way to make it financially viable for me, I would love to do it.
> One question I've is - how will you validate that your changes work fine and don't regress the existing functionality, given that we don't yet have a CI for Hadoop on Windows?
It's tempting to start giving you a detailed answer here, because it's a legitimate question. The more general answer is that we would discuss and form a plan with the group; you'll likely find that 1) the existing code doesn't even have sufficient tests, and 2) the existing API isn't even sufficiently documented. But your question is framed in a way completely different from how I conceptualize the issue. What I would be writing would be a completely native Java implementation of `FileSystem`. The tests accordingly should be written to be agnostic to the platform. If the tests run on Linux, they will run on Windows; if not, we need to file a bug against the JDK. I'm not even thinking in terms of a "CI for Hadoop for Windows". I just want to build the Java project, whether I'm running on Mac or Linux or Windows or whatever. (That was the point of my wanting to get rid of Winutils to begin with.)
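To make "agnostic to the platform" concrete, here is a minimal sketch, not actual Hadoop code. It assumes the file metadata in question can be read through the standard `java.nio.file` API alone; the class `PortableFileStatusSketch`, its nested `SimpleStatus` type, and the `statusOf` method are hypothetical names invented for illustration and are only a simplified stand-in for what a real `FileSystem` implementation would return.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class PortableFileStatusSketch {

    /** Hypothetical, simplified stand-in for file status metadata. */
    public static final class SimpleStatus {
        public final long length;
        public final boolean directory;
        public final long modifiedMillis;

        SimpleStatus(long length, boolean directory, long modifiedMillis) {
            this.length = length;
            this.directory = directory;
            this.modifiedMillis = modifiedMillis;
        }
    }

    /** Reads metadata using only the JDK, with no native helper such as winutils. */
    public static SimpleStatus statusOf(Path path) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        return new SimpleStatus(attrs.size(), attrs.isDirectory(),
                attrs.lastModifiedTime().toMillis());
    }

    /** The same check is intended to run unchanged on Linux, macOS, or Windows. */
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("portable-fs-sketch");
        Path file = dir.resolve("data.txt");
        try {
            Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
            SimpleStatus status = statusOf(file);
            if (status.length != 5 || status.directory) {
                throw new AssertionError("unexpected status for regular file");
            }
            if (!statusOf(dir).directory) {
                throw new AssertionError("expected a directory");
            }
        } finally {
            // Clean up the temporary file and directory.
            Files.deleteIfExists(file);
            Files.deleteIfExists(dir);
        }
    }
}
```

Nothing in the check refers to the operating system, which is the point: if it behaved differently on one platform, that would be a matter to raise against the JDK rather than a reason for a platform-specific CI.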
I also know that pragmatically whatever I do with the `FileSystem` implementation, something will initially break—not because of anything I did incorrectly, but because the Hadoop API is inadequate and people have therefore made a thousand brittle assumptions in their use of the API. Things will break already with or without my `FileSystem` implementation; that's why Hadoop is still using `DeprecatedRawLocalFileStatus`: someone made a new version but had to switch it off because something broke (HADOOP-9652 <https://issues.apache.org/jira/browse/HADOOP-9652> according to the comments).
In summary, yes, if I ever get buy-in and funding to rewrite `FileSystem` for native Java, we need to have a discussion with the wider group to form a plan for improving the documentation and for testing. But whatever we discuss or plan, things will eventually break because Hadoop doesn't have a well-documented API and doesn't cleanly separate the interface from the implementation. If I were to work on it, I would improve that situation so that things would be better documented and less brittle.
In the meantime my Bare Naked Local FileSystem <https://github.com/globalmentor/hadoop-bare-naked-local-fs> is meeting my needs pragmatically, and I'm leaving this mailing list—not to be antisocial, but because the unrelated (mostly automated) chatter is distracting to my other work.
Have a wonderful holiday season, and feel free to reach out directly.

Best,
Garret