You can also count me in as a mentor. John
On Wed, Aug 2, 2017 at 3:14 PM Steve Lawrence <stephen.d.lawre...@gmail.com> wrote: > Understood. Thanks for the interest! > > - Steve > > On 08/02/2017 02:57 PM, Dave Fisher wrote: > > Hi Steve, > > > > It was not so much the lack of committers as it was the current > diversity. That is not a blocker for entry to Incubation. > > > > I am willing to be one of the Mentors. Once there are at least two more > we can push forward. > > > > Regards, > > Dave > > > >> On Aug 1, 2017, at 5:09 AM, Steve Lawrence < > stephen.d.lawre...@gmail.com> wrote: > >> > >> Discussions have died down, and I think the consensus from the responses > >> is that the issues are 1) the lack of committers and 2) the lack of a > >> champion and mentors. We hope to address #1 and grow the community as > >> part of incubation. Is anyone interested in being a champion or mentor > >> and help us with #2? > >> > >> Thanks, > >> - Steve > >> > >> On 07/26/2017 04:06 PM, Chris Mattmann wrote: > >>> This sounds like a very interesting project. > >>> > >>> I don’t have the time to mentor at the moment but I will keep a close > eye on it. > >>> > >>> Cheers, > >>> Chris Mattmann > >>> > >>> > >>> > >>> > >>> On 7/25/17, 11:53 AM, "McHenry, Kenton Guadron" <mche...@illinois.edu> > wrote: > >>> > >>> Hi Dave, > >>> > >>> The developers that were at NCSA have moved on to other > organizations. While we still leverage Daffodil and are very much > interested in seeing it move forward, development is currently done by the > Tresys team. Agreed on the synergy with Tika. > >>> > >>> Kenton McHenry, Ph.D. > >>> Principal Research Scientist, Adjunct Assistant Professor of > Computer Science > >>> Deputy Director of the Scientific Software & Applications Division > >>> National Center for Supercomputing Applications, University of > Illinois at Urbana-Champaign > >>> > >>> On Jul 24, 2017, at 1:55 PM, Dave Fisher <dave2w...@comcast.net > <mailto:dave2w...@comcast.net>> wrote: > >>> > >>> Hi Kenton, > >>> > >>> Is there any reason that you and others from the NCSA are not > Initial Committers? That would make this proposal stronger. > >>> > >>> Regarding Apache Tika - it relies on other projects including > Apache POI and Apache PDFBox. They are pragmatic about what is used. If > Daffodil works to expand then I think that there would be good synergy > between the projects. I know as a POI PMC member that the POI community has > significantly benefited from the Tika community some of whom are from Mitre. > >>> > >>> To date Tika has not emphasized structured data, although they do > extract content from Excel and OpenOffice. > >>> > >>> I am intrigued. > >>> > >>> Regards, > >>> Dave > >>> > >>> On Jul 24, 2017, at 10:55 AM, McHenry, Kenton Guadron < > mche...@illinois.edu<mailto:mche...@illinois.edu>> wrote: > >>> > >>> Yes, DFDL and its open source implementation Daffodil are more > about file formats and getting access to the entirety of a file's contents > in a consistent way through machine readable specifications. The work has > implications in the area of digital preservation allowing one to preserve > these machine readable specifications rather than all the tools needed to > open/save a file in order to work with it. Imagine someone developing > graphics software to work with 3D models and not having to worry about the > hundreds of formats out there for 3D meshes (whether there are tools for > opening the files and whether they can get access to those tools, whether > the spec is available and worrying about how complex that spec is to > implement, etc.), and simply building their code around the contents (e.g. > vertices, faces, etc.). One could come up with similar scenarios for other > data types (documents, images, videos, audio, depth data, numeric data). > Ideally tools built supporting DFDL, could someday, support any format for > that type without the developer having to worry about the details of how > that data is represented within a file. > >>> > >>> Kenton McHenry, Ph.D. > >>> Principal Research Scientist, Adjunct Assistant Professor of > Computer Science > >>> Deputy Director of the Scientific Software & Applications Division > >>> National Center for Supercomputing Applications, University of > Illinois at Urbana-Champaign > >>> > >>> On Jul 24, 2017, at 10:30 AM, Steve Lawrence < > stephen.d.lawre...@gmail.com<mailto:stephen.d.lawre...@gmail.com><mailto: > stephen.d.lawre...@gmail.com>> wrote: > >>> > >>> I'll preface this saying that I don't have a ton of experience with > >>> Apache Tika. But based on my understanding, Tika and Daffodil do > have > >>> somewhat similar goals, but reach them in different ways. For > example, > >>> Tika requires that one writes /code/ to perform data extraction, > usually > >>> relying on existing Java libraries to extract the desired metadata. > The > >>> downside to this is that code can be buggy, and libraries might not > even > >>> exist for formats of interest (especially common with legacy and > >>> military data). > >>> > >>> Daffodil, on the other hand, does not require one to write any code. > >>> Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL > >>> annotations) that fully describes the data, which Daffodil then > uses to > >>> convert the data to XML/JSON for extraction. So adding support for > a new > >>> format means writing a new schema rather than new code. And less > code > >>> generally means less bugs. Also, for secure systems that require > >>> certification, generally speaking, it is easier to certify a schema > as > >>> compared to code. > >>> > >>> We certainly don't believe that Daffodil could replace Tika, but it > does > >>> have the potential to add new functionality to Tika for formats > that do > >>> not have existing libraries. One of our goals is to look into > >>> integrating Daffodil support into tools like Tika. We'd love to hear > >>> from Tika devs if this is something they'd be interested in. > >>> > >>> I'll also add that whereas Tika tends to focus primarily on > metadata, > >>> DFDL schemas usually describe an entire file format down to the > byte, so > >>> one can extract more than just meta data, including text and binary > >>> data. Further differentiating, Daffodil has support for serializing > data > >>> (called unparse) from the XML/JSON representation, allowing one to > >>> transform or filter data as well. We don't believe this feature is > all > >>> that applicable to Tika, but may be useful to other technologies > such as > >>> filtering or data fuzzing technologies. > >>> > >>> - Steve > >>> > >>> > >>> On 07/24/2017 10:59 AM, Mike Drob wrote: > >>> What is the relationship between Daffodil and something like Apache > Tika's > >>> extraction engine? > >>> > >>> On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence < > >>> stephen.d.lawre...@gmail.com<mailto:stephen.d.lawre...@gmail.com > ><mailto:stephen.d.lawre...@gmail.com>> wrote: > >>> > >>> Dear Apache Incubator Community, > >>> > >>> We would like to start a discussion around a proposal to bring > Daffodil > >>> into the Apache Incubator. Daffodil is a implementation of the DFDL > >>> specification used to convert between fixed format data and > XML/JSON. > >>> > >>> The draft proposal can be found in the wiki at the following URL: > >>> > >>> https://wiki.apache.org/incubator/DaffodilProposal > >>> > >>> We do not yet have a champion or mentors, but it was recommended > that we > >>> create a proposal and send it to this list to potentially find those > >>> that might be interested. The text for the draft proposal is found > >>> below. We look forward to your input. > >>> > >>> Thanks, > >>> -Steve > >>> > >>> > >>> = Daffodil Proposal = > >>> > >>> == Abstract == > >>> > >>> Daffodil is an implementation of the Data Format Description > Language > >>> (DFDL) used to convert between fixed format data and XML/JSON. > >>> > >>> == Proposal == > >>> > >>> The Data Format Description Language (DFDL) is a specification, > >>> developed by the Open Grid Forum, capable of describing many data > >>> formats, including both textual and binary, scientific and numeric, > >>> legacy and modern, commercial record-oriented, and many industry and > >>> military standards. It defines a language that is a subset of W3C > XML > >>> schema to describe the logical format of the data, and annotations > >>> within the schema to describe the physical representation. > >>> > >>> Daffodil is an open source implementation of the DFDL specification > that > >>> uses these DFDL schemas to parse fixed format data into an infoset, > >>> which is most commonly represented as either XML or JSON. This > allows > >>> the use of well-established XML or JSON technologies and libraries > to > >>> consume, inspect, and manipulate fixed format data in existing > >>> solutions. Daffodil is also capable of the reverse by serializing or > >>> "unparsing" an XML or JSON infoset back to the original data format. > >>> > >>> == Background == > >>> > >>> Many different software solutions need to consume and manage data, > >>> including data directed routing, databases, data analysis, data > >>> cleansing, data visualizing, and more. A key aspect of such > solutions is > >>> the need to transform the data into an easily consumable format. > >>> Usually, this means that for each unique data format, one develops a > >>> tool that can read and extract the necessary information, often > leading > >>> to ad-hoc and data-format-specific description systems. Such > systems are > >>> often proprietary, not well tested, and incompatible, leading to > vendor > >>> lock-in, flawed software, and increased training costs. DFDL is a > new > >>> standard, with version 1.0 completed in October of 2016, that solves > >>> these problems by defining an open standard to describe many > different > >>> data formats and how to parse and unparse between the data and > XML/JSON. > >>> > >>> Two closed source implementations of DFDL currently exist. The > first was > >>> created by IBM and is now part of their IBM® Integration Bus > product. > >>> The second was created by the European Space Agency, called DFDL4S > or > >>> "DFDL for Space" targeted at the challenges of their satellite data > >>> processing. > >>> > >>> Around 2005, Pacific Northwest National Lab created Defuddle, built > as > >>> an open source implementation and proof of concept of the draft DFDL > >>> specification and a test bed to feed new concepts into specification > >>> development. Primary development of Defuddle was eventually taken > over > >>> by the National Center for Supercomputing Applications (NCSA). > However, > >>> due to evolution of the DFDL specification and architectural and > >>> performance issues with Defuddle, around 2009, NCSA restarted the > >>> project with the new name of Daffodil, with a goal of implementing > the > >>> complete DFDL specification. Daffodil development continued at NCSA > >>> until around 2012, at which point development slowed due to budget > >>> limitations. Shortly thereafter, primary development was picked up > by > >>> Tresys Technology where it continues today, with contributions from > >>> other entities such as the Navy Research Lab, the Air Force Research > >>> Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil > >>> version 1.0.0 was released, including support for the DFDL features > >>> needed to parse many common file formats. Daffodil version 2.0.0 is > >>> expected to be released in August of 2017, which will include > unparse > >>> support with one-to-one parsing feature parity. > >>> > >>> Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, > Quark > >>> Security, Raytheon, and Tresys Technology have developed DFDL > schemas > >>> for many data formats from varying technology domains, including > PNG, > >>> GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and > MIL-STD-2045, > >>> many of which are publicly available on the DFDL Schemas github. > There > >>> are also a number of military-application data formats, the > >>> specifications of which are not public, which have historically been > >>> very difficult and expensive to process, and for which DFDL schemas > have > >>> been created or are actively in development; these include > >>> MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, MIL-STD-6016/NATO STANAG > 5516 > >>> (aka "Link16"). > >>> > >>> == Rationale == > >>> > >>> Numerous software solutions exist that consume, inspect, analyze, > and > >>> transform data, many of which can be found in the Apache Software > >>> Foundation (ASF). In order for tools like these to consume new > types of > >>> data, custom extensions are usually required, often with high > >>> development and testing costs. Daffodil fills a clear gap in many of > >>> these solutions, providing a simple and low cost way to transform > data > >>> to XML or JSON, which many of these tools natively support already. > With > >>> the upcoming 2.0.0 release, the Daffodil project will have achieved > a > >>> level of functionality in both parse and unparse that, when > integrated > >>> into existing solutions, could provide for a new method to quickly > >>> enable support for new data formats. > >>> > >>> == Initial Goals == > >>> > >>> * Relicense the existing code from the University of Illinois/NCSA > Open > >>> Source License to the Apache License version 2.0, working with > Apache > >>> Legal to ensure correctness, and with Daffodil contributors to get > >>> their permission. > >>> * Move the existing codebase, documentation, bugs, and mailing > lists to > >>> the Apache hosted infrastructure > >>> * Establish a formal release process and schedule, allowing for > >>> dependable release cycles in a manner consistent with the Apache > >>> development process. > >>> * Build relationships with ASF projects to add Daffodil support > where > >>> appropriate > >>> * Grow the community to establish a diversity of background and > expertise. > >>> > >>> == Current Status == > >>> > >>> === Meritocracy === > >>> > >>> All initial committers are familiar with the principles of > meritocracy. > >>> The Daffodil project has followed the model of meritocracy in the > past, > >>> providing multiple outside entities commit access based on the > quality > >>> of their contributions. In order to grow the Daffodil user base and > >>> development community, we are dedicated to continuing to operate > >>> Daffodil as a meritocracy. > >>> > >>> A key ingredient in a meritocracy of developers is open group code > >>> review. The Daffodil project has operated in this mode throughout > its > >>> existence and this provides a forum to improve the code, verify code > >>> quality, and educate new developers on the code base. > >>> > >>> === Community === > >>> > >>> Daffodil has a small community of users and developers. Although > primary > >>> Daffodil development is done by Tresys Technology, a handful of > other > >>> contributions have come from other entities including the Navy > Research > >>> Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In > >>> addition to developers, multiple users of Daffodil have created DFDL > >>> schemas, including entities such as MITRE, IBM, Raytheon, Quark > >>> Security, and Tresys Technology. The DFDL Schemas github community > has > >>> been created as a place for DFDL schemas to be published. The > Daffodil > >>> project also makes use of mailing lists, !HipChat, and Confluence > >>> Questions to build a community of users and system for support. > >>> > >>> === Core Developers === > >>> > >>> The core developers of Daffodil are employed by Tresys Technology. > We > >>> will work to grow the community among a more diverse set of > developers > >>> and industries. > >>> > >>> === Alignment === > >>> > >>> Daffodil was created as an open source project with a philosophy > >>> consistent with The Apache Way. A strong belief in meritocracy, > >>> community involvement in decisions, openness, and ensuring a high > level > >>> of quality in code, documentation, and testing are some of our > shared > >>> core beliefs. > >>> > >>> Further, as mentioned in the Rationale section, Daffodil fills a gap > >>> that exists in many ASF projects, including !NiFi, Spark, Storm, > Hadoop, > >>> Tika, and others. In order for tools like these to consume new > types of > >>> data, custom extensions are usually required. Rather than create > such > >>> extensions, Daffodil provides an easy and standards-compliant way to > >>> transform data to XML or JSON, which many of these tools already > >>> natively support. > >>> > >>> == Known Risks == > >>> > >>> === Orphaned Products === > >>> > >>> The current core developers are the leading contributors in the > space of > >>> DFDL and wish to see it flourish. Though there is some risk that the > >>> initial committers all come from the same company, a goal of > entering > >>> into incubation is to grow the development community to minimize the > >>> risk of reliance on a single company. > >>> > >>> === Inexperience with Open Source === > >>> > >>> The Daffodil project began as an open source project and has > continued > >>> that model throughout development. This includes public bug > tracking, > >>> git revision control, automated builds and tests, and a public wiki > for > >>> documentation. > >>> > >>> Additionally, the current core developers and initial committers all > >>> work for a company that relies on, believes in, promotes, and has > led or > >>> contributed to many open source software projects, including SELinux > >>> Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As > such, > >>> there is low risk related to inexperience with open source software > and > >>> processes. > >>> > >>> === Homogeneous Developers === > >>> > >>> The proposed initial committers come from a single entity, though > we are > >>> committed to growing the Daffodil development community to include a > >>> broad group of additional committers from a wide array of > industries. > >>> > >>> === Reliance on Salaried Developers === > >>> > >>> The proposed initial committers are paid by their employer to > contribute > >>> to the Daffodil project. We expect that Daffodil development will > >>> continue with salaried developers, and are committed to growing the > >>> community to include non-salaried developers as well. > >>> > >>> === Relationship with other Apache Projects === > >>> > >>> As mentioned in the Alignment section, Daffodil fills a clear gap in > >>> numerous other ASF projects that consume and manage large amounts > of data. > >>> > >>> As a specific example, Daffodil developers have created a Daffodil > >>> Apache !NiFi Processor, currently in use in data transfer solutions, > >>> which allows one to ingest non-native data into an Apache !NiFi > pipeline > >>> as XML or JSON. This processor was well received by the Apache !NiFi > >>> developers, with positive comments about the concise API and how it > >>> could handle non-native data. Daffodil developers have also > successfully > >>> prototyped integration with Apache Spark. We believe Daffodil could > >>> provide a strong benefit to many other ASF projects that handle > fixed > >>> format data. We anticipate working closely with such ASF projects to > >>> include Daffodil where applicable to increase their ability to > support > >>> new data formats with minimal effort. > >>> > >>> Daffodil also depends on existing ASF projects, including Apache > Commons > >>> and Apache Xerces. > >>> > >>> === An Excessive Fascination with the Apache Brand === > >>> > >>> Although the Apache brand may certainly help to attract more > >>> contributors, publicity is not the reason for this proposal. We > believe > >>> Daffodil could provide a great benefit to the ASF and the numerous > data > >>> focused projects that comprise it, as described in the Rationale and > >>> Alignment sections. We hope to build a strong and vibrant community > >>> built around The Apache Way, and not dependent on a single company. > >>> > >>> === Documentation === > >>> > >>> Daffodil documentation can be found at: > >>> > >>> * > >>> https://opensource.ncsa.illinois.edu/confluence/ > >>> display/DFDL/Daffodil%3A+Open+Source+DFDL > >>> > >>> Information about DFDL can be found at: > >>> > >>> * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl > >>> * > >>> https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0. > >>> 0/com.ibm.etools.mft.doc/df20060_.htm > >>> > >>> Public examples of DFDL Schemas can be found at: > >>> > >>> * https://github.com/DFDLSchemas > >>> > >>> == Initial Source == > >>> > >>> The Daffodil git repo goes back to mid-2011 with approximately 20 > >>> different contributors and feedback from many users and developers. > The > >>> core codebase is written in Scala and includes both a Scala and Java > >>> API, along with Javadocs and Scaladocs for API usage. The initial > code > >>> will come from the git repository currently hosted by NCSA at the > >>> University of Illinois : > >>> > >>> https://opensource.ncsa.illinois.edu/bitbucket/ > >>> projects/DFDL/repos/daffodil/ > >>> > >>> == Source and Intellectual Property Submission == > >>> > >>> The complete Daffodil code is licensed under the University of > >>> Illinois/NCSA Open Source License. Much of the current codebase has > been > >>> developed by Tresys Technology, who is open to relicensing the code > to > >>> the Apache License version 2.0 and donate the source to the ASF. > >>> Contacts at NCSA are also open to relicensing their contributions to > >>> Apache v2. We plan to contact the other contributors and ask for > >>> permission to relicense and donate their contributed code. For those > >>> that decline or we cannot contact, their code will be removed or > >>> replaced. We will work closely with Apache Legal to ensure all > issues > >>> related to relicensing are acceptable. > >>> > >>> == External Dependencies == > >>> > >>> We believe all current dependencies are compatible with the ASF > >>> guidelines. Our dependency licenses come from the following license > >>> styles: Apache v2, BSD, MIT, and ICU. The list of current Daffodil > >>> dependencies and their licenses are documented here: > >>> > >>> https://opensource.ncsa.illinois.edu/confluence/ > >>> display/DFDL/Dependencies+and+Licenses > >>> > >>> == Cryptography == > >>> > >>> None > >>> > >>> == Required Resources == > >>> > >>> === Mailing Lists === > >>> > >>> * comm...@daffodil.incubator.apache.org > >>> * d...@daffodil.incubator.apache.org > >>> * priv...@daffodil.incubator.apache.org > >>> * u...@daffodil.incubator.apache.org > >>> > >>> === Source Control === > >>> > >>> git://git.apache.org/incubator-daffodil.git > >>> > >>> === Issue Tracking === > >>> > >>> JIRA Daffodil (DFDL) > >>> > >>> === Initial Committers === > >>> > >>> * Beth Finnegan <efinnegan at tresys dot com> > >>> * Dave Thompson <dthompson at tresys dot com> > >>> * Josh Adams <jadams at tresys dot com> > >>> * Mike Beckerle <mbeckerle at tresys dot com> > >>> * Steve Lawrence <slawrence at tresys dot com> > >>> * Taylor Wise <twise at tresys dot com> > >>> > >>> === Affiliations === > >>> > >>> * Beth Finnegan (Tresys Technology) > >>> * Dave Thompson (Tresys Technology) > >>> * Josh Adams (Tresys Technology) > >>> * Mike Beckerle (Tresys Technology) > >>> * Steve Lawrence (Tresys Technology) > >>> * Taylor Wise (Tresys Technology) > >>> > >>> == Sponsors == > >>> > >>> === Champion === > >>> > >>> * TBD > >>> > >>> === Nominated Mentors === > >>> > >>> * TBD > >>> > >>> === Sponsoring Entity === > >>> > >>> We request the Apache Incubator to sponsor this project. > >>> > >>> > --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > >>> For additional commands, e-mail: general-h...@incubator.apache.org > >>> > >>> > >>> > >>> > >>> > >>> > --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > <mailto:general-unsubscr...@incubator.apache.org> > >>> For additional commands, e-mail: general-h...@incubator.apache.org > <mailto:general-h...@incubator.apache.org> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > >>> For additional commands, e-mail: general-h...@incubator.apache.org > >>> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > >> For additional commands, e-mail: general-h...@incubator.apache.org > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org >