PJ, I invite you to join and comment on the Tika lists. We are already working on standards in a number of the areas below, going beyond some of the basic things you cite. For example, we already do sentiment analysis, deep learning, and other NLP, and have these integrated into Tika as part of a broader ecosystem.
Feel free to join the discussion at d...@tika.apache.org. You can also read more about it under Advanced Content Integration here:

https://wiki.apache.org/tika/#Advanced_Content_Extraction_with_Tika_-_Integration

Look also at NER, object detection, text captioning, and computer vision.

Regarding participation in this committee at ECM, I’m definitely interested if it’s worthwhile.

Chris Mattmann

From: Piergiorgio Lucidi <piergior...@apache.org>
Reply-To: "dev@community.apache.org" <dev@community.apache.org>
Date: Tuesday, May 22, 2018 at 4:30 PM
To: "dev@community.apache.org" <dev@community.apache.org>
Subject: ASF involvement in the new ECM Standard Committee

Hi,

I'm directly involved in the new committee dedicated to designing the new white papers about the ECM / Content Services guidelines and toolkits. The main goal of these documents is to suggest best practices, guidelines and, starting from this year, Open Source technology stacks to use in the enterprise context.

For the last three years I contributed to the AIIM committee with Betsy Fanning, but now we will have a new home with a new team. Yesterday I had a very interesting discussion with Robert Blatt about the direction to follow for the next round of development.

Open Source will be the most relevant topic in the next iteration of our work, and we are discussing a potential white paper entirely dedicated to the Open Source alternatives in the market. Even though I'm currently contributing to this committee as an individual, it seems that we could be involved in this project as a foundation.

I think that it could be a good opportunity to spread our brand through collaborations like this one. We know the best practices, approaches and technology stacks, and we have a huge amount of experience, skills and projects in these areas.

I'm wondering whether the ASF has ever been involved in this kind of contribution, or whether our involvement could be a problem in terms of brand.
I have to ask for more details about this program, but in the meantime I would like to receive some feedback from you. I'm also asking because Robert Blatt is very interested in involving us officially in the program.

I would like to thank Shane for sharing, in our ComDev room on HipChat, the framework published by Mozilla some days ago. Mozilla presented a very interesting report, which also includes some technology stacks:

https://blog.mozilla.org/blog/2018/05/15/whats-your-open-source-strategy-here-are-10-answers/

Specifically, we are talking about areas such as Content, Search and Capture. Even though OCR is not present in our projects, we have some native integrations, for example with Tesseract in Tika. It would be interesting to understand which Apache projects can be combined with external libraries to build a custom Capture Services solution.

For example, considering the involvement of Tesseract, the proposal could be the following:

- Apache ManifoldCF for crawling any source content repository (API -> contents as images or PDF)
- Apache PDFBox for extracting images from PDF
- Apache ManifoldCF for injecting contents into Solr
- Tesseract for extracting text from images (configured inside Apache Tika)
- Apache Solr for indexing extracted text

We could also try to design a section entirely dedicated to the Apache technology stacks:

- Apache Content Services (Jackrabbit, ...)
- Apache Search Services (Lucene, Solr, ManifoldCF)
- Apache Semantic Services (UIMA, Stanbol, ...)
- Apache BigData Services (Hadoop, ...)
- Apache DevOps Services (Mesos, ...)
- Apache Libraries Services (Commons, ...)
- ... and so on :-P

This potential work could also be useful internally, to create our new Apache brochures dedicated to specific areas of our proposal. I'm not talking about something focused only on technologies, but also on best practices, approaches and a good path towards natural adoption.
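To make the data flow of the Capture Services proposal above easier to follow, here is a minimal sketch in Python. It only models the order of the stages (crawl, image extraction, injection into Solr, OCR, indexing) as listed in the proposal. Every function and name in it is a hypothetical stand-in: nothing here calls the real ManifoldCF, PDFBox, Tika, Tesseract or Solr APIs, and a real deployment would wire those components together through their own connectors and configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    source_id: str
    payload: bytes                                   # raw PDF or image bytes from the repository
    images: List[bytes] = field(default_factory=list)
    text: str = ""
    trace: List[str] = field(default_factory=list)   # names of the stages applied, in order


Stage = Callable[[Document], Document]


def crawl(repository: Dict[str, bytes]) -> List[Document]:
    # Stand-in for ManifoldCF crawling a source content repository.
    return [Document(source_id=key, payload=value) for key, value in repository.items()]


def extract_images(doc: Document) -> Document:
    # Stand-in for PDFBox: pretend the whole payload is a single page image.
    doc.images = [doc.payload]
    return doc


def submit(doc: Document) -> Document:
    # Stand-in for ManifoldCF handing the content over to Solr (no-op here).
    return doc


def ocr_images(doc: Document) -> Document:
    # Stand-in for Tesseract configured inside Tika: decode bytes instead of running OCR.
    doc.text = " ".join(img.decode("utf-8", errors="replace") for img in doc.images)
    return doc


def make_indexer(index: Dict[str, str]) -> Stage:
    # Stand-in for Solr: the "index" is just a dictionary of id -> extracted text.
    def index_text(doc: Document) -> Document:
        index[doc.source_id] = doc.text
        return doc
    return index_text


def run_pipeline(repository: Dict[str, bytes], index: Dict[str, str]) -> List[Document]:
    stages = [
        ("extract_images", extract_images),
        ("submit", submit),
        ("ocr", ocr_images),
        ("index", make_indexer(index)),
    ]
    docs = crawl(repository)
    for doc in docs:
        doc.trace.append("crawl")
        for name, stage in stages:
            stage(doc)
            doc.trace.append(name)
    return docs
```

Calling `run_pipeline({"scan-001": b"invoice 42"}, index)` with an empty dictionary as `index` fills it with the "extracted" text per document, and each document's `trace` records the stage order from the list above.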
I'm trying to understand whether contributing on one side (ECM Standards) can help me to design and improve our Apache brochures. On the other hand, the Apache areas could also be useful for the new white papers.

Please let me know what you think.

Thank you.

Cheers,
PJ

--
Piergiorgio