[ https://issues.apache.org/jira/browse/TIKA-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bear R Giles updated TIKA-4353:
-------------------------------
    Description:
The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements and attributes. While it's easy for a user to copy and paste this file into a custom parser, it takes some effort to understand how to make the required changes. It's also a one-off effort: the work can't be reused elsewhere.

Given that there's already a dependency on jsoup, a far better solution is to create a parser that accepts a Safelist instead of using a hardcoded list. This Safelist can be validated and used elsewhere and, perhaps more importantly, it makes the transition from a jsoup-based solution to a Tika-based solution much more transparent.

NOTE: a Safelist is a POJO and is NOT limited to the jsoup parser.

h2. Preliminary implementation

I have a preliminary implementation that's not yet ready for a POC pull request.

h3. HtmlParserWithSafelist

This parser is a very stripped-down copy of the DefaultHtmlParser. All existing static elements have been removed and replaced with the appropriate calls to Safelist methods.

This parser also includes a few proposed improvements:
* it captures 'unsafe' elements and attributes. This allows developers to fine-tune their own Safelist implementations
* it adds optional support for the 'data-*' wildcard. This is an HTML5 feature (custom data attributes) intended to replace ad hoc custom attributes

h3. DefaultHtmlSafelist

The jsoup Safelist already provides a few reference implementations, but they don't fit our needs. This class adds two. In addition, it adds support for wildcard attributes beyond the "data-*" mentioned earlier.
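To make the filtering behavior described above more concrete (safelist lookup, optional 'data-*' support, wildcard attributes, and recording of what was rejected), here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: `SafelistFilterSketch`, its methods, and the trailing-`*` pattern convention are all hypothetical names invented for this example. It does not use the jsoup Safelist API and is not the code from the preliminary implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical stand-in for a safelist-driven filter. It is NOT the jsoup
 * Safelist API and NOT the code from the preliminary implementation.
 */
final class SafelistFilterSketch {
    /** Lower-case tag name -> allowed attribute names; a trailing '*' marks a wildcard. */
    private final Map<String, Set<String>> allowed;
    /** Whether data-* attributes are accepted on every allowed tag. */
    private final boolean allowDataWildcard;
    /** Everything that was filtered out, in the order it was seen. */
    private final List<String> rejected = new ArrayList<>();

    SafelistFilterSketch(Map<String, Set<String>> allowed, boolean allowDataWildcard) {
        this.allowed = Map.copyOf(allowed);
        this.allowDataWildcard = allowDataWildcard;
    }

    /** Returns true if the tag is on the safelist; otherwise records it as rejected. */
    boolean acceptTag(String tag) {
        String t = tag.toLowerCase(Locale.ROOT);
        boolean ok = allowed.containsKey(t);
        if (!ok) {
            rejected.add("<" + t + ">");
        }
        return ok;
    }

    /** Returns true if the attribute is allowed on an allowed tag; otherwise records it. */
    boolean acceptAttribute(String tag, String attribute) {
        String t = tag.toLowerCase(Locale.ROOT);
        String a = attribute.toLowerCase(Locale.ROOT);
        Set<String> patterns = allowed.get(t);
        boolean ok = patterns != null
                && ((allowDataWildcard && matches("data-*", a))
                    || patterns.stream().anyMatch(p -> matches(p, a)));
        if (!ok) {
            rejected.add(t + "@" + a);
        }
        return ok;
    }

    /** Exact match, or prefix match when the pattern ends in '*' (the suffix must be non-empty). */
    static boolean matches(String pattern, String name) {
        if (pattern.endsWith("*")) {
            String prefix = pattern.substring(0, pattern.length() - 1);
            return name.length() > prefix.length() && name.startsWith(prefix);
        }
        return pattern.equals(name);
    }

    /** Read-only snapshot of the rejected tags and attributes. */
    List<String> rejected() {
        return List.copyOf(rejected);
    }

    public static void main(String[] args) {
        SafelistFilterSketch filter = new SafelistFilterSketch(
                Map.of("a", Set.of("href"), "div", Set.of("aria-*")), true);
        filter.acceptTag("script");              // not on the safelist, so recorded
        filter.acceptAttribute("a", "onclick");  // not allowed on <a>, so recorded
        filter.acceptAttribute("a", "data-id");  // accepted through the data-* option
        System.out.println("rejected: " + filter.rejected());
    }
}
```

The rejected list is what would let a developer see which elements and attributes a given safelist drops and adjust it accordingly. In a real parser the lookups would presumably be delegated to the Safelist object itself rather than to a map like the one assumed here.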
*DEFAULT*

This implementation reproduces the existing behavior with a few improvements:
* <source> (since it contains an external reference)
* <form> (since "action" can be an embedded script)
* <button> and <input> (since they have a "formaction" attribute)
* all global attributes
* all form_control, mouse, keyboard, and clipboard events
* <body> and all window events
* <head> (just for completeness with <body>)

IIRC the existing elements have gained a few new attributes with HTML5, but I haven't addressed that.

*HTML5*

This implementation adds many new HTML5 tags, with an emphasis on the tags that provide semantic context, e.g., <section>, <article>, <time>, etc.

  was:
The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements and attributes. While it's easy for a user to copy and paste this file into a custom parser, it takes some effort to understand how to make the required changes. It's also a one-off effort: the work can't be reused elsewhere.

Given that there's already a dependency on jsoup, a far better solution is to create a parser that accepts a Safelist instead of using a hardcoded list. This Safelist can be validated and used elsewhere and, perhaps more importantly, it makes the transition from a jsoup-based solution to a Tika-based solution much more transparent.

NOTE: a Safelist is a POJO and is NOT limited to the jsoup parser.

h2. Preliminary implementation

I have a preliminary implementation that's not yet ready for a POC pull request.

HtmlParserWithSafelist

This parser is a very stripped-down copy of the DefaultHtmlParser. All existing static elements have been removed and replaced with the appropriate calls to Safelist methods.

This parser also includes a few proposed improvements:
* it captures 'unsafe' elements and attributes. This allows developers to fine-tune their own Safelist implementations
* it adds optional support for the 'data-*' wildcard. This is an HTML5 feature (custom data attributes) intended to replace ad hoc custom attributes

h3. DefaultHtmlSafelist

The jsoup Safelist already provides a few reference implementations, but they don't fit our needs. This class adds two. In addition, it adds support for wildcard attributes beyond the "data-*" mentioned earlier.

DEFAULT

This implementation reproduces the existing behavior with a few improvements:
* <source> (since it contains an external reference)
* <form> (since "action" can be an embedded script)
* <button> and <input> (since they have a "formaction" attribute)
* all global attributes
* all form_control, mouse, keyboard, and clipboard events
* <body> and all window events
* <head> (just for completeness with <body>)

IIRC the existing elements have gained a few new attributes with HTML5, but I haven't addressed that.

HTML5

This implementation adds many new HTML5 tags, with an emphasis on the tags that provide semantic context, e.g., <section>, <article>, <time>, etc.

> Implement HtmlParserWithSafelist that uses a standard jsoup Safelist for filtering.
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-4353
>                 URL: https://issues.apache.org/jira/browse/TIKA-4353
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Bear R Giles
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)