[ 
https://issues.apache.org/jira/browse/TIKA-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bear R Giles updated TIKA-4353:
-------------------------------
    Description: 
The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements 
and attributes. A user can copy and paste this file to create a custom parser, 
but it takes some effort to understand which changes are required. It is also 
a one-off effort: the work can't be reused elsewhere.

Since there is already a dependency on jsoup, a far better solution is to 
create a parser that accepts a Safelist instead of using a hardcoded list. This 
Safelist can be validated and used elsewhere and, perhaps more importantly, it 
makes the transition from a jsoup-based solution to a Tika-based solution much 
more transparent.

NOTE: a Safelist is a POJO and NOT limited to just the jsoup parser.
h2. Preliminary implementation

I have a preliminary implementation, but it is not yet ready for a 
proof-of-concept pull request.
h3. HtmlParserWithSafelist

This parser is a heavily stripped-down copy of the DefaultHtmlParser. All of 
its existing static elements have been removed and replaced with the 
appropriate calls to Safelist methods.

This parser also includes a few proposed improvements:
 * it captures 'unsafe' elements and attributes. This allows developers to 
fine-tune their own Safelist implementations
 * it adds optional support for the 'data-*' wildcard. These attributes are an 
HTML5 standard intended to eliminate ad hoc custom attributes
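
The two improvements above can be illustrated with a minimal, standalone sketch. It deliberately uses plain Java collections rather than the real jsoup Safelist or the proposed HtmlParserWithSafelist, whose code is not shown in this issue; the class name SafelistFilterSketch, its methods, and the "tag@attribute" notation for rejected attributes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical stand-in for the filtering step; this is not Tika or jsoup code. */
final class SafelistFilterSketch {
    private final Set<String> safeTags;
    private final Map<String, Set<String>> safeAttributes;
    private final boolean allowDataAttributes;
    // Everything that was filtered out, recorded as "tag" or "tag@attribute".
    private final List<String> rejected = new ArrayList<>();

    SafelistFilterSketch(Set<String> safeTags,
                         Map<String, Set<String>> safeAttributes,
                         boolean allowDataAttributes) {
        this.safeTags = safeTags;
        this.safeAttributes = safeAttributes;
        this.allowDataAttributes = allowDataAttributes;
    }

    /** Returns the surviving attributes, or an empty Optional if the tag itself is rejected. */
    Optional<Map<String, String>> filter(String tag, Map<String, String> attributes) {
        String tagName = tag.toLowerCase(Locale.ROOT);
        if (!safeTags.contains(tagName)) {
            rejected.add(tagName);
            return Optional.empty();
        }
        Set<String> allowed = safeAttributes.getOrDefault(tagName, Set.of());
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            // The optional wildcard: "data-" followed by at least one character.
            boolean dataAttribute = allowDataAttributes
                    && name.startsWith("data-")
                    && name.length() > "data-".length();
            if (allowed.contains(name) || dataAttribute) {
                kept.put(name, entry.getValue());
            } else {
                rejected.add(tagName + "@" + name);
            }
        }
        return Optional.of(kept);
    }

    /** A copy of everything rejected so far, in encounter order. */
    List<String> rejected() {
        return new ArrayList<>(rejected);
    }

    public static void main(String[] args) {
        SafelistFilterSketch filter = new SafelistFilterSketch(
                Set.of("a"), Map.of("a", Set.of("href")), true);
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("href", "https://example.org/");
        attributes.put("onclick", "track()");
        attributes.put("data-id", "7");
        System.out.println(filter.filter("a", attributes));
        System.out.println(filter.filter("script", Map.of()));
        System.out.println(filter.rejected());
    }
}
```

A developer could inspect the rejected list after a parse to decide which entries, if any, belong in their own Safelist.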

h3. DefaultHtmlSafelist

The jsoup Safelist already provides a few reference implementations, but they 
don't fit our needs. This class adds two more, DEFAULT and HTML5. In addition, 
it adds support for wildcard attributes beyond the "data-*" mentioned earlier.
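
As a rough illustration of what wildcard attribute support could look like, the sketch below matches attribute names against prefix patterns. It is an assumption about the approach, not the DefaultHtmlSafelist code: the class AttributeWildcardSketch is hypothetical, and "aria-*" is only an example pattern that this issue does not name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical prefix-wildcard matcher for attribute names such as "data-*". */
final class AttributeWildcardSketch {
    private final List<String> prefixes = new ArrayList<>();

    /** Each pattern must be a non-empty prefix followed by a single trailing '*'. */
    AttributeWildcardSketch(String... patterns) {
        for (String pattern : patterns) {
            if (pattern.length() < 2 || !pattern.endsWith("*")) {
                throw new IllegalArgumentException(
                        "expected a pattern such as \"data-*\": " + pattern);
            }
            String prefix = pattern.substring(0, pattern.length() - 1);
            prefixes.add(prefix.toLowerCase(Locale.ROOT));
        }
    }

    /** True if the name starts with a registered prefix and has at least one more character. */
    boolean matches(String attributeName) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        for (String prefix : prefixes) {
            if (name.length() > prefix.length() && name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With new AttributeWildcardSketch("data-*", "aria-*"), the names "data-id" and "aria-label" match, while a bare "data-" and "onclick" do not.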

*DEFAULT*

This implementation reproduces the existing behavior with a few improvements:
 * <source> (since it contains an external reference)
 * <form> (since "action" can be an embedded script)
 * <button> and <input> (since they have a "formaction" attribute)
 * all global attributes
 * all form_control, mouse, keyboard, and clipboard events
 * <body> and all window events
 * <head> (just for completeness with <body>)

IIRC the existing elements have gained a few new attributes with HTML5, but I 
haven't addressed that.

*HTML5*

This implementation adds many new HTML5 tags, with an emphasis on the tags 
that provide semantic context, e.g., <section>, <article>, and <time>.

> Implement HtmlParserWithSafelist that uses a standard jsoup Safelist for 
> filtering.
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-4353
>                 URL: https://issues.apache.org/jira/browse/TIKA-4353
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Bear R Giles
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
