[ 
https://issues.apache.org/jira/browse/TIKA-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-2805:
------------------------------------
    Description: 
Tika's HTML parser takes this input:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a 
secured browser on the server. Please enable scripts and reload this 
page.</div></noscript>{code}
and produces this output:
{code:java}
You may be trying to access this site from a secured browser on the server. 
Please enable scripts and reload this page.{code}
Shouldn't it ignore <noscript> sections by default and leave them out of the 
parse output? 

  was:
The tika parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a 
secured browser on the server. Please enable scripts and reload this 
page.</div></noscript>{code}
and will parse it:
{code:java}
You may be trying to access this site from a secured browser on the server. 
Please enable scripts and reload this page.{code}
Shouldn't it just ignore those sections and leave those out of the parse 
output? 


> Should the HTML parser by default just ignore the <noscript> section?
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2805
>                 URL: https://issues.apache.org/jira/browse/TIKA-2805
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Tika's HTML parser takes this input:
> {code:java}
> <noscript><div class='noindex'>You may be trying to access this site from a 
> secured browser on the server. Please enable scripts and reload this 
> page.</div></noscript>{code}
> and produces this output:
> {code:java}
> You may be trying to access this site from a secured browser on the server. 
> Please enable scripts and reload this page.{code}
> Shouldn't it ignore <noscript> sections by default and leave them out of the 
> parse output? 
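
Until the default behavior is settled, one possible workaround is to remove 
<noscript> blocks from the markup before it reaches the parser. The sketch 
below is only an illustration of that idea using the Java standard library: 
the class name NoscriptStripper and its strip method are hypothetical, they are 
not part of Tika, and a regular expression is not a robust HTML parser (it 
assumes well-formed, non-nested <noscript> elements).

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; NOT part of the Tika API.
public class NoscriptStripper {

    // Matches an opening <noscript ...> tag, its body, and the closing tag.
    // CASE_INSENSITIVE handles <NOSCRIPT>; DOTALL lets '.' span line breaks.
    private static final Pattern NOSCRIPT = Pattern.compile(
            "<noscript\\b[^>]*>.*?</noscript\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input with every <noscript> element removed.
    public static String strip(String html) {
        return NOSCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p>"
                + "<noscript><div class='noindex'>Please enable scripts "
                + "and reload this page.</div></noscript>"
                + "<p>World</p>";
        // Prints: <p>Hello</p><p>World</p>
        System.out.println(strip(html));
    }
}
```

The stripped string could then be handed to the parser in place of the 
original markup, assuming the caller has the document as text rather than as 
a raw stream.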



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
