[ https://issues.apache.org/jira/browse/TIKA-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas DiPiazza updated TIKA-2805:
------------------------------------
    Description:
The tika's HTML parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.</div></noscript>
{code}
and will parse it:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.
{code}
Shouldn't it just ignore those sections and leave those out of the parse output?

  was:
The tika parser will take this:
{code:java}
<noscript><div class='noindex'>You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.</div></noscript>
{code}
and will parse it:
{code:java}
You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page.
{code}
Shouldn't it just ignore those sections and leave those out of the parse output?

> Should the HTML parser by default just ignore the <noscript> section?
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2805
>                 URL: https://issues.apache.org/jira/browse/TIKA-2805
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> The tika's HTML parser will take this:
> {code:java}
> <noscript><div class='noindex'>You may be trying to access this site from a
> secured browser on the server. Please enable scripts and reload this
> page.</div></noscript>
> {code}
> and will parse it:
> {code:java}
> You may be trying to access this site from a secured browser on the server.
> Please enable scripts and reload this page.
> {code}
> Shouldn't it just ignore those sections and leave those out of the parse
> output?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
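
For readers hitting the behaviour described in this issue, one possible workaround is to drop the <noscript> sections from the markup before it reaches the HTML parser. The following is only a minimal sketch under stated assumptions: it uses a plain regular expression from the Java standard library, so it will not cope with nested or malformed <noscript> tags, and the class name NoscriptFilter and the method stripNoscript are hypothetical names chosen for illustration. They are not part of Tika's API, and this is not how Tika itself handles the element.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: removes <noscript> blocks from raw HTML
// so that their fallback text never reaches the parser.
public class NoscriptFilter {

    // Matches an opening <noscript ...> tag, everything up to the nearest closing
    // </noscript> tag (reluctant match), case-insensitively and across line breaks.
    private static final Pattern NOSCRIPT = Pattern.compile(
            "<noscript\\b[^>]*>.*?</noscript\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input with every matched <noscript> block deleted.
    public static String stripNoscript(String html) {
        return NOSCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Visible</p><noscript><div class='noindex'>"
                + "Please enable scripts and reload this page.</div></noscript>";
        System.out.println(stripNoscript(html)); // prints "<p>Visible</p>"
    }
}
```

The filtered string can then be handed to whatever parsing step follows. A pre-filter like this keeps the decision outside the parser, which may be preferable while the default behaviour is still under discussion.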