dpol1 commented on code in PR #1943:
URL: https://github.com/apache/stormcrawler/pull/1943#discussion_r3418782414


##########
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java:
##########
@@ -272,79 +280,91 @@ public void execute(Tuple tuple) {
         try {
             String html = Charset.forName(charset).decode(ByteBuffer.wrap(content)).toString();
 
-            jsoupDoc = Parser.htmlParser().parseInput(html, url);
-
-            if (!robotsMetaSkip) {
-                // extracts the robots directives from the meta tags
-                Element robotelement = jsoupDoc.selectFirst("meta[name~=(?i)robots][content]");
-                if (robotelement != null) {
-                    robotsTags.extractMetaTags(robotelement.attr("content"));
-                }
-            }
-
-            // store a normalised representation in metadata
-            // so that the indexer is aware of it
-            robotsTags.normaliseToMetadata(metadata);
-
-            // do not extract the links if no follow has been set
-            // and we are in strict mode
-            if (robotsTags.isNoFollow() && robotsNoFollowStrict) {
+            if (isPlainText) {
+                // no markup to parse: the decoded content is the text itself and
+                // there are no outlinks. An empty shell document is kept so that
+                // the downstream redirection check and parse filters still work.
+                jsoupDoc = org.jsoup.nodes.Document.createShell(url);
                 slinks = new HashMap<>(0);
+                robotsTags.normaliseToMetadata(metadata);
+                text = html;

Review Comment:
     I'd go with Option A. A plain-text file is already its own text, which is 
the whole point of #466, and the test asserts it's stored verbatim, newlines 
and all. Option B runs the content through `appendNormalisedText`, which 
collapses whitespace, so logs, source files and tabular dumps lose their 
layout. It would also force relaxing that test. For the one format where 
whitespace is the content, that's a poor trade.
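
     To make the trade-off concrete, here is a minimal sketch. The `collapse`
helper is a hypothetical stand-in that only approximates whitespace
normalisation with a regex; it is not the code behind `appendNormalisedText`,
and the sample input is made up.

```java
final class WhitespaceDemo {

    // Hypothetical stand-in for whitespace normalisation: every run of
    // whitespace (spaces, tabs, newlines) becomes a single space.
    static String collapse(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // A small tab-separated dump, the kind of file where layout matters.
        String dump = "host\tstatus\nexample.org\t200\nexample.net\t404\n";

        // Option B-like behaviour: rows and columns run together on one line.
        System.out.println(collapse(dump));

        // Option A-like behaviour: the text is stored exactly as decoded.
        System.out.print(dump);
    }
}
```

     With the collapsed form there is no way to tell where one row ends and
the next begins, which is the layout loss described above.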
   
     The downside of Option A, re-reading the two config keys, is smaller than 
it looks. For plain text only `no.text` and `skip.after` have any effect; 
`include.pattern` and `exclude.tags` need markup, so there's nothing else to 
honor. If the custom `textextractor.class` case still bothers you, a cleaner 
option is a `text(String)` overload on the extractor that applies those two 
limits without normalizing, so both code paths share one implementation. I'd 
treat that as a follow-up, not part of this PR.
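
     A rough sketch of what such an overload could look like, for the
follow-up only. `PlainTextLimits` is a hypothetical standalone class, not the
existing extractor, and I'm assuming a negative `skip.after` means "no limit"
and that the limit counts characters rather than bytes.

```java
final class PlainTextLimits {

    private final boolean noText;  // assumed to mirror `no.text`
    private final int skipAfter;   // assumed to mirror `skip.after`; negative = unlimited

    PlainTextLimits(boolean noText, int skipAfter) {
        this.noText = noText;
        this.skipAfter = skipAfter;
    }

    // Applies the two limits to already-decoded text without touching
    // its whitespace.
    String text(String raw) {
        if (noText || raw == null) {
            return "";
        }
        if (skipAfter < 0 || raw.length() <= skipAfter) {
            return raw;
        }
        int end = skipAfter;
        // Avoid cutting a surrogate pair in half at the truncation point.
        if (end > 0 && Character.isHighSurrogate(raw.charAt(end - 1))) {
            end--;
        }
        return raw.substring(0, end);
    }
}
```

     With this shape, `new PlainTextLimits(false, 5).text("ab\ncdef")` keeps
the newline and returns the first five characters, which is the behaviour a
`skip.after` truncation test would pin down.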
   
     One open question: should plain text use the `TextExtractor` knobs at all, 
or is `http.content.limit` the more honest bound for raw bytes? Either way, 
Option A plus a test for the `skip.after` truncation works for me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
