[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178402#comment-16178402 ]
ASF GitHub Bot commented on TIKA-2400:
--------------------------------------

thammegowda commented on a change in pull request #208: Fix for TIKA-2400 Standardizing current Object Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#discussion_r140670497

##########
File path: tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
##########
@@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream, ContentHandler handler, Metad
             for (RecognisedObject object : objects) {
                 if (object instanceof CaptionObject) {
                     if (xhtmlStartVal == null) xhtmlStartVal = "captions";
-                    LOG.debug("Add {}", object);
-                    String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)",
-                            object.getLabel(), object.getConfidence());
-                    metadata.add(MD_KEY_IMG_CAP, mdValue);
-                    acceptedObjects.add(object);
+                    String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence());
+                    metadata.add(MD_KEY_IMG_CAP, mdVal);
                     xhtmlIds.add(String.valueOf(count++));
                 } else {
                     if (xhtmlStartVal == null) xhtmlStartVal = "objects";
-                    if (object.getConfidence() >= minConfidence) {
-                        count++;
-                        LOG.info("Add {}", object);
-                        String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)",
-                                object.getLabel(), object.getConfidence());
-                        metadata.add(MD_KEY_OBJ_REC, mdValue);
-                        acceptedObjects.add(object);
-                        xhtmlIds.add(object.getId());
-                        if (count >= topN) {
-                            break;
-                        }
-                    } else {
-                        LOG.warn("Object {} confidence {} less than min {}", object, object.getConfidence(), minConfidence);
-                    }
+                    String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence());
+                    metadata.add(MD_KEY_OBJ_REC, mdVal);
+                    xhtmlIds.add(object.getId());
                 }
+                LOG.info("Add {}", object);

Review comment:

> will be great if you can remove String concatenation from RecognisedObject.toString to use StringBuffer or String format

If you suggested this for a performance gain, let's take a deeper look.
`RecognisedObject.toString()` does not run in a loop. It's just one giant concatenation with `+`. I remember reading somewhere that the JDK can easily optimize such statements, but I couldn't find the source of this knowledge now, so I am giving you this test:

```java
class Main {

  // Times n iterations of concatenation with `+`.
  // Every operand here is a string literal, so javac folds the
  // whole expression into a single constant at compile time.
  public static long concat(int n) {
    long st = System.nanoTime();
    for (int i = 0; i < n; i++) {
      String s = "a" + "b" + "c" + "d" + "e" + "f" + "g" + "h" + "i" + "j" + "k";
    }
    return System.nanoTime() - st;
  }

  // Times n iterations of building the same string with a StringBuilder.
  public static long builder(int n) {
    long st = System.nanoTime();
    for (int i = 0; i < n; i++) {
      String s = new StringBuilder().append("a").append("b")
          .append("c").append("d").append("e").append("f")
          .append("g").append("h").append("i").append("j")
          .append("k").toString();
    }
    return System.nanoTime() - st;
  }

  public static void main(String[] args) {
    int n = 1_000_000;
    System.out.printf("Builder Time in ns : %10d\n", builder(n));
    System.out.printf(" Concat Time in ns : %10d\n", concat(n));
  }
}
```

I ran it on https://repl.it/languages/java

```
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
Builder Time in ns :   50614748
 Concat Time in ns :    2500615
```

See, it's in fact better!!
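A note on the benchmark above, offered as a sketch rather than a definitive measurement: in `concat` every operand is a string literal, which the Java Language Specification treats as a compile-time constant expression, so that loop probably measures little more than an assignment. A `toString()` that joins runtime values is a different case. On Java 8, javac translates `+` with non-constant operands into a `StringBuilder` chain, so the two forms should behave much the same there and a manual rewrite is unlikely to gain anything. The class `ConcatSketch`, its two methods, and the `Item{...}` output format below are hypothetical stand-ins, not Tika code, and the timing loop is naive (no warm-up; a harness such as JMH would be the proper tool).

```java
class ConcatSketch {

    // Hypothetical stand-in for a toString() that joins runtime values with `+`.
    static String viaPlus(String label, double confidence) {
        return "Item{label='" + label + "', confidence=" + confidence + "}";
    }

    // The same string assembled with an explicit StringBuilder.
    static String viaBuilder(String label, double confidence) {
        return new StringBuilder()
                .append("Item{label='").append(label)
                .append("', confidence=").append(confidence)
                .append("}")
                .toString();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long sink = 0; // use every result so the JIT cannot discard the loops

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += viaPlus("cat", i / (double) n).length();
        }
        long plusNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += viaBuilder("cat", i / (double) n).length();
        }
        long builderNs = System.nanoTime() - t1;

        System.out.printf("   Plus time in ns : %10d%n", plusNs);
        System.out.printf("Builder time in ns : %10d%n", builderNs);
        System.out.println("(sink = " + sink + ")");
    }
}
```

Both methods return identical strings, so any timing difference reflects only how the compiler and JIT handle each form on the JVM in use; the numbers will vary from run to run.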
> Standardizing current Object Recognition REST parsers
> -----------------------------------------------------
>
>                 Key: TIKA-2400
>                 URL: https://issues.apache.org/jira/browse/TIKA-2400
>             Project: Tika
>          Issue Type: Sub-task
>          Components: parser
>            Reporter: Thejan Wijesinghe
>            Priority: Minor
>             Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring current Object Recognition REST parsers,
> # Refactoring dockerfiles related to those parsers.
> # Moving the logic related to checking minimum confidence into servers

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)