[ https://issues.apache.org/jira/browse/TIKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865297#comment-17865297 ]
Tilman Hausherr commented on TIKA-4277:
---------------------------------------

To see what parameters are available and how to use them, do this:
{noformat}
java -jar tika-app-VERSION.jar --config=config.xml --dump-current-config
{noformat}
I get this:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="allowExtractionForAccessibility" type="bool">true</param>
        <param name="averageCharTolerance" type="float">0.3</param>
        <param name="catchIntermediateExceptions" type="bool">true</param>
        <param name="detectAngles" type="bool">true</param>
        <param name="dropThreshold" type="float">2.5</param>
        <param name="enableAutoSpace" type="bool">true</param>
        <param name="extractAcroFormContent" type="bool">true</param>
        <param name="extractActions" type="bool">false</param>
        <param name="extractAnnotationText" type="bool">true</param>
        <param name="extractBookmarksText" type="bool">true</param>
        <param name="extractFontNames" type="bool">false</param>
        <param name="extractIncrementalUpdateInfo" type="bool">false</param>
        <param name="extractInlineImageMetadataOnly" type="bool">false</param>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="extractMarkedContent" type="bool">false</param>
        <param name="extractUniqueInlineImagesOnly" type="bool">true</param>
        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
        <param name="imageStrategy" type="string">NONE</param>
        <param name="maxIncrementalUpdates" type="int">10</param>
        <param name="maxMainMemoryBytes" type="long">536870912</param>
        <param name="ocrDPI" type="int">300</param>
        <param name="ocrImageFormatName" type="string">png</param>
        <param name="ocrImageQuality" type="float">1.0</param>
        <param name="ocrImageType" type="string">GRAY</param>
        <param name="ocrRenderingStrategy" type="string">ALL</param>
        <param name="ocrStrategy" type="string">AUTO</param>
        <param name="ocrStrategyAuto" type="string">10,10</param>
        <param name="parseIncrementalUpdates" type="bool">false</param>
        <param name="setKCMS" type="bool">false</param>
        <param name="sortByPosition" type="bool">true</param>
        <param name="spacingTolerance" type="float">0.5</param>
        <param name="suppressDuplicateOverlappingText" type="bool">false</param>
        <param name="throwOnEncryptedPayload" type="bool">false</param>
      </params>
      <imageGraphicsEngineFactory class="org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory"/>
    </parser>
  </parsers>
</properties>
{code}

> PDF parse issue for text rotated
> --------------------------------
>
>                 Key: TIKA-4277
>                 URL: https://issues.apache.org/jira/browse/TIKA-4277
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-app, tika-server
>    Affects Versions: 3.0.0-BETA, 2.9.2
>            Reporter: ragebear
>            Priority: Major
>         Attachments: OtherPDFReader.png, sample2.pdf
>
>
> the incorrect result parsed by Tika and Tika Server 2.9.2 and 3.0beta
> The attached PDF cannot be correctly parsed by Tika 2.9.2 and 3.0beta, in
> the server version and the standalone.
> If the text is rotated 90 degrees, the parsed result will have a line break
> after each letter of a word. It happens to symbols, English letters, and CJK
> characters.
> In the server version, curl -g -T "sample2.pdf" [http://localhost:889/tika]
> --header "Accept: text/plain"
> In the standalone version, java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar"
> --text
> Both of the above deliver the incorrect result in the attached pdf.
> The output result is below
> i
> n
> s
> e
> r
> t
>
> t
> e
> x
> t
>
> p
> r
> o
> b
> l
> e
> m
> insert text problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)