[ https://issues.apache.org/jira/browse/TIKA-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370662#comment-17370662 ]
Tim Allison edited comment on TIKA-3450 at 6/28/21, 3:36 PM:
-------------------------------------------------------------

Starting the current snapshot with {{java -jar tika-server-1.27-SNAPSHOT.jar -spawnChild -JXmx512m}}, I get a successful parse with {{curl -T ~/Downloads/test\ large\ csv.csv http://localhost:9998/tika -H "Accept: application/json"}}. If I go down to -JXmx256m, I get an OOM. This is consistent with tika-app.

I am able to replicate the behavior you're seeing if I add language detection with the Optimaize language detector. While trying to reproduce this, I found that when I hand the full extracted string to Optimaize, it doesn't truncate at the usual 20k length; instead, it uses the full string. It turns out the length checks only work if you add the string bit by bit.

When I add:
{noformat}
// cap what we hand to the detector at Optimaize's nominal 20k limit
String txt = contentHandler.toString();
String langDetect = txt.length() > 20000 ? txt.substring(0, 20000) : txt;
LanguageResult r = detector.detect(langDetect);
{noformat}
I'm able to parse the file with 512m.

If this is actually the cause, I'd rule it a bug in Optimaize and/or our LanguageDetector.
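For anyone who wants to try the same workaround outside the server, here is a minimal, self-contained sketch. The class, method, and constant names below are illustrative only; {{OptimaizeLangDetector}}, {{LanguageDetector}}, and {{LanguageResult}} are the tika-langdetect 1.x classes used above.
{noformat}
import java.io.IOException;

import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class TruncatedLangDetect {

    // Illustrative cap, mirroring the ~20k-char limit Optimaize is supposed
    // to enforce when text is added incrementally.
    private static final int MAX_LANG_DETECT_CHARS = 20000;

    // Truncate the extracted text *before* handing it to the detector, so
    // the detector never buffers the full multi-megabyte extract.
    public static LanguageResult detectTruncated(String extractedText)
            throws IOException {
        String langDetectText = extractedText.length() > MAX_LANG_DETECT_CHARS
                ? extractedText.substring(0, MAX_LANG_DETECT_CHARS)
                : extractedText;

        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        return detector.detect(langDetectText);
    }
}
{noformat}
Truncating up front sidesteps the incremental add-text path entirely, so it should keep memory bounded whether or not Optimaize's own length checks fire.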
> CSV parsing consumes an exorbitant amount of memory/heap space when using
> server JSON endpoint
> ----------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3450
>                 URL: https://issues.apache.org/jira/browse/TIKA-3450
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.27
>            Reporter: Carey Halton
>            Priority: Major
>         Attachments: test large csv.zip
>
> We've observed an issue where parsing large CSV files takes a ridiculous amount of heap space and memory, seemingly unbounded (we haven't found a heap size at which they succeed, and we've gone up to 4 GB of heap space). It reproduces when we use Tika server's new JSON extraction endpoint, both with the default TextAndCSVParser and when we configure it to use the older TXTParser instead. For some reason it doesn't reproduce with a non-JSON extraction endpoint (though the request still takes a few minutes in that case), so I wonder if there is some recursion issue going on (didn't try with rmeta). In both cases it seems like the large file is being held as multiple character arrays in memory at once, and there is also an extremely large object array that contains each character as a string, and then some.
>
> I have a test file that reproduces the issue, but it looks like Jira won't let me upload it (it is just under the 60 MB limit, but I get an "An internal error has occurred." message). I also have sample repro heap dumps that I can share (one for each parser setup), but they are definitely too large to upload to Jira (each is approximately 4 GB). Let me know if there is a way I can easily share these to help showcase the issue.