Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-10 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2221002267 Will do. Thank you for the help. With the above `commons-io` suggestion, everything looks good now. Will be doing more testing. -- This is an automated message from the Apach

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-10 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2219959205 This is really getting off-topic, please post to the tika users mailing list (don't forget to subscribe) https://lists.apache.org/list.html?u...@tika.apache.org see bottom left or

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-09 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2219468516 It worked for me with small changes because your code isn't runnable: ``` Path input = Paths.get("samplepptx.pptx"); Writer writer = new OutputStreamWriter(Syste

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-09 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2218995629 Thank you. That worked, but I bumped into a new issue after working through a few other hiccups. I am trying to parse a ppt file. ``` import org.apache.tika.io.TikaInput

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-08 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2216338193 Try `TikaCoreProperties` instead.

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-08 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2216056172 I had to revert to `3.0.0-BETA` instead of `3.0.0-SNAPSHOT` due to dependencies in our code. I am running into this issue when I use the `BETA` version. ``` [2024-07-08T22

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-08 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2215277124 Thank you @tballison. This worked. However, many of our dependencies require the version to be a release. Any idea when the new Tika version will be released? Just so that we can put it in

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-08 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2214741939 It's not in Maven Central. Add this to your pom.xml: ``` <id>id1</id> <url>https://repository.apache.org/snapshots/</url>
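
A minimal sketch of what such a `pom.xml` fragment could look like, offered under assumptions because the snippet in the message above is cut off. Only the repository id `id1` and the URL come from that message. The `<snapshots>` and `<releases>` settings, and the choice of `tika-parsers-standard-package` at `3.0.0-SNAPSHOT` as the dependency (both names appear elsewhere in this thread), are illustrative rather than the author's exact configuration.

```xml
<!-- Assumed reconstruction; merge these elements into the <project> element of pom.xml -->
<repositories>
  <repository>
    <!-- id and url as given in the message above -->
    <id>id1</id>
    <url>https://repository.apache.org/snapshots/</url>
    <!-- Assumption: resolve only snapshot artifacts from this repository -->
    <snapshots><enabled>true</enabled></snapshots>
    <releases><enabled>false</enabled></releases>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <!-- Assumption: the snapshot version discussed in this thread -->
    <version>3.0.0-SNAPSHOT</version>
  </dependency>
</dependencies>
```

Snapshot artifacts are republished over time, so a build that depends on them may not be reproducible from one day to the next.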

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-08 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2214690992 But which repo should I point to in the `.pom` file? I tried using 3.0.0-SNAPSHOT or 3.0.0 in the `.pom` file but can't find it. ``` Could not find artifact org.apache.tika:tika-

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-07 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2212942831 There have been plans to do another alpha soon. Snapshots are here: https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/3.0.0-SNAPSHOT/

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-07 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2212556752 Thank you @THausherr. These errors are occurring for a bad PDF file. Even with these errors, which are ignored and repaired, we are able to process it fine now. Any idea when we plan t

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-05 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2211609822 There has been a complaint about the NISC18030.ttf font in the past: PDFBOX-5743, and I can see in the browser that I searched for it and found it at https://github.com/justrajdeep/fonts/b

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-05 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2211353100 Yeah, I see. There are more of these. Since an `IOException` is thrown, we are failing. Is there anything I can do to avoid these errors? Why are they occurring? ``` 23:32:5

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
THausherr commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207940611 Please look at the rest of the log output. IIRC this is a problem with `lastresortfont.otf` when the initial scanning is done. But that font is skipped and life continues.

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207526085 I was able to use Tika `3.0.0-BETA`, and `pdfbox` is at `3.0.2`. I am seeing this issue. Any ideas? Am I missing anything? ``` java.io.IOException: Invalid character cod

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207307712 I tried using `3.0.0-SNAPSHOT` in the `.pom` file but can't find it. ``` Could not find artifact org.apache.tika:tika-core:jar:3.0.0 in central (https://repo1.maven.org/maven2/)

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
tballison commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207177307 If you can pull from the Apache snapshots repo, you can grab it from there? https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-parsers-standard-package/3.0.0

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207164473 Sorry, I cannot use 3.0.0-BETA, I can only use 2.9.2. ``` <dependency> <groupId>org.apache.tika</groupId> <artifactId>tika-parsers-standard-package</artifactId> <version>2.9.2</version> </dependency> ``` Is there a way to use l

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
tballison commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207132252 Yes, it should have been in 3.0.0-BETA. How are you using it?

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-07-03 Thread via GitHub
kbachuHighSpot commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2207113861 This is great. Thanks for working on this. Is this released with the 3.0.0 that's in beta? I ask because it's listed here: https://tika.apache.org/3.0.0-BETA/index.html We are blocked o

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-05-07 Thread via GitHub
tballison commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2098906252 I just asked on our dev list. I'd like to get 3.x out soon. We need a beta2 release, though, I think.

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-05-07 Thread via GitHub
dsvensson commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2098585131 @tballison Will this be backported to Tika 2.x, or if not, how far off is Tika 3.x?

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2024-05-07 Thread via GitHub
danielstravito commented on PR #1473: URL: https://github.com/apache/tika/pull/1473#issuecomment-2098582675 @tballison Will this be backported to Tika 2.x, or if not, how far off is Tika 3.x?

Re: [PR] TIKA-3347 -- upgrade to PDFBox 3.x [tika]

2023-12-01 Thread via GitHub
tballison merged PR #1473: URL: https://github.com/apache/tika/pull/1473 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org