Re: How to increase ZIP bomb maximum depth

2019-08-26 Thread Jukka Zitting
Hi, I wonder if we should just increase the default thresholds to allow deeper nesting before the exception gets thrown. The defaults should be tuned to make the false-positive rate as low as possible without opening the door for false negatives that could result in denial-of-service attacks. The pa

Re: Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)

2018-05-17 Thread Jukka Zitting
Hi, Based on the Xerces discussion it sounds like using a pool of parsers would be the best approach. Best, Jukka On Thu, May 17, 2018 at 11:51 AM, Sebastian Nagel wrote: > Hi, > > two questions regarding thread-safety and locking in Tika's MIME type > detectors > while investigating global l
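
To illustrate the pooling idea mentioned above, here is a minimal sketch using only JDK classes. The class name ParserPool and the fixed pool size are assumptions made for this example; this is not how Tika itself implements its parser handling.

```java
import java.io.ByteArrayInputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical fixed-size pool of XML parsers (illustration only, not Tika code). */
final class ParserPool {
    private final BlockingQueue<DocumentBuilder> pool;

    ParserPool(int size) throws Exception {
        pool = new ArrayBlockingQueue<>(size);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        for (int i = 0; i < size; i++) {
            pool.add(factory.newDocumentBuilder());
        }
    }

    /** Borrows a parser, parses the bytes, and always hands the parser back. */
    Document parse(byte[] xml) throws Exception {
        DocumentBuilder builder = pool.take(); // blocks while all parsers are in use
        try {
            return builder.parse(new ByteArrayInputStream(xml));
        } finally {
            pool.add(builder); // return the parser even if parsing failed
        }
    }

    /** Number of parsers currently idle in the pool. */
    int available() {
        return pool.size();
    }
}
```

Each thread works with its own borrowed parser, so no global lock is held while a document is being parsed; threads only wait when the pool is empty.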

Re: disable extraction of images

2016-04-13 Thread Jukka Zitting
, but IIRC such a feature doesn't currently exist in Tika or the underlying PDFBox library. Best, Jukka Zitting On Wed, Apr 13, 2016 at 8:52 AM ron.vandenbranden < ron.vandenbran...@kantl.be> wrote: > Hi again, > > > On 13/04/2016 13:18, ron.vandenbranden wrote: > > >

Re: MagicDetector does not enforce mark/reset support in inputstream

2015-06-19 Thread Jukka Zitting
You can make the test pass by changing the assertion to: assertTrue(IOUtils.contentEquals(stream, originalStream)); Wrapping a stream with TikaInputStream doesn't magically add mark/reset support to the original stream; only the wrapper instance has this feature.
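
The point about wrappers can be shown with plain JDK streams. In this sketch BufferedInputStream stands in for TikaInputStream, and MarkResetDemo with its two helpers is a hypothetical name introduced for the example (readNBytes needs Java 11 or later).

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustration only: mark/reset lives on the wrapper, not on the wrapped stream. */
final class MarkResetDemo {

    /** Returns a stream that, like many file or network streams, lacks mark/reset. */
    static InputStream withoutMarkSupport(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public synchronized void mark(int readLimit) { }
            @Override public synchronized void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Reads the first n bytes and rewinds, the way a magic-byte detector would. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        try {
            return in.readNBytes(n); // Java 11+
        } finally {
            in.reset(); // rewind the wrapper; the original stream stays consumed
        }
    }
}
```

After peek(new BufferedInputStream(original), 4) the wrapper is back at its start, but the bytes have already been pulled out of the original stream, which is why comparing against the original afterwards does not work.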

Re: MagicDetector does not enforce mark/reset support in inputstream

2015-06-18 Thread Jukka Zitting
eature. As suggested by Nick, an easy way to meet that API contract is for the client to wrap a stream into TikaInputStream before passing it to the detector. [1] http://tika.apache.org/1.8/api/org/apache/tika/detect/Detector.html#detect(java.io.InputStream,%20org.apache.tika.metadata.Metadata)

Re: xml vs html parser

2015-06-16 Thread Jukka Zitting
t, so the lack of them coupled with even some fairly weak SGML detection signals (stuff like upper case element names?) might be enough to get significant improvements in this area. BR, Jukka Zitting

Re: How to identify binary content ?

2014-08-14 Thread Jukka Zitting
er.class, new IdentityHtmlMapper()); -- Jukka Zitting

Re: IOException should be TikaException?

2014-07-01 Thread Jukka Zitting
ot leverage that functionality, mostly because the underlying parsers should already throw the correct exceptions. Thus it would be better to fix this in PDFParser instead of AutoDetectParser. BR, Jukka Zitting

Re: IOException should be TikaException?

2014-07-01 Thread Jukka Zitting
ggedInputStream(stream); try { parse(tagged); } catch (IOException e) { tagged.throwIfCauseOf(e); // throws IOException if from stream throw new TikaException("Parse error", e); } [1] http://tika.apache.org/1.0/api/org/apache/tika/io/TaggedInputStream.html BR, Jukka Zitting
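
For readers without the Tika sources at hand, the tagging idea behind the snippet above can be sketched with JDK classes alone. TaggingStream and TaggedException are hypothetical names for this simplified stand-in; Tika's own TaggedInputStream is documented at the link above and may differ in detail.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Simplified, hypothetical stand-in for the tagging idea (not Tika's class). */
final class TaggingStream extends FilterInputStream {

    /** IOException that remembers which wrapper raised it. */
    static final class TaggedException extends IOException {
        final Object tag;

        TaggedException(IOException cause, Object tag) {
            super(cause.getMessage(), cause);
            this.tag = tag;
        }
    }

    TaggingStream(InputStream in) {
        super(in);
    }

    @Override public int read() throws IOException {
        try {
            return super.read();
        } catch (IOException e) {
            throw new TaggedException(e, this); // mark as coming from this stream
        }
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        try {
            return super.read(b, off, len);
        } catch (IOException e) {
            throw new TaggedException(e, this);
        }
    }

    /** Rethrows the original IOException only if this stream produced it. */
    void throwIfCauseOf(Exception e) throws IOException {
        if (e instanceof TaggedException && ((TaggedException) e).tag == this) {
            throw (IOException) e.getCause();
        }
    }
}
```

A caller can then treat any IOException that was not raised by the tagged stream as a parse problem and wrap it in a more specific exception, as the snippet does with TikaException.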

Re: Parsers, DefaultConfig and such

2014-03-14 Thread Jukka Zitting
Hi, On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll wrote: > On Mar 13, 2014, at 3:53 PM, Jukka Zitting wrote: >> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll wrote: >>> But why would that test fail in the Tika dev environment? >> >> The Defa

Re: Parsers, DefaultConfig and such

2014-03-13 Thread Jukka Zitting
specify the content type in the input metadata, it won't know how to parse the document. BR, Jukka Zitting

Re: Unsubscribe

2013-09-26 Thread Jukka Zitting
Hi, To unsubscribe from this list, send a message to user-unsurbscr...@tika.apache.org. See http://tika.apache.org/mail-lists.html for more details. BR, Jukka Zitting

Re: Not Parsing HTML Elements with a class

2013-04-09 Thread Jukka Zitting
would otherwise be swallowed by the DefaultHtmlMapper strategy. You can write a custom ContentHandler class that detects the "donotparse" attributes and skips all content within such elements. BR, Jukka Zitting
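
A rough sketch of such a handler, using only the JDK's SAX classes, could look like the following. SkippingTextHandler and its extract helper are names invented for this example, and the sketch assumes the class attribute is exactly "donotparse" and that the attribute actually reaches the handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects text but ignores everything inside class="donotparse" elements. */
final class SkippingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside a skipped element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start skipping at a marked element and keep counting nested elements.
        if (skipDepth > 0 || "donotparse".equals(atts.getValue("class"))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    /** Convenience helper for trying the handler on a well-formed XHTML string. */
    static String extract(String xhtml) throws Exception {
        SkippingTextHandler handler = new SkippingTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }
}
```

The depth counter is what makes nested markup inside a skipped element work: text is appended again only once the end tag matching the marked element has been seen.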

Re: Not Parsing HTML Elements with a class

2013-04-08 Thread Jukka Zitting
per class that implements the class="donotparse" strategy you describe. This approach requires changes in Tika, so you might want to consider submitting a patch of your (ideally backwards-compatible) changes. BR, Jukka Zitting

Re: Releasing TikaInputStream resources

2013-04-01 Thread Jukka Zitting
? The code that instantiates the TikaInputStream should also take care of disposing it properly. If that happens, you shouldn't experience the filling up of the /tmp space that you described. BR, Jukka Zitting

Re: Releasing TikaInputStream resources

2013-03-28 Thread Jukka Zitting
a.io.InputStream, org.apache.tika.io.TemporaryResources) BR, Jukka Zitting

Re: Issue Using Tika to Parse Sling Node Files

2013-02-18 Thread Jukka Zitting
ing, do you have just tika-core deployed (AFAIUI that's the default with Sling)? The core bundle doesn't contain any parser components, so it won't be able to extract text from any documents. Deploying tika-bundle along with core should fix that. BR, Jukka Zitting

Re: Issue Using Tika to Parse Sling Node Files

2013-02-18 Thread Jukka Zitting
xt > contained within a document, which will be returned as a string. Currently, > I have my program set up in the following way: Have you tried: new Tika().parseToString(node.getBinary().getStream()) That should cover your use case and be much simpler than what you're now doing. BR, Jukka Zitting

Re: PDF parse failing to capture entire text

2013-01-11 Thread Jukka Zitting
't (for example if it's a scanned image), then there's little we can do. BR, Jukka Zitting

Re: Tika with solrCloud

2012-12-10 Thread Jukka Zitting
ent. We should probably also make Tika degrade more gracefully when a particular encoding detector is not present. [1] http://code.google.com/p/juniversalchardet/ [2] http://search.maven.org/#artifactdetails%7Ccom.googlecode.juniversalchardet%7Cjuniversalchardet%7C1.0.3%7Cjar BR, Jukka Zitting

Re: Using ArticleExtractor from BoilerPipe in Apache Tika

2012-12-07 Thread Jukka Zitting
lers like that. Assuming your textBuffer is a Writer instance, you could rather try replacing handler1 with something like this: ContentHandler handler1 = new WriteOutContentHandler(textBuffer); BR, Jukka Zitting

Re: Problem detecting Microsoft Office formats from InputStream

2012-09-23 Thread Jukka Zitting
ng file. The MS Office detectors (and a few other features in Tika) rely on that functionality, and thus won't give as accurate results when given just a plain InputStream instance. BR, Jukka Zitting

Re: Failing to detect SJIS

2012-09-03 Thread Jukka Zitting
eports generated by Maven aren't too useful (some are even misleading), which is why we're not including links to them in the site template. If there are individual reports (like the SCM page) that do make sense, then it would be a good idea to selectively add that to the template. BR, Jukka Zitting

Re: Failing to detect SJIS

2012-09-03 Thread Jukka Zitting
ect < shiftjs.txt # look only at the byte stream application/octet-stream $ java -jar tika-app.jar --detect shiftjs.txt # Give the file name with .txt ending as a type hint text/plain $ java -jar tika-app.jar --text shiftjs.txt # Check that the encoding is correctly detected 電子商取引(エレクトロニックコマース)、オンライン [...] Yes! BR, Jukka Zitting

Re: Article and section tags

2012-08-30 Thread Jukka Zitting
solution so how should it be solved? I think the ideal solution would be to have these changes included directly in TagSoup. BR, Jukka Zitting

Re: Logging in Tika

2012-08-27 Thread Jukka Zitting
t use or require any specific logging framework, but some of the parser libraries do, so in a typical Tika deployment you'd already have at least the Commons Logging, SLF4J and JUL interfaces available for logging. BR, Jukka Zitting

Interest for Tika at ApacheCon Europe

2012-08-03 Thread Jukka Zitting
ar in Tika that you'd like to hear more about. [1] http://www.apachecon.eu/ BR, Jukka Zitting

Re: Zip bomb with BoilepipeContentHandler

2012-07-30 Thread Jukka Zitting
to use BoilerPipe on top of instead of inside AutoDetectParser. BR, Jukka Zitting

Re: Charset detection

2012-07-25 Thread Jukka Zitting
lasses. That way code that for example checks the type detection result against something like "text/plain" won't start failing with a Tika version that might decide to qualify the type with "text/plain; charset=UTF-8" or to return a more detailed media type like "text/x-java-source". BR, Jukka Zitting

Re: Charset detection

2012-07-25 Thread Jukka Zitting
run charset detection already earlier at that point? BR, Jukka Zitting

Re: tika-app-1.2.jar in server mode not responding (windows)

2012-07-22 Thread Jukka Zitting
ision=1066132 [2] http://svn.apache.org/viewvc?view=revision&revision=1091833 BR, Jukka Zitting

Re: tika-app-1.2.jar in server mode not responding (windows)

2012-07-22 Thread Jukka Zitting
Hi, On Sun, Jul 22, 2012 at 12:34 PM, Jukka Zitting wrote: > On Sun, Jul 22, 2012 at 2:23 AM, Oliver Steinau > wrote: >> However still no success with the tika-app. Here's what I tried (bear >> with me, I'm on a windows system...) > > The problem soun

Re: tika-app-1.2.jar in server mode not responding (windows)

2012-07-22 Thread Jukka Zitting
the end of stream from the client. BR, Jukka Zitting

Re: using tika with eclipse

2012-07-16 Thread Jukka Zitting
echanism is used to load services. And in any case the static service loading is a fairly cheap operation that's typically only done once during the lifetime of an application or a bundle. BR, Jukka Zitting

Re: Surpluss whitespace in outlink anchors not collapsed

2012-07-09 Thread Jukka Zitting
ectly in all such cases. > Do we have to remove surpluss whitespace in Nutch ourselves? I think that's the easiest solution here. BR, Jukka Zitting

Re: using tika with eclipse

2012-07-06 Thread Jukka Zitting
ethod of obtaining access to the Detector and Parser > involves something like this in your own bundles activator: The reason why we use ServiceTrackers instead is that we want to support deployments where new parser and detector services can be added or removed dynamically from the running system. BR, Jukka Zitting

Re: using tika with eclipse

2012-07-06 Thread Jukka Zitting
es no output, despite the file containing text. > Tika tika = new Tika(); > System.out.print(tika.parseToString(new FileInputStream(xmlFile))); See the BundleIT test case inside the tika-bundle component. That's a pretty similar piece of code that works fine in an OSGi environment. BR, Jukka Zitting

Re: using tika with eclipse

2012-07-06 Thread Jukka Zitting
Hi, On Fri, Jul 6, 2012 at 5:00 PM, Kevin Milburn wrote: > On 2012/07/05 18:22, Jukka Zitting wrote: >> upgrade to the latest 1.2 SNAPSHOT where declarative services is no longer >> needed (see https://issues.apache.org/jira/browse/TIKA-896). > > I've built and installe

Re: using tika with eclipse

2012-07-05 Thread Jukka Zitting
larative services is no longer needed (see https://issues.apache.org/jira/browse/TIKA-896). BR, Jukka Zitting

Re: Server mode documentation?

2012-07-01 Thread Jukka Zitting
st once you've already loaded the Tika classes to memory. The server mode is typically more interesting for non-Java clients that face the question of either executing tika-app separately for each document or accessing an already running server process. BR, Jukka Zitting

Re: Server mode documentation?

2012-07-01 Thread Jukka Zitting
e of the tika-app simply parses documents sent through a network connection programmatically or with a tool like netcat [1] and responds with the parse output as governed by the rest of the tika-app command line options. [1] http://netcat.sourceforge.net/ BR, Jukka Zitting

Re: Server mode documentation?

2012-07-01 Thread Jukka Zitting
system performance depends on your deployment details. BR, Jukka Zitting

Re: Server mode documentation?

2012-07-01 Thread Jukka Zitting
n hour or so as the CI build picks up that revision. BR, Jukka Zitting

Re: Tika doesn't parse any text from a specific

2012-06-25 Thread Jukka Zitting
OCR tooling. BR, Jukka Zitting

Re: ForkParser and Metadata

2012-05-01 Thread Jukka Zitting
Hi, Currently the ForkParser doesn't return metadata, though adding that feature shouldn't be too difficult. My original use case didn't need metadata, so I never implemented that bit. Jukka Zitting 1.5.2012 19.26 "Michael McCandless" wrote: > Does anyone know

Re: Problem detecting XML

2012-04-17 Thread Jukka Zitting
ar tika-app-1.1.jar --detect sample_fixed.wde java -jar tika-app-1.1.jar --detect < sample_fixed.wde BR, Jukka Zitting

Re: Problem detecting XML

2012-04-17 Thread Jukka Zitting
also from just the byte stream. A typical reason why an XML document is detected as text/plain is if it's actually not valid XML, either because of some well-formedness issue (unclosed tags) or because of some extra characters like suggested by Nick. BR, Jukka Zitting

Re: Determine dependency jars for a parser

2012-03-14 Thread Jukka Zitting
nt parser classes to determine which libraries they're using. Or try removing dependencies until the parser you're interested in no longer works. BR, Jukka Zitting

Re: Tika XML Beans version dependency

2012-01-26 Thread Jukka Zitting
cause trouble for POI. Or you can just try upgrading the dependency and see if it works. :-) BR, Jukka Zitting

Re: File Content Type Detection

2012-01-26 Thread Jukka Zitting
mance reasons the container detection mechanism is skipped. Using new Tika().detect(new File(name)) takes care of all these details for you, which is why it's the recommended way to do type detection unless you explicitly need direct access to the lower-level functionality in Tika. BR, Jukka Zitting

Re: trouble with last character "?" whn using Mp3Parser metadata.get()

2012-01-20 Thread Jukka Zitting
have trouble accessing the central Maven repository. Do you have some firewall or HTTP proxy (perhaps a local Maven repository manager) that could be blocking your access? Try seeing if you can access http://repo1.maven.org/maven2/org/apache/apache/10/ directly in your browser. BR, Jukka Zitting

Re: Extract span tag

2012-01-05 Thread Jukka Zitting
Mapper.class, IdentityHtmlMapper.INSTANCE); Parser parser = ...; parser.parse(..., context); BR, Jukka Zitting

Re: parsers implementations for media files (mpeg, flv, webm)

2012-01-01 Thread Jukka Zitting
rom a contributor to a committer works at Apache. BR, Jukka Zitting

Re: Body of Outlook msg files

2011-12-07 Thread Jukka Zitting
e. [1] https://issues.apache.org/jira/browse/TIKA BR, Jukka Zitting

Re: Constraining Tika's memory usage (using ForkParser possibly?)

2011-12-02 Thread Jukka Zitting
ntain most of the metadata entries normally returned in the Metadata object." [1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html [2] http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392 BR, Jukka Zitting

Re: Tika returning nulls

2011-11-28 Thread Jukka Zitting
web site for the Tika version you're using. BR, Jukka Zitting

Re: Tika server mode? Protocol documentation?

2011-10-26 Thread Jukka Zitting
print. [Disclosure: I'm one of the authors] :-) Seriously though, contributions to documentation are very much welcome. BR, Jukka Zitting

Re: Resolving of relative URL's

2011-09-20 Thread Jukka Zitting
entering such a trap. Instead the crawler should employ heuristics like maximum recursion depth, etc. to avoid such problems. BR, Jukka Zitting

Re: Resolving of relative URL's

2011-09-12 Thread Jukka Zitting
so the next wrong link is resolved as > http://example.org/content/wrong-link/wrong-link/.. > > An endless nightmare for a crawler :) How would not resolving the links in Tika help in this case? To crawl the site, the crawler would in any case have to resolve the links, and come up with the exact same resolved URLs. BR, Jukka Zitting

Re: Resolving of relative URL's

2011-09-12 Thread Jukka Zitting
a looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY metadata keys for the default base URL. If neither is present and there is no <base> element, then URLs in the document will not be resolved. BR, Jukka Zitting
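
As a small illustration of the rule described above (resolve against a base URL when one is known, otherwise leave the link alone), here is a sketch built on java.net.URI. LinkResolver is a hypothetical helper for this example, not part of Tika.

```java
import java.net.URI;

/** Illustration only: resolve links against a base URL when one is available. */
final class LinkResolver {

    /** Returns href resolved against base, or href unchanged if no base is known. */
    static String resolve(String base, String href) {
        if (base == null) {
            return href; // nothing to resolve against, keep the link as written
        }
        return URI.create(base).resolve(href).toString();
    }
}
```

With a base of http://example.org/content/ the relative link wrong-link/ resolves to http://example.org/content/wrong-link/, which is the same URL a crawler would compute on its own.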

Re: Closing streams (Was: Tika leaves files open)

2011-09-01 Thread Jukka Zitting
onclusively fixed the problem that started this discussion. See TIKA-701 and the related commits for details. BR, Jukka Zitting

Re: Closing streams (Was: Tika leaves files open)

2011-09-01 Thread Jukka Zitting
ption" I think it's OK? > > But maybe we can beef up its javadocs a bit, saying "NOTE: unlike all > other APIs parsing from an InputStream, this API closes the incoming > InputStream for you for convenience" or something? I added something along these lines in revision 1164049. BR, Jukka Zitting

Re: Tika leaves files open

2011-09-01 Thread Jukka Zitting
n out in case where close() fails when no other exception has been thrown. Instead of one exception masking another, you'd have no exceptions masking one! BR, Jukka Zitting

Re: Tika leaves files open

2011-08-31 Thread Jukka Zitting
der the problem rather theoretical and would rather opt for cleaner code that avoids the extra constructs. BR, Jukka Zitting

Re: Tika leaves files open

2011-08-31 Thread Jukka Zitting
Hi, On Tue, Aug 30, 2011 at 11:19 PM, Jukka Zitting wrote: > Yes, I think you're right. I believe the problem here is the > openContainer field within TikaInputStream where the container-aware > type detection code stores the already opened container (in this case > an NPOIFS

Re: Closing streams (Was: Tika leaves files open)

2011-08-31 Thread Jukka Zitting
Hi Mark, On Wed, Aug 31, 2011 at 5:31 PM, Mark Kerzner wrote: > I used this statement > [...] > but still got many deleted files left opened. The problem is not with your code, it's what happens inside Tika. BR, Jukka Zitting

Re: Tika leaves files open

2011-08-30 Thread Jukka Zitting
iles mechanism to a more generic TemporaryResources class that could also take care of properly disposing of non-file resources associated with a TikaInputStream instance. BR, Jukka Zitting

Closing streams (Was: Tika leaves files open)

2011-08-30 Thread Jukka Zitting
ng call is made, it makes more sense for the parseToString() method to take care of closing the stream. The result is that the above code can be reduced to: return tika.parseToString(..., ...); BR, Jukka Zitting

Re: MediaType detection doesn't return concrete media types

2011-08-21 Thread Jukka Zitting
lly registered at IANA. The alias settings in the mimetypes file allow Tika to correctly detect such aliases and to automatically map them to the official type name. > And how to get the iana.org mime-type name instead of sub-class-of type name ? See above. [1] https://issues.apache.org/ji

Re: Tika 0.8 failure rates

2011-08-13 Thread Jukka Zitting
rself, you need to checkout them from version control. See [1] for more instructions on getting started and [2] on how to checkout the latest source tree. [1] http://tika.apache.org/0.9/gettingstarted.html [2] http://tika.apache.org/source-repository.html BR, Jukka Zitting

Re: Tika 0.8 failure rates

2011-08-12 Thread Jukka Zitting
're seeing and details of your build environment (output of "mvn --version" is pretty good). [1] https://repository.apache.org/content/groups/snapshots-group/org/apache/tika/tika-app/. BR, Jukka Zitting

Re: tag attributes problem.

2011-08-03 Thread Jukka Zitting
lMapper object properly through the ParseContext object to the parsing process? You should have a line of code like this somewhere: context.set(HtmlMapper.class, new MyCustomHtmlMapper()); BR, Jukka Zitting

Re: Excluding part of document when parsing.

2011-06-24 Thread Jukka Zitting
d such exclusion rules into the formats-specific parsers (for example it's easy to exclude headers and footers within the office format parsers). BR, Jukka Zitting

Re: non-West European languages support

2011-06-24 Thread Jukka Zitting
ed text to a Writer or an OutputStream, you can use the WriteOutContentHandler class for that. To explicitly specify the output encoding you want, use a java.io.OutputStreamWriter wrapper around your output stream. BR, Jukka Zitting
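
The OutputStreamWriter part of that advice can be tried without Tika. In this sketch a plain string stands in for the extracted text, and EncodedOutput is a hypothetical helper name used only here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

/** Illustration only: choose the output encoding explicitly via OutputStreamWriter. */
final class EncodedOutput {

    /** Writes text through an OutputStreamWriter using the given charset. */
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Without an explicit charset the writer would use the platform default.
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }
}
```

In a real setup the Writer created here would be handed to the content handler instead of being written to directly; the encoding choice works the same way.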

Re: non-West European languages support

2011-06-22 Thread Jukka Zitting
n.com/javase/technologies/core/basic/intl/faq.jsp#default-encoding BR, Jukka Zitting

Re: How to install Tika 0.9?

2011-06-19 Thread Jukka Zitting
nstallation depends on how you want to use Tika, e.g. as a Maven/Ant dependency, a standalone runnable jar, or something else. BR, Jukka Zitting

Re: java.lang.OutOfMemoryError: requested bytes for CHeapObj-new. Out of swap space?

2011-06-16 Thread Jukka Zitting
oduced in Tika 0.9 can be used to run text extraction in a background process so that a possible OOM error or even a JVM crash won't affect your application. BR, Jukka Zitting

Re: Minimum jar for detection

2011-05-25 Thread Jukka Zitting
t'll pick up the > container parsers dynamically for you Yes, you can just do: String mimeType = new Tika().detect(in); This will automatically find and use all the detectors available in the classpath, and will even take care of the TikaInputStream wrapping for you. BR, Jukka Zitting

Re: tag attributes problem.

2011-05-18 Thread Jukka Zitting
the details. [1] http://tika.apache.org/0.9/api/org/apache/tika/parser/html/HtmlMapper.html BR, Jukka Zitting

Re: Can't detected MimeType from FileInputStream

2011-05-04 Thread Jukka Zitting
On 05/04/2011 09:55 AM, Jukka Zitting wrote: On 05/04/2011 07:55 AM, Sascha Rodekamp wrote: Here is an extract of my code to show what I'm trying to do: InputStream is = new FileInputStream(file); Tika tika = new Tika(); tika.detect(is); It always throws a java.io.IOException:

Re: Can't detected MimeType from FileInputStream

2011-05-04 Thread Jukka Zitting
stack trace? -- Jukka Zitting

Re: is the method "detect" of instance "org.apache.tika.Tika" thread safe ?

2011-04-26 Thread Jukka Zitting
Hi, On 23.04.2011 06:02, Jin Xu wrote: > is the method "detect" of instance "org.apache.tika.Tika" thread > safe, please ? Yes, it is. -- Jukka Zitting

Re: Which mime type in ParseUtils.getStringContent() ?

2011-04-07 Thread Jukka Zitting
ails. [1] http://tika.apache.org/0.9/api/org/apache/tika/Tika.html BR, Jukka Zitting

Re: Illegal IOException from tika.parser

2011-04-05 Thread Jukka Zitting
to report such issues so they can be fixed in future versions. -- Jukka Zitting

RE: how do I specify different encodings with "--text --encoding="?

2011-03-21 Thread Jukka Zitting
Hi, From: Ilya Zavorin [mailto:izavo...@caci.com] > So how do I get #1 but with BOM? Try using --encoding=UnicodeLittle. See [1] for the available encoding names in Java 5. [1] http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html BR, Jukka Zitting

Re: extracting text from tiff files from jackrabbit

2011-03-14 Thread Jukka Zitting
. -- Jukka Zitting

Re: can Tika handle filenames with Unicode characters?

2011-03-09 Thread Jukka Zitting
erify this, try replacing the 'java -jar "C:\Code\ATEK\CT\apache-tika-0.9\tika-app-0.9.jar" --xml' part of your command line with a simpler command like just 'dir'. I predict that you'll get a "File Not Found" error message as a result. -- Jukka Zitting

Re: error while building Tika using Maven

2011-03-07 Thread Jukka Zitting
direct download from the Tika web site and perhaps include a note there to look at Maven Central for the other jars. -- Jukka Zitting

RE: problems parsing an xls spreadsheet

2010-12-22 Thread Jukka Zitting
pen when you serialize SAX events into a character or byte stream. The characters in the array passed in a characters() event are not escaped. BR, Jukka Zitting
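
The difference between what a characters() event carries and what a serializer writes can be checked with the JDK's own SAX parser. CharactersDemo and its two helpers are hypothetical names for this sketch, and the escape method covers only the three characters needed in element content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustration only: characters() delivers raw text; escaping is the serializer's job. */
final class CharactersDemo {

    /** Returns the text exactly as it arrives in characters() events. */
    static String textOf(String xml) throws Exception {
        StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length); // already unescaped by the parser
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen.toString();
    }

    /** Minimal escaping for element content, applied only when writing XML out. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

So a handler that builds a plain string should append the characters as they are, while a handler that writes XML or HTML has to escape them on the way out.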

RE: PDF text extracted without spaces

2010-12-05 Thread Jukka Zitting
an unfortunate regression that got included in the 0.8 release. See TIKA-548 [1] for the details. The problem is fixed in the latest 0.9-SNAPSHOT version, and we probably should cut a new release soon with this fix. [1] https://issues.apache.org/jira/browse/TIKA-548 BR, Jukka Zitting

RE: MimeType detection and fall back

2010-12-05 Thread Jukka Zitting
een somewhat neglected lately after the introduction of the Detector interface and the Tika.detect() convenience methods. I'd like to deprecate the getMimeType() methods once we have equivalent or better alternatives in the Tika façade class. BR, Jukka Zitting

RE: Upgrading Solr to Tika 0.8

2010-12-05 Thread Jukka Zitting
metadata fields to avoid confusion later on, so you may want to prepare for some extra upgrade work with 1.0. BR, Jukka Zitting

RE: [VOTE] Apache Tika 0.8 Release Candidate #1

2010-11-11 Thread Jukka Zitting
tch release when a fix is available. PS. We'll need to update the copyright year in NOTICE.txt. BR, Jukka Zitting

RE: TIKA + JNI problem

2010-11-05 Thread Jukka Zitting
simply put the full tika-app jar in your classpath instead of tika-core and tika-parsers. See the dependency notes in http://tika.apache.org/0.7/gettingstarted.html for more background. BR, Jukka Zitting

Re: Tika for mediawiki ?

2010-10-25 Thread Jukka Zitting
if you already have the markup of a wiki page available as a string or a character stream (for example if you're accessing the underlying database or JSON exports directly), then there may be no need to involve Tika in the process. BR, Jukka Zitting

Re: Tika for mediawiki ?

2010-10-24 Thread Jukka Zitting
POI or PDFBox directly to produce documents in a specific format. BR, Jukka Zitting

Re: Adding PDFs to Jackrabbit, class cast exceptions[SEC=UNCLASSIFIED]

2010-10-21 Thread Jukka Zitting
ou have a full stack trace of the error? BR, Jukka Zitting

Re: configuration file

2010-10-18 Thread Jukka Zitting
e sources of that class. If you need to tweak the set of parsers used by your application, a better alternative would probably be something like using the new AutoDetectParser(Parser... parsers) constructor available in the svn trunk (and in the upcoming 0.8 release). BR, Jukka Zitting

Re: Plugging in your own parser to override an existing

2010-10-08 Thread Jukka Zitting
l of them in tika-config? Yes. If a tika-config.xml has been specified (by calling a non-default TikaConfig constructor), then only the parser classes listed in that configuration file are loaded. BR, Jukka Zitting

Re: Compressed RTF / TNEF / LZFU

2010-10-03 Thread Jukka Zitting
gestion to implement this feature directly in POI. [1] http://www.iana.org/assignments/media-types/application/vnd.ms-tnef [2] http://www.apache.org/legal/resolved.html [3] http://www.apache.org/legal/resolved.html#criteria BR, Jukka Zitting
