Hi,
I wonder if we should just increase the default thresholds to allow deeper
nesting before the exception gets thrown. The defaults should be tuned to
make the false-positive rate as low as possible without opening the door
for false negatives that could result in denial-of-service attacks.
The pa
Hi,
Based on the Xerces discussion it sounds like using a pool of parsers
would be the best approach.
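
A minimal sketch of what such a pool could look like, using only JDK classes. The class name SaxParserPool and its methods are hypothetical and are not part of Tika or Xerces; the sketch only illustrates handing each thread its own parser instead of locking around a shared one.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical fixed-size pool of SAX parsers (illustration only). */
final class SaxParserPool {

    private final BlockingQueue<SAXParser> idle;

    SaxParserPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            for (int i = 0; i < size; i++) {
                idle.add(factory.newSAXParser());
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Unable to create SAX parsers", e);
        }
    }

    /** Borrows a parser, runs the task with it, and always returns it to the pool. */
    <T> T withParser(Function<SAXParser, T> task) {
        SAXParser parser;
        try {
            parser = idle.take(); // blocks while all parsers are in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a parser", e);
        }
        try {
            return task.apply(parser);
        } finally {
            parser.reset();   // restore the parser's original configuration
            idle.add(parser); // hand it back for the next caller
        }
    }

    /** Number of parsers currently available for borrowing. */
    int available() {
        return idle.size();
    }
}
```

Callers never share a parser instance, so no global lock is needed around parsing itself; the only contention is on the queue when all parsers are busy.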
Best,
Jukka
On Thu, May 17, 2018 at 11:51 AM, Sebastian Nagel
wrote:
> Hi,
>
> two questions regarding thread-safety and locking in Tika's MIME type
> detectors
> while investigating global l
, but IIRC such a feature doesn't
currently exist in Tika or the underlying PDFBox library.
Best,
Jukka Zitting
On Wed, Apr 13, 2016 at 8:52 AM ron.vandenbranden <
ron.vandenbran...@kantl.be> wrote:
> Hi again,
>
>
> On 13/04/2016 13:18, ron.vandenbranden wrote:
>
>
>
You can make the test pass by changing the assertion to:
assertTrue(IOUtils.contentEquals(stream, originalStream));
Wrapping a stream with TikaInputStream doesn't magically add
mark/reset support to the original stream; only the wrapper instance
has this feature.
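
The same effect can be reproduced with plain JDK streams. The sketch below uses java.io.BufferedInputStream as a stand-in for TikaInputStream (an assumption made so that the example needs no Tika classes); the helper names unmarkable and peek are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class MarkResetDemo {

    /** Hypothetical stream without mark/reset support, like many file or network streams. */
    static InputStream unmarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readlimit) { /* not supported */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Reads up to n bytes through the wrapper and then rewinds the wrapper. */
    static byte[] peek(InputStream wrapper, int n) {
        try {
            wrapper.mark(n);
            byte[] head = new byte[n];
            int total = 0;
            while (total < n) {
                int read = wrapper.read(head, total, n - total);
                if (read < 0) {
                    break; // end of stream before n bytes
                }
                total += read;
            }
            wrapper.reset(); // only the wrapper is rewound
            return Arrays.copyOf(head, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After peek() the wrapper is back at its start, but the bytes have already been consumed from the original stream. Any later code has to keep reading through the wrapper, not through the original.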
As suggested by Nick, an easy way to meet that API contract is for the
client to wrap a stream into TikaInputStream before passing it to the
detector.
[1]
http://tika.apache.org/1.8/api/org/apache/tika/detect/Detector.html#detect(java.io.InputStream,%20org.apache.tika.metadata.Metadata)
t, so the lack of them coupled with even some fairly weak
SGML detection signals (stuff like upper case element names?) might be
enough to get significant improvements in this area.
BR,
Jukka Zitting
er.class, new IdentityHtmlMapper());
--
Jukka Zitting
ot leverage
that functionality, mostly because the underlying parsers should
already throw the correct exceptions. Thus it would be better to fix
this in PDFParser instead of AutoDetectParser.
BR,
Jukka Zitting
ggedInputStream(stream);
try {
    parse(tagged);
} catch (IOException e) {
    tagged.throwIfCauseOf(e); // throws IOException if from stream
    throw new TikaException("Parse error", e);
}
[1] http://tika.apache.org/1.0/api/org/apache/tika/io/TaggedInputStream.html
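
To show the idea behind that pattern without depending on Tika, here is a JDK-only sketch. The classes Tagged and TaggedIOException and the helpers isCauseOf, classify and broken are hypothetical stand-ins written for this example. They are not the actual TaggedInputStream implementation.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class TaggedStreamSketch {

    /** IOException that remembers which wrapper raised it. */
    static final class TaggedIOException extends IOException {
        final Object tag;

        TaggedIOException(IOException cause, Object tag) {
            super(cause.getMessage(), cause);
            this.tag = tag;
        }
    }

    /** Hypothetical wrapper that tags every IOException coming from the wrapped stream. */
    static final class Tagged extends FilterInputStream {
        Tagged(InputStream in) {
            super(in);
        }

        @Override public int read() throws IOException {
            try {
                return super.read();
            } catch (IOException e) {
                throw new TaggedIOException(e, this);
            }
        }

        @Override public int read(byte[] b, int off, int len) throws IOException {
            try {
                return super.read(b, off, len);
            } catch (IOException e) {
                throw new TaggedIOException(e, this);
            }
        }

        /** True if the exception originated from the stream wrapped by this instance. */
        boolean isCauseOf(Throwable t) {
            return t instanceof TaggedIOException && ((TaggedIOException) t).tag == this;
        }
    }

    /** Returns "stream" if reading the input failed, "parser" if the simulated parser failed. */
    static String classify(InputStream source, boolean parserFails) {
        Tagged tagged = new Tagged(source);
        try {
            while (tagged.read() != -1) {
                // consume the input as a parser would
            }
            if (parserFails) {
                throw new IOException("malformed document"); // simulated parser problem
            }
            return "ok";
        } catch (IOException e) {
            return tagged.isCauseOf(e) ? "stream" : "parser";
        }
    }

    /** Input stream that always fails, to simulate a broken file or connection. */
    static InputStream broken() {
        return new InputStream() {
            @Override public int read() throws IOException {
                throw new IOException("disk error");
            }
        };
    }
}
```

The tag lets the catch block tell a genuine I/O failure (to be rethrown as IOException) from an IOException raised by the parsing code itself (to be reported as a parse error).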
BR,
Jukka Zitting
Hi,
On Fri, Mar 14, 2014 at 5:13 PM, Grant Ingersoll wrote:
> On Mar 13, 2014, at 3:53 PM, Jukka Zitting wrote:
>> On Thu, Mar 13, 2014 at 3:41 PM, Grant Ingersoll wrote:
>>> But why would that test fail in the Tika dev environment?
>>
>> The Defa
specify
the content type in the input metadata, it won't know how to parse the
document.
BR,
Jukka Zitting
Hi,
To unsubscribe from this list, send a message to user-unsubscr...@tika.apache.org.
See http://tika.apache.org/mail-lists.html for more details.
BR,
Jukka Zitting
would otherwise be swallowed by the
DefaultHtmlMapper strategy. You can write a custom ContentHandler
class that detects the "donotparse" attributes and skips all content
within such elements.
BR,
Jukka Zitting
per class
that implements the class="donotparse" strategy you describe. This
approach requires changes in Tika, so you might want to consider
submitting a patch of your (ideally backwards-compatible) changes.
BR,
Jukka Zitting
?
The code that instantiates the TikaInputStream should also take care
of disposing it properly. If that happens, you shouldn't experience
the filling up of the /tmp space that you described.
BR,
Jukka Zitting
a.io.InputStream,
org.apache.tika.io.TemporaryResources)
BR,
Jukka Zitting
ing, do you have just tika-core deployed (AFAIUI that's the default
with Sling)? The core bundle doesn't contain any parser components, so
it won't be able to extract text from any documents. Deploying
tika-bundle along with core should fix that.
BR,
Jukka Zitting
xt
> contained within a document, which will be returned as a string. Currently,
> I have my program set up in the following way:
Have you tried:
new Tika().parseToString(node.getBinary().getStream())
That should cover your use case and be much simpler than what you're now doing.
BR,
Jukka Zitting
't (for example if it's a scanned
image), then there's little we can do.
BR,
Jukka Zitting
ent. We
should probably also make Tika degrade more gracefully when a
particular encoding detector is not present.
[1] http://code.google.com/p/juniversalchardet/
[2]
http://search.maven.org/#artifactdetails%7Ccom.googlecode.juniversalchardet%7Cjuniversalchardet%7C1.0.3%7Cjar
BR,
Jukka Zitting
lers like that.
Assuming your textBuffer is a Writer instance, you could instead try
replacing handler1 with something like this:
ContentHandler handler1 = new WriteOutContentHandler(textBuffer);
BR,
Jukka Zitting
ng file.
The MS Office detectors (and a few other features in Tika) rely on
that functionality, and thus won't give as accurate results when given
just a plain InputStream instance.
BR,
Jukka Zitting
eports
generated by Maven aren't too useful (some are even misleading), which
is why we're not including links to them in the site template. If
there are individual reports (like the SCM page) that do make sense,
then it would be a good idea to selectively add that to the template.
BR,
Jukka Zitting
ect < shiftjs.txt # look only at the byte stream
application/octet-stream
$ java -jar tika-app.jar --detect shiftjs.txt # Give the file name with .txt ending as a type hint
text/plain
$ java -jar tika-app.jar --text shiftjs.txt # Check that the encoding is correctly detected
電子商取引(エレクトロニックコマース)、オンライン [...]
Yes!
BR,
Jukka Zitting
solution so how should it be solved?
I think the ideal solution would be to have these changes included
directly in TagSoup.
BR,
Jukka Zitting
t use or require any specific logging
framework, but some of the parser libraries do, so in a typical Tika
deployment you'd already have at least the Commons Logging, SLF4J and
JUL interfaces available for logging.
BR,
Jukka Zitting
ar in Tika that you'd like to hear more
about.
[1] http://www.apachecon.eu/
BR,
Jukka Zitting
to use BoilerPipe on top of
instead of inside AutoDetectParser.
BR,
Jukka Zitting
lasses. That way code that for example checks the
type detection result against something like "text/plain" won't start
failing with a Tika version that might decide to qualify the type with
"text/plain; charset=UTF-8" or to return a more detailed media type
like "text/x-java-source".
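
As a rough illustration of why plain string equality is fragile here, the following JDK-only sketch compares types by their base form and by a small supertype table. The class MediaTypeCheck, its methods and the hand-written PARENT table are all hypothetical; Tika keeps this kind of knowledge in its own media type classes.

```java
import java.util.Locale;
import java.util.Map;

final class MediaTypeCheck {

    // Hypothetical supertype table for this example only.
    private static final Map<String, String> PARENT = Map.of(
            "text/x-java-source", "text/plain",
            "application/xhtml+xml", "application/xml");

    /** Strips parameters such as "; charset=UTF-8" and normalizes case. */
    static String baseType(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon < 0 ? mediaType : mediaType.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True if the detected type is the expected type or one of its specializations. */
    static boolean isKindOf(String detected, String expected) {
        String wanted = baseType(expected);
        for (String type = baseType(detected); type != null; type = PARENT.get(type)) {
            if (type.equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

With a check like this, a detector that starts returning a parameterized or more specific type does not break callers that only care whether the document is some kind of plain text.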
BR,
Jukka Zitting
run charset
detection already earlier at that point?
BR,
Jukka Zitting
ision=1066132
[2] http://svn.apache.org/viewvc?view=revision&revision=1091833
BR,
Jukka Zitting
Hi,
On Sun, Jul 22, 2012 at 12:34 PM, Jukka Zitting wrote:
> On Sun, Jul 22, 2012 at 2:23 AM, Oliver Steinau
> wrote:
>> However still no success with the tika-app. Here's what I tried (bear
>> with me, I'm on a windows system...)
>
> The problem soun
the
end of stream from the client.
BR,
Jukka Zitting
echanism is used to load services.
And in any case the static service loading is a fairly cheap operation
that's typically only done once during the lifetime of an application
or a bundle.
BR,
Jukka Zitting
ectly in all such cases.
> Do we have to remove surplus whitespace in Nutch ourselves?
I think that's the easiest solution here.
BR,
Jukka Zitting
ethod of obtaining access to the Detector and Parser
> involves something like this in your own bundles activator:
The reason why we use ServiceTrackers instead is that we want to
support deployments where new parser and detector services can be
added or removed dynamically from the running system.
BR,
Jukka Zitting
es no output, despite the file containing text.
> Tika tika = new Tika();
> System.out.print(tika.parseToString(new FileInputStream(xmlFile)));
See the BundleIT test case inside the tika-bundle component. That's a
pretty similar piece of code that works fine in an OSGi environment.
BR,
Jukka Zitting
Hi,
On Fri, Jul 6, 2012 at 5:00 PM, Kevin Milburn
wrote:
> On 2012/07/05 18:22, Jukka Zitting wrote:
>> upgrade to the latest 1.2 SNAPSHOT where declarative services is no longer
>> needed (see https://issues.apache.org/jira/browse/TIKA-896).
>
> I've built and installe
larative services is no
longer needed (see https://issues.apache.org/jira/browse/TIKA-896).
BR,
Jukka Zitting
st once you've
already loaded the Tika classes to memory. The server mode is
typically more interesting for non-Java clients that face the question
of either executing tika-app separately for each document or accessing
an already running server process.
BR,
Jukka Zitting
e of the tika-app simply parses documents sent through a
network connection programmatically or with a tool like netcat [1] and
responds with the parse output as governed by the rest of the tika-app
command line options.
[1] http://netcat.sourceforge.net/
BR,
Jukka Zitting
system
performance depends on your deployment details.
BR,
Jukka Zitting
n hour or so as the CI
build picks up that revision.
BR,
Jukka Zitting
OCR
tooling.
BR,
Jukka Zitting
Hi,
Currently the ForkParser doesn't return metadata, though adding that
feature shouldn't be too difficult. My original use case didn't need
metadata, so I never implemented that bit.
Jukka Zitting
On 1.5.2012 at 19.26, "Michael McCandless" wrote:
> Does anyone know
ar tika-app-1.1.jar --detect sample_fixed.wde
java -jar tika-app-1.1.jar --detect < sample_fixed.wde
BR,
Jukka Zitting
also from just the byte stream.
A typical reason why an XML document is detected as text/plain is if
it's actually not valid XML, either because of some well-formedness
issue (unclosed tags) or because of some extra characters like
suggested by Nick.
BR,
Jukka Zitting
nt parser classes to determine which libraries they're using. Or
try removing dependencies until the parser you're interested in no
longer works.
BR,
Jukka Zitting
cause trouble for POI.
Or you can just try upgrading the dependency and see if it works. :-)
BR,
Jukka Zitting
mance reasons the container detection mechanism is
skipped.
Using new Tika().detect(new File(name)) takes care of all these
details for you, which is why it's the recommended way to do type
detection unless you explicitly need direct access to the lower-level
functionality in Tika.
BR,
Jukka Zitting
have trouble accessing the central Maven repository. Do
you have some firewall or HTTP proxy (perhaps a local Maven repository
manager) that could be blocking your access? Try seeing if you can
access http://repo1.maven.org/maven2/org/apache/apache/10/ directly in
your browser.
BR,
Jukka Zitting
Mapper.class, IdentityHtmlMapper.INSTANCE);
Parser parser = ...;
parser.parse(..., context);
BR,
Jukka Zitting
rom a contributor to a committer works at Apache.
BR,
Jukka Zitting
e.
[1] https://issues.apache.org/jira/browse/TIKA
BR,
Jukka Zitting
ntain most of the metadata entries normally returned in the Metadata
object."
[1] http://tika.apache.org/1.0/api/org/apache/tika/Tika.html
[2]
http://stackoverflow.com/questions/8349898/why-is-my-tika-metadata-object-not-being-populated-when-using-forkparser/8354392#8354392
BR,
Jukka Zitting
web site for the Tika version you're using.
BR,
Jukka Zitting
print. [Disclosure: I'm one of the authors] :-)
Seriously though, contributions to documentation are very much welcome.
BR,
Jukka Zitting
entering such a
trap. Instead the crawler should employ heuristics like maximum
recursion depth, etc. to avoid such problems.
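
A minimal sketch of the maximum-depth heuristic, assuming a hypothetical linksOf function that returns the resolved links found on a page. The class and method names are made up for this example and do not come from any crawler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class DepthLimitedCrawl {

    /** Breadth-first crawl that never follows links from pages at maxDepth. */
    static Set<String> crawl(String start, int maxDepth,
                             Function<String, List<String>> linksOf) {
        Map<String, Integer> depthOf = new LinkedHashMap<>(); // also serves as the visited set
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // depth limit reached: do not follow links from this page
            }
            for (String next : linksOf.apply(url)) {
                if (depthOf.putIfAbsent(next, depth + 1) == null) {
                    queue.add(next); // first time this URL is seen
                }
            }
        }
        return depthOf.keySet();
    }
}
```

A site where every page links to a deeper "wrong-link/" URL would otherwise generate new URLs forever; with maxDepth set to 3 the crawl stops after the start page and three levels of links.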
BR,
Jukka Zitting
so the next wrong link is resolved as
> http://example.org/content/wrong-link/wrong-link/..
>
> An endless nightmare for a crawler :)
How would not resolving the links in Tika help in this case? To crawl
the site, the crawler would in any case have to resolve the links, and
come up with the exact same resolved URLs.
BR,
Jukka Zitting
a looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY
metadata keys for the default base URL. If neither is present and
there is no <base> element, then URLs in the document will
not be resolved.
BR,
Jukka Zitting
onclusively fixed the problem that started
this discussion. See TIKA-701 and the related commits for details.
BR,
Jukka Zitting
ption" I think it's OK?
>
> But maybe we can beef up its javadocs a bit, saying "NOTE: unlike all
> other APIs parsing from an InputStream, this API closes the incoming
> InputStream for you for convenience" or something?
I added something along these lines in revision 1164049.
BR,
Jukka Zitting
n out in the case
where close() fails when no other exception has been thrown. Instead
of one exception masking another, you'd have no exceptions masking
one!
BR,
Jukka Zitting
der the problem rather theoretical and would
rather opt for cleaner code that avoids the extra constructs.
BR,
Jukka Zitting
Hi,
On Tue, Aug 30, 2011 at 11:19 PM, Jukka Zitting wrote:
> Yes, I think you're right. I believe the problem here is the
> openContainer field within TikaInputStream where the container-aware
> type detection code stores the already opened container (in this case
> an NPOIFS
Hi Mark,
On Wed, Aug 31, 2011 at 5:31 PM, Mark Kerzner wrote:
> I used this statement
> [...]
> but still got many deleted files left opened.
The problem is not with your code, it's what happens inside Tika.
BR,
Jukka Zitting
iles mechanism to a
more generic TemporaryResources class that could also take care of
properly disposing of non-file resources associated with a
TikaInputStream instance.
BR,
Jukka Zitting
ng call is made, it makes more sense for the
parseToString() method to take care of closing the stream. The result
is that the above code can be reduced to:
return tika.parseToString(..., ...);
BR,
Jukka Zitting
lly registered at IANA. The alias settings in the
mimetypes file allow Tika to correctly detect such aliases and to
automatically map them to the official type name.
> And how to get the iana.org mime-type name instead of sub-class-of type name ?
See above.
[1] https://issues.apache.org/ji
rself, you need to check them out from
version control. See [1] for more instructions on getting started and
[2] on how to check out the latest source tree.
[1] http://tika.apache.org/0.9/gettingstarted.html
[2] http://tika.apache.org/source-repository.html
BR,
Jukka Zitting
're seeing and details of your build environment
(output of "mvn --version" is pretty good).
[1]
https://repository.apache.org/content/groups/snapshots-group/org/apache/tika/tika-app/.
BR,
Jukka Zitting
lMapper object properly through the
ParseContext object to the parsing process? You should have a line of
code like this somewhere:
context.set(HtmlMapper.class, new MyCustomHtmlMapper());
BR,
Jukka Zitting
d such exclusion
rules into the format-specific parsers (for example it's easy to
exclude headers and footers within the office format parsers).
BR,
Jukka Zitting
ed text to a Writer or
an OutputStream, you can use the WriteOutContentHandler class for that. To
explicitly specify the output encoding you want, use a
java.io.OutputStreamWriter wrapper around your output stream.
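
The OutputStreamWriter part can be sketched with JDK classes alone. The class ExplicitEncoding and its encode method are hypothetical names for this example; in a real setup the Writer would be handed to the content handler instead of being written to directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class ExplicitEncoding {

    /** Writes text through an OutputStreamWriter that uses the given charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Naming the charset explicitly avoids depending on the platform default.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The same character can produce different bytes depending on the charset passed in, which is why relying on the platform default tends to give different output on different machines.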
BR,
Jukka Zitting
n.com/javase/technologies/core/basic/intl/faq.jsp#default-encoding
BR,
Jukka Zitting
nstallation depends on
how you want to use Tika, e.g. as a Maven/Ant dependency, a standalone
runnable jar, or something else.
BR,
Jukka Zitting
oduced in Tika 0.9 can be used to run text
extraction in a background process so that a possible OOM error or
even a JVM crash won't affect your application.
BR,
Jukka Zitting
t'll pick up the
> container parsers dynamically for you
Yes, you can just do:
String mimeType = new Tika().detect(in);
This will automatically find and use all the detectors available in
the classpath, and will even take care of the TikaInputStream wrapping
for you.
BR,
Jukka Zitting
the details.
[1] http://tika.apache.org/0.9/api/org/apache/tika/parser/html/HtmlMapper.html
BR,
Jukka Zitting
On 05/04/2011 09:55 AM, Jukka Zitting wrote:
On 05/04/2011 07:55 AM, Sascha Rodekamp wrote:
Here is an extract of my code to show what I'm trying to do:
InputStream is = new FileInputStream(file);
Tika tika = new Tika();
tika.detect(is);
It always throws a java.io.IOException:
stack trace?
--
Jukka Zitting
Hi,
On 23.04.2011 06:02, Jin Xu wrote:
> is the method "detect" of instance "org.apache.tika.Tika" thread
> safe, please ?
Yes, it is.
--
Jukka Zitting
ails.
[1] http://tika.apache.org/0.9/api/org/apache/tika/Tika.html
BR,
Jukka Zitting
to report such issues so they can
be fixed in future versions.
--
Jukka Zitting
Hi,
From: Ilya Zavorin [mailto:izavo...@caci.com]
> So how do I get #1 but with BOM?
Try using --encoding=UnicodeLittle. See [1] for the available encoding names in
Java 5.
[1] http://download.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html
BR,
Jukka Zitting
.
--
Jukka Zitting
erify this, try replacing the 'java -jar
"C:\Code\ATEK\CT\apache-tika-0.9\tika-app-0.9.jar" --xml' part of your
command line with a simpler command like just 'dir'. I predict that
you'll get a "File Not Found" error message as a result.
--
Jukka Zitting
direct
download from the Tika web site and perhaps include a note there to look
at Maven Central for the other jars.
--
Jukka Zitting
pen when you serialize SAX
events into a character or byte stream. The characters in the array passed in a
characters() event are not escaped.
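
A small JDK-only sketch of that distinction: the text is passed unescaped to characters(), and escaping only appears once a serializer turns the SAX events into markup. The class SaxEscapingDemo and the element name "p" are made up for this example.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class SaxEscapingDemo {

    /** Sends the text as one characters() event and returns the serialized XML. */
    static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            char[] chars = text.toCharArray(); // raw, unescaped characters
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString(); // escaping is applied here, by the serializer
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A handler that collects plain text from characters() sees "<" and "&" as they are; only the serialized output carries the "&lt;" and "&amp;" forms.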
BR,
Jukka Zitting
an unfortunate regression that got included in the 0.8 release. See
TIKA-548 [1] for the details.
The problem is fixed in the latest 0.9-SNAPSHOT version, and we probably should
cut a new release soon with this fix.
[1] https://issues.apache.org/jira/browse/TIKA-548
BR,
Jukka Zitting
een somewhat neglected
lately after the introduction of the Detector interface and the Tika.detect()
convenience methods. I'd like to deprecate the getMimeType() methods once we
have equivalent or better alternatives in the Tika façade class.
BR,
Jukka Zitting
metadata
fields to avoid confusion later on, so you may want to prepare for some extra
upgrade work with 1.0.
BR,
Jukka Zitting
tch release when a fix is available.
PS. We'll need to update the copyright year in NOTICE.txt.
BR,
Jukka Zitting
simply put the full tika-app jar in your classpath
instead of tika-core and tika-parsers.
See the dependency notes in http://tika.apache.org/0.7/gettingstarted.html for
more background.
BR,
Jukka Zitting
if you already have the markup of a wiki page available as a string
or a character stream (for example if you're accessing the underlying
database or JSON exports directly), then there may be no need to
involve Tika in the process.
BR,
Jukka Zitting
POI or
PDFBox directly to produce documents in a specific format.
BR,
Jukka Zitting
ou have a full stack trace of the error?
BR,
Jukka Zitting
e sources of that class.
If you need to tweak the set of parsers used by your application, a
better alternative would probably be something like using the new
AutoDetectParser(Parser... parsers) constructor available in the svn
trunk (and in the upcoming 0.8 release).
BR,
Jukka Zitting
l of them in tika-config?
Yes. If a tika-config.xml has been specified (by calling a non-default
TikaConfig constructor), then only the parser classes listed in that
configuration file are loaded.
BR,
Jukka Zitting
gestion to implement this feature directly in POI.
[1] http://www.iana.org/assignments/media-types/application/vnd.ms-tnef
[2] http://www.apache.org/legal/resolved.html
[3] http://www.apache.org/legal/resolved.html#criteria
BR,
Jukka Zitting