solution.
It would be great if someone wanted to carry on with further
improvements.
--
Jukka Zitting
sm to correctly detect such files. To avoid the extra work, you
could simply mark your new parser as being able to handle all files of
the more generic type, and then in your parser include a fallback
option to call the original Tika parser when encountering a file the
new parser can't handle.
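
A rough sketch of that fallback arrangement, for illustration only. The
DocParser interface, both parser classes and the "SPECIAL:" marker are
hypothetical stand-ins invented for this example; they are not Tika's Parser
API, and the marker check only stands in for whatever test tells the new
parser that it understands a file.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a parser interface; not Tika's Parser API. */
interface DocParser {
    String parse(byte[] data);
}

/** Plays the role of the original parser for the more generic type. */
class GenericParser implements DocParser {
    @Override
    public String parse(byte[] data) {
        return "generic:" + new String(data, StandardCharsets.UTF_8);
    }
}

/** Claims the whole generic type and delegates whatever it cannot handle. */
class SpecializedParser implements DocParser {
    // Invented marker, used here only to decide "can I handle this file?"
    private static final String MARKER = "SPECIAL:";

    private final DocParser fallback;

    SpecializedParser(DocParser fallback) {
        this.fallback = fallback;
    }

    @Override
    public String parse(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        if (text.startsWith(MARKER)) {
            // The new parser recognizes the file and handles it itself.
            return "special:" + text.substring(MARKER.length());
        }
        // Anything else goes to the original parser.
        return fallback.parse(data);
    }
}
```

The point of the design is that no extra detection logic is needed up front:
the specialized parser is registered for the generic type and makes the
decision itself at parse time.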
BR,
Jukka Zitting
tection to the parsing
phase [2].
[1]
https://tika.apache.org/1.6/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer()
[2]
https://github.com/apache/tika/blob/1.6/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java#L385
BR,
Jukka Zitting
Hi,
Copyright also covers databases, so we'll need to honor the license
terms equally when copying file's code or detection patterns. Luckily
file (from http://www.darwinsys.com/file/) comes under a BSD license,
so reusing the code or data is quite simple from a licensing
perspective. In fact we'v
tor, so it would be good to have one or two new volunteers. See
http://apache.org/dev/committers.html#mailing-list-moderators for more
details.
Best,
Jukka Zitting
Searching for a groupId in https://issues.sonatype.org/projects/OSSRH can
also help in other similar cases.
Best,
Jukka Zitting
uch images also as a simple preview mechanism.
BR,
Jukka Zitting
If you want to use a more recent POI version, you need to use the
latest Tika 1.0-SNAPSHOT version from svn trunk.
[1] http://tika.apache.org/0.9/gettingstarted.html
BR,
Jukka Zitting
ng?
Correct, you'd need to store the preview somewhere.
Note that with the TeeContentHandler class you can get both a
text-only output for indexing and an XHTML output for preview from a
single parsing pass through Tika.
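
To illustrate the idea, here is a minimal sketch that uses only the SAX
classes shipped with the JDK. TeeSketchHandler, TextOnlyHandler and
MarkupHandler are hypothetical stand-ins written for this example; in Tika
itself TeeContentHandler would take the place of TeeSketchHandler, and the
events would come from a Tika parser rather than from an XML parser. The
markup handler skips attributes and escaping to keep the sketch short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Forwards each SAX event to every delegate (stand-in for the tee idea). */
class TeeSketchHandler extends DefaultHandler {
    private final ContentHandler[] delegates;

    TeeSketchHandler(ContentHandler... delegates) {
        this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler h : delegates) {
            h.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        for (ContentHandler h : delegates) {
            h.endElement(uri, local, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : delegates) {
            h.characters(ch, start, length);
        }
    }
}

/** Keeps only the character data, e.g. for indexing. */
class TextOnlyHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

/** Keeps element names and text, e.g. for a simple preview. */
class MarkupHandler extends DefaultHandler {
    final StringBuilder markup = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        markup.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        markup.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        markup.append(ch, start, length);
    }
}

class TeeSketch {
    /** Parses the input once and returns {text, markup}. */
    static String[] parseOnce(String xml) {
        TextOnlyHandler text = new TextOnlyHandler();
        MarkupHandler markup = new MarkupHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new TeeSketchHandler(text, markup));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return new String[] { text.text.toString(), markup.markup.toString() };
    }
}
```

Calling TeeSketch.parseOnce(...) reads the input a single time and fills both
outputs, which is the property that matters when parsing is expensive.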
BR,
Jukka Zitting
t designed to preserve
the full formatting of the original document.
For a more accurate document preview feature you'll need to look for
solutions beyond Tika.
BR,
Jukka Zitting
Hi,
On Tue, Aug 30, 2011 at 7:20 PM, Michael McCandless
wrote:
> Could someone (jira admin) please give me (mikemccand in Jira) enough
> karma so I can assign issues to myself?
Done.
BR,
Jukka Zitting
s. Our Maven build already standardizes to
UTF-8, but there's no guarantee that someone who later edits the file
uses the correct encoding settings.
BR,
Jukka Zitting
worries about temporary files, so it won't close the possible
temporary stream created in getFile(). And even more worryingly the
getFile() or afterRead() methods of the temporary TikaInputStream
instance could still end up closing the underlying stream even though
that's exactly what we're trying to avoid with this construct.
BR,
Jukka Zitting
tions.
I think that's too high a price to pay for the IMHO rather marginal
benefits. Let's wait for the upgrade to Java 7 and do it properly
then.
BR,
Jukka Zitting
tible class there.
I filed TIKA-703 [1] to track the removal of all deprecated parts of
our public API.
[1] https://issues.apache.org/jira/browse/TIKA-703
BR,
Jukka Zitting
major version upgrades like the 0.x to
1.x jump we're about to make.
BR,
Jukka Zitting
onality with
some more specific checks in revision 1165259, and the resulting code
should now work correctly with all the test documents we have.
Improvements welcome, as I'm no expert on POI or the Office file format.
BR,
Jukka Zitting
s/trunk/.svn/lock'>:
> Permission denied
Not sure what the problem is there. As a workaround I simply configured
the Tika-trunk build to not use the solaris2 build slave where this
problem occurs.
BR,
Jukka Zitting
Hi,
2011/9/5 Maxim Valyanskiy :
> On 05.09.2011 at 16:23, Jukka Zitting wrote:
>> This was my attempt at properly handling the embedded PDF in
>> TestWithPdf.docx. It was included in an OLE object with the PDF
>> document as its "CONTENTS" entry. I restored
nk to get the latest
code out while we wait for 1.0 to be ready for release.
BR,
Jukka Zitting
Hi,
On Sun, Sep 18, 2011 at 12:06 PM, Apache Jenkins Server
wrote:
> mojoFailed org.apache.felix:maven-bundle-plugin:2.3.5(default-bundle)
My mistake, fixed in revision 1172241. The maven-bundle-plugin version
2.3.5 depends on Java 6, whereas version 2.3.4 also works with Java
5.
BR,
Ju
Hi,
On Wed, Sep 21, 2011 at 6:18 PM, wrote:
> TIKA-716 Fix tika-bundle dependency list following apache-Mime4J upgrade
Good catch, thanks!
BR,
Jukka Zitting
1.0 release.
I think the trunk is pretty much ready to be released, so I'd
suggest we cut the release this week already, for example over the
weekend. Chris, do you want to take care of it? I should also have
some spare cycles to cut the release if needed.
BR,
Jukka Zitting
l probably need to
extend the Metadata class to handle things like namespaces and
structured values.
BR,
Jukka Zitting
Hi,
On Fri, Sep 23, 2011 at 3:06 PM, Ken Krugler
wrote:
> On Sep 23, 2011, at 3:24am, Jukka Zitting wrote:
>> In any case it would still be good to map RDFa tags also to the
>> Metadata object. To do that properly (and to open the way to better
>> XMP integration, m
o position Tika more
prominently on their radars. The Any23 proposal that Chris is
championing is one good chance for this.
Also, now that I work at Adobe, my XMP itch has been growing quite a
bit, so I wouldn't be surprised if I ended up working on better XMP
(and thus RDF) support soon after Tika 1.0 is out.
BR,
Jukka Zitting
Hi,
On Fri, Sep 23, 2011 at 11:16 PM, Nick Burch wrote:
> I'm fairly sure it's not related to my changes, but happy to be corrected if
> it is!
Looks like the culprit is my change to the way the attributes
are resolved. I'm just fixing it.
BR,
Jukka Zitting
Hi,
On Fri, Sep 23, 2011 at 11:18 PM, Jukka Zitting wrote:
> Looks like the culprit is my change to the way the attributes
> are resolved. I'm just fixing it.
Fixed in revision 1175043.
BR,
Jukka Zitting
ither way (recut or update the RC) is fine by me.
BR,
Jukka Zitting
n/apache-tika-0.10/rc1/CHANGES-0.10.txt
[2] http://www.apache.org/dist/tika/CHANGES-0.9.txt
BR,
Jukka Zitting
5.
BR,
Jukka Zitting
jar name as short as possible. Ideally
we'd even drop the -app part, but that would make the Maven setup a
bit awkward.
BR,
Jukka Zitting
for the tika-bundle-it component.
BR,
Jukka Zitting
tml
BR,
Jukka Zitting
issue tracker to add a new bug?
In the upper right corner of
https://issues.apache.org/jira/browse/TIKA you should see a login
link. If you don't already have an account, you can register one by
following the link on the login screen.
BR,
Jukka Zitting
Hi,
On Fri, Oct 14, 2011 at 12:10 AM, Apache Jenkins Server
wrote:
> See
> <https://builds.apache.org/job/Tika-trunk/org.apache.tika$tika-parsers/683/changes>
Sorry, my mistake. Fixed in revision 1183239.
BR,
Jukka Zitting
ghts?
Why not just use Tika.getDetector()? Or new DefaultDetector()?
TikaConfig doesn't currently have anything to do with Detectors, so my
instinct would be to avoid such an extra method unless we actually
want to add some sort of a custom detector configuration mechanism.
BR,
Jukka Zitting
ess to the
underlying functionality, and the Tika constructors allow complete
customization of these component instances, including by specifying a
custom TikaConfig.
BR,
Jukka Zitting
tDetector() method to TikaConfig.
BR,
Jukka Zitting
ting new features that
otherwise might get lost in the noise.
BR,
Jukka Zitting
what
happened than a detailed listing of each individual change would have
done.
BR,
Jukka Zitting
never
there's a good chance for that.
We haven't really lived up to such an ideal lately, but big +1 for
bringing this up and leading the way!
BR,
Jukka Zitting
recent revisions for the
details.
BR,
Jukka Zitting
r the release process, and we should then
have the release out nicely just in time for the ApacheCon.
BR,
Jukka Zitting
tests to a
separate java6 profile.
> [WARNING] File encoding has not been set, using platform encoding
> ANSI_X3.4-1968, i.e. build is platform dependent!
Hmm, looks like we should set source encoding explicitly to UTF-8...
BR,
Jukka Zitting
ntly come out corrupted if you don't have
> this in your classpath.
+1
BR,
Jukka Zitting
Hi,
On Thu, Oct 27, 2011 at 6:42 PM, Jukka Zitting wrote:
> How about if we leave the trunk open still for the weekend, and cut
> the 1.0 release candidate at the beginning of next week?
With TIKA-565 and TIKA-763 resolved the trunk is now ready for release
as far as I'm concerned.
Hi,
On Tue, Nov 1, 2011 at 6:10 PM, Nick Burch wrote:
> On Tue, 1 Nov 2011, Jukka Zitting wrote:
>> TIKA-764 is currently marked for 1.0. Nick, is it ready to be resolved or
>> should we postpone it to a later release?
>
> We should maybe split it and resolve the first par
Hi,
On Fri, Nov 4, 2011 at 4:42 PM, Mattmann, Chris A (388J)
wrote:
> Please vote on releasing this package as Apache Tika 1.0.
[x] +1 Release this package as Apache Tika 1.0
[ ] -1 Do not release this package because...
Signatures, build, etc. OK. Thanks!
BR,
Jukka Zitting
py, Tika.rb, Tika.js, Tika.pm and Tika.php bindings (plus
whatever else people may be interested in) that just reflect the key
functionality found in Tika.java.
Anyone interested in joining such an effort? Any pointers to existing
work along similar lines?
BR,
Jukka Zitting
Hi,
The effort spent on CHANGES.txt is clearly paying off. See for example
[1] where the information is nicely being spread to a wider audience.
[1] http://java.dzone.com/news/apache-tika-10-solidifies
BR,
Jukka Zitting
lity
error.
That's such a minor issue that I just explicitly excluded the enum
types from the clirr check in revision 1200889.
BR,
Jukka Zitting
tadata-extractor
BR,
Jukka Zitting
hich I'm a member), but I suppose we should be
able to come up with an arrangement where Tika committers can commit
directly to the Tika parser implementation in PDFBox.
It would be cool if we could do the same thing also with POI.
WDYT?
[1] https://issues.apache.org/jira/browse/PDFBOX-1132
BR,
Jukka Zitting
of the tika-developers group which grants
full admin access to the TIKA project in Jira.
I just added you and Jérôme to this group. Enjoy!
BR,
Jukka Zitting
nk the tradeoff favored focusing our work on Tika
itself, but now with stable 1.0 APIs I think the time may be ripe to
start reducing the size of tika-parsers (which has been growing pretty
much, see [1]).
[1] https://www.ohloh.net/p/tika/analyses/latest
BR,
Jukka Zitting
y upgrading the relevant parser libraries if they face problems with
a particular document.
BR,
Jukka Zitting
settings.
> - when we release new tika version, old pdfbox may not work
> with it until the next release
We're explicitly committed to maintaining backwards compatibility (see
https://issues.apache.org/jira/browse/TIKA-699) until Tika 2.0, so any
case where a new Tika release breaks an existing upstream parser
should be treated as a bug and fixed.
BR,
Jukka Zitting
as just thinking of stuff like that a parser should preferably use
XMP schemas when exposing metadata, not about inventing our own
schemas.
BR,
Jukka Zitting
Hi Adam,
Welcome! To subscribe, send a message to
dev-subscr...@tika.apache.org. For more details, see
http://tika.apache.org/mail-lists.html.
BR,
Jukka Zitting
Hi,
On Mon, Jan 30, 2012 at 3:40 PM, Nick Burch wrote:
> What do people think is the best way to handle this sort of thing?
I'd go with XMPDM, as that's already a dependency of the described
piece of code.
BR,
Jukka Zitting
uments).
Opening the Metadata class for convenience methods like these can be a
Pandora's box, but it would also simplify quite a bit of code both on
the client and the parser side.
BR,
Jukka Zitting
Hi,
On Mon, Jan 30, 2012 at 4:20 PM, Nick Burch wrote:
> On Mon, 30 Jan 2012, Jukka Zitting wrote:
>> What we might also consider, as an extra convenience, are Metadata methods
>> like: [...]
>
> If we're doing that sort of thing, then I'd rather we put the logic on
Hi,
On Fri, Feb 17, 2012 at 7:26 PM, wrote:
> The Buildbot has detected a new failure on builder tika-trunk
Sorry, my config handling change apparently broke OSGi service
loading. I'll fix that later tonight.
BR,
Jukka Zitting
retty soon.
In fact given the time that has passed since 1.0, I think it would be
a good idea to push for a 1.1 release already this month.
BR,
Jukka Zitting
Hi,
On Wed, Mar 7, 2012 at 10:35 PM, Mattmann, Chris A (388J)
wrote:
> Please vote on releasing this package as Apache Tika 1.1.
[x] +1 Release this package as Apache Tika 1.1
Thanks!
BR,
Jukka Zitting
ated when the build is run in a Java 6+ environment.
BR,
Jukka Zitting
IMO a more appropriate verb to use is POST, that's meant (among other
things) for:
"Providing a block of data [...] to a data-handling process;"
... which is what tika-server does.
BR,
Jukka Zitting
configuration or adding an explicit exclude rule to the rat plugin
configuration.
BR,
Jukka Zitting
n tika-server. Anyone?
BR,
Jukka Zitting
]
https://marketplace.atlassian.com/plugins/com.sourcelabs.jira.plugin.report.contributions
BR,
Jukka Zitting
mean the native
list formatting of those document types?
The Tika parsers for PDF and Office documents could/should
automatically map such formatting to equivalent XHTML constructs, but
I don't think they currently do. You'll need to look into the source
code to see how to make that happen.
BR,
Jukka Zitting
Stream(stream, 1000));
However, see the concern in TIKA-307 [2]. Passing a truncated stream
to Tika may produce unexpected results.
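
To make the truncation concern concrete, here is a small JDK-only sketch of
what a byte limit does to a stream. LimitedInputStream is a hypothetical
stand-in written for this example, not the Commons IO BoundedInputStream
class [1]; it only shows that the consumer sees a normal end of stream once
the limit is reached, with no sign that data was cut off.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in: reports end of stream after at most 'limit' bytes. */
class LimitedInputStream extends FilterInputStream {
    private long remaining;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // limit reached: looks like a regular end of stream
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skipped bytes count against the limit as well.
        long skipped = super.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // keep the sketch simple: no mark/reset bookkeeping
    }
}

class LimitedStreamDemo {
    /** Counts how many bytes a consumer can read through the limit. */
    static int countBytes(byte[] data, long limit) {
        try (InputStream in = new LimitedInputStream(new ByteArrayInputStream(data), limit)) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A parser reading through such a wrapper cannot tell a 1000-byte document from
a larger one that was cut at 1000 bytes, which is one way the unexpected
results mentioned above can arise.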
[1]
http://commons.apache.org/io/api-release/org/apache/commons/io/input/BoundedInputStream.html
[2] https://issues.apache.org/jira/browse/TIKA-307
BR,
Jukka Zitting
Hi,
On Wed, Jun 6, 2012 at 2:15 PM, Baranee wrote:
> Can u pls tell me how to use the beforeRead() method in TikaInputStream to
> set readlimit for reading bytes from a stream.
http://people.apache.org/~hossman/#xyproblem
Why do you want to use TikaInputStream like this?
BR,
Jukka Zitting
then invoke the standard XMLParser on the
result.
BR,
Jukka Zitting
oaches where 1) is used to ensure
contractual correctness and 2) to prevent too eager spooling of
streams (and to act as a failsafe in case some code fails to honor
requirement 1).
WDYT?
BR,
Jukka Zitting
Hi,
On Fri, Jun 29, 2012 at 11:21 PM, wrote:
> The Buildbot has detected a new failure on builder tika-trunk while building
> ASF Buildbot.
Oops, sorry about that. Fixed in revision 1355579.
BR,
Jukka Zitting
dded a workaround in revision 1355746.
[1] http://jira.codehaus.org/browse/MSHADE-23
BR,
Jukka Zitting
brary [1].
[1] http://hc.apache.org/httpcomponents-core-ga/
BR,
Jukka Zitting
Hi,
On Sun, Jul 1, 2012 at 6:27 PM, Mattmann, Chris A (388J)
wrote:
> On Jul 1, 2012, at 5:09 AM, Jukka Zitting wrote:
> Sergey Beryozkin (who I'm CC'ing on this email since I'm not sure
> he's subscribed to dev@) helped by providing guidance on the CXF
> side
ts,
> advanced search capabilities, OAuth2, seem to be of possible use in the
> project.
That's all fine, but do we really need such features in Tika? For
example, what could tika-server possibly need OAuth2 for?
BR,
Jukka Zitting
(com/adobe/xmp/XMPException.class)
> class file has wrong version 50.0, should be 49.0
Hmm, looks like Java 6 is needed for the xmpcore dependency. For now I
simply solved this issue by moving the tika-xmp module to a separate
java6 profile in revision 1356510, but I think we need some better
Hi,
On Tue, Jul 3, 2012 at 8:57 AM, Apache Jenkins Server
wrote:
> cause : Too many unapproved licenses: 1
That was the tika-dotnet/.gitignore file I added earlier. It's no
longer needed, so I removed it in revision 1356619.
BR,
Jukka Zitting
o access Tika features without
having to spawn a separate Java process for that.
BR,
Jukka Zitting
Hi,
On Tue, Jul 3, 2012 at 4:03 PM, Joerg Ehrlich wrote:
> A new version of XMPCore compiled for JDK 1.5 has been uploaded to Maven
> Central: 5.1.2
Great! In revision 1356776 I upgraded the XMPCore dependency and moved
tika-xmp back to the main build.
BR,
Jukka Zitting
Hi,
On Wed, Jul 4, 2012 at 12:13 PM, wrote:
> BUILD FAILED: failed compile
Looks like a Buildbot error. The build works fine for me locally.
BR,
Jukka Zitting
c available in the media
type registry. With the isInstanceOf helper method I just added this
becomes:
String type = metadata.get(Metadata.CONTENT_TYPE);
MediaTypeRegistry registry = ...;
if (registry.isInstanceOf(type, MediaType.TEXT_HTML)) { ... }
BR,
Jukka Zitting
Hi,
On Tue, Jul 10, 2012 at 10:29 PM, Mattmann, Chris A (388J)
wrote:
> Please vote on releasing this package as Apache Tika 1.2.
[x] +1 Release this package as Apache Tika 1.2
BR,
Jukka Zitting
nce that's the
pattern we've been following also in Jackrabbit, based originally on
examples from HTTP Server and Lucene.
BR,
Jukka Zitting
unky character would
seem like the best workaround.
BR,
Jukka Zitting
ature
(automatically ignoring empty content).
What's the SAX library you're using to serialize the output from Tika?
You may also want to try the ToXMLContentHandler class in o.a.t.sax.
It can serialize SAX events and doesn't suffer from this problem.
BR,
Jukka Zitting
Hi,
On Tue, Jul 17, 2012 at 12:10 PM, Ray Gauss II wrote:
> Should I merge this to tags/1.2?
It's a bad idea to modify tags that have already been released. But it
should be fine to apply the patch manually before building the 1.2
javadocs for inclusion on the web site.
BR,
Jukka Zitting
Hi,
On Wed, Aug 1, 2012 at 4:22 PM, Ray Gauss II wrote:
> Anyone have ideas on this one? Is it really something I did?
Looks like a Jenkins problem. The Jenkins setup at Apache has been
quite unstable over the last few months.
BR,
Jukka Zitting
Hi,
Did someone already submit a talk about Tika to ApacheCon Europe [1]?
If not, I'll submit one.
[1] http://www.apachecon.eu/
BR,
Jukka Zitting
.
I'll ask on the users@ list if there are people planning to attend the
conference for more input on topics to cover.
BR,
Jukka Zitting
inding)
BR,
Jukka Zitting
still be clients out there that expect this
information to be present as CONTENT_ENCODING.
In fact, unless the abuse of that field is actively harmful (i.e.
clients need to add extra workarounds to clean up the metadata), I'd
keep the field in place all the way until Tika 2.0.
BR,
Jukka Zitting
to
automatically detect the correct encoding and use it if the declared
one is obviously incorrect.
BR,
Jukka Zitting
ntHandler is only interested in
stuff inside the element, not outside it.
BR,
Jukka Zitting
"body"); // no match, ignore
startElement("p"); // match, call super.startElement("p")
endElement("p"); // match, call super.endElement("p")
endElement("body"); // no match, ignore
endElement("html"); // no match, ignore
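
The event trace above can be reproduced with a small JDK-only sketch.
InsideElementHandler is a hypothetical class written for this example and
matches on a plain element name; it is not the implementation of any Tika
handler, which may match differently (for instance with an XPath-style
matcher).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Forwards SAX events to the delegate only while inside a matching element. */
class InsideElementHandler extends DefaultHandler {
    private final String target;
    private final ContentHandler delegate;
    private int depth = 0; // open elements at or below the matching element

    InsideElementHandler(String target, ContentHandler delegate) {
        this.target = target;
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || qName.equals(target)) {
            depth++;
            delegate.startElement(uri, local, qName, atts); // match, forward
        }
        // otherwise: no match, ignore
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) {
            delegate.endElement(uri, local, qName); // match, forward
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            delegate.characters(ch, start, length);
        }
    }
}

class InsideElementSketch {
    /** Parses xml and records only the events inside the target element. */
    static String filter(String xml, String target) {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new InsideElementHandler(target, recorder));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

With "p" as the target, the html and body events are dropped and only the
paragraph and its contents reach the wrapped handler.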
BR,
Jukka Zitting