All,
I just added temurin-17 and temurin-21 to the vm. I'm going to set
the default to temurin-17.
Best,
Tim
All,
The lazy consensus on
https://issues.apache.org/jira/browse/LEGAL-696 was to keep the vm
running, but to limit access to committers. The public file server
will remain shut down.
If anyone has any other options, please let us know.
Thank you, again, Maruan, for continuing to fund the vm!
ilable at one time or another.
>
> I could be wrong, but Privacy knows GDPR while LEGAL knows the AL2, etc.
>
> > On Jan 16, 2025, at 8:26 AM, Tim Allison wrote:
> >
> > This is a really helpful delineation of the issues. Thank you, Maruan,
> for
> > thi
ccess in a wider context
> and may even need to reconsider whether we can use it at all, although it is
> very beneficial.
>
> Maybe we can have a chat with legal about that.
>
> BR
> Maruan
>
>
>
>
> On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote:
> >
s from
> your server, as the official govdocs1 sources do not expose the individual
> PDF files directly.
>
> Thanks for hosting these files in the past.
>
> Best regards,
> Stefan
>
> On 2025/01/09 01:36:59 Tim Allison wrote:
> > All,
> > We've gotten a handfu
;
> > this is unfortunate, but as this poses the risk of legal action not only to
> > the ASF but also to me as the host of the site, I think we should stop that.
> >
> > BR
> > Maruan
> >
> >> On 09.01.2025 at 02:37, Tim Allison wrote:
> >>
> >> All
All,
We've gotten a handful of takedown requests recently. I had initially
envisioned public sharing of files as a key component of our server. We can
still use the files and offer read access to fellow file researchers. I'm
not sure I want to deal with further takedown requests.
As an intermedi
ing up and
funding the corpora server!
Cheers,
Tim
On Sat, May 20, 2023 at 5:44 AM Andreas Lehmkuehler
wrote:
> Hi,
>
> On 19.05.23 at 17:25, Tim Allison wrote:
> > All,
> >
> >Tilman Hausherr mentioned that we might want to update the
> > common
All,
Tilman Hausherr mentioned that we might want to update the
common-crawl pdfs in our regression corpus. This proposal leaves the
bugtracker PDFs as they are.
For the CC-based PDFs, we could:
1) remove existing truncated pdfs
2) fold in newer untruncated PDFs from:
https://digitalcorpora.
As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch
of truncated files. We refetched some and put those under
commoncrawl3_refetched.
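For anyone who wants to flag likely-truncated fetches in a local copy of the
corpus, here is a minimal sketch. It assumes the 1MB cap lands at exactly
1,048,576 bytes, which is only a heuristic, so treat the hits as candidates
rather than a definitive list.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FindLikelyTruncated {
    // Assumed cap: the 1MB Common Crawl truncation limit, taken as 1,048,576 bytes.
    private static final long CAP_BYTES = 1024L * 1024L;

    public static void main(String[] args) throws IOException {
        // args[0] = root of a local copy of the common-crawl subtree
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> {
                     try {
                         // Files sitting exactly at the cap are the most likely truncation victims.
                         return Files.size(p) == CAP_BYTES;
                     } catch (IOException e) {
                         return false;  // unreadable files are skipped, not reported
                     }
                 })
                 .forEach(System.out::println);
        }
    }
}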
On Tue, Jul 26, 2022 at 2:58 PM Tim Allison wrote:
> We have ~1.9TB. But I'd skip cc_large because that's just a
173
text/x-vcalendar    1156
On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr
wrote:
> We were thinking something around 2TB of data with a good mix of excel,
> images, pdfs, text and powerpoints. So I guess a mix of everything.
>
>
>
> *From: *Tim Allison
> *Date: *Tuesday, Jul
What Nick said...
cc_large is a sample of some of the larger documents from
commoncrawl3_refetched.
If you want to give your pipeline a workout, I also recommend using the
MockParser that is available in the tika-core tests jar. That allows you
to instrument OOMs, timeouts, and system exits.
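A minimal sketch of driving one of those mock documents through a normal parse
call, for anyone wiring this into a pipeline test: it assumes a mock XML file
extracted from the tests jar is passed as the first argument and that the tests
jar is on the classpath so detection routes the file to MockParser. The exact
mock element vocabulary (write/hang/throw and friends) should be taken from the
test-documents/mock resources in that jar rather than from this sketch.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MockParserSmokeTest {
    public static void main(String[] args) {
        // Assumption: with the tika-core "tests" jar on the classpath, the mock
        // XML is routed to MockParser; otherwise construct
        // org.apache.tika.parser.mock.MockParser directly.
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            // args[0] = path to a mock XML file copied out of the tests jar
            parser.parse(is, new BodyContentHandler(-1), new Metadata(), new ParseContext());
            System.out.println("parsed cleanly: " + args[0]);
        } catch (Throwable t) {
            // A robust pipeline logs and moves on; genuine OOMs and System.exit()
            // calls can only be survived with process isolation (forked JVM,
            // tika-server, tika-batch, etc.).
            System.err.println("parse blew up as instrumented: " + t);
        }
    }
}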
Hi Albretch,
Thank you for the pointer. The PDFs that I packaged there were gathered
from various bug trackers [1] and [2]. I'm not necessarily against
gathering the Regents exams, but they would represent a different purpose.
Best,
Tim
[1] https://www.pdfa.org/a-new-
And, yes... the link does not yet work unless you're me or have my
login info... Should be up by tomorrow...
On Tue, Oct 19, 2021 at 9:36 AM Tim Allison wrote:
>
> All,
> I started an Apache Tika Community Meetup group:
> https://www.meetup.com/apache-tika-community/
>
All,
I started an Apache Tika Community Meetup group:
https://www.meetup.com/apache-tika-community/
Let's try this out for semi-regular meetups to talk about where we
are and where we're headed. The scope of these meetups should be
fairly broad to include all areas of file processing (including
Possibly in response to a recent PDF Days talk (?)[0], Micky Lindlar
asked on twitter if anyone had seen JPEG XL files in the wild[1].
I added jxl detection to Tika and re-detected all the files that had
been previously identified as "application/octet-stream". I found
~462 likely jxl files. I h
I _think_ it all works as it should. I also updated datasette and relaunched.
Onwards!
Cheers,
Tim
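For anyone who wants to repeat that re-detection against a local copy of the
files, a rough sketch follows. It assumes the corpus subtree is on local disk
and that the Tika version in use registers JPEG XL as image/jxl, so check that
name against your tika-mimetypes.xml before trusting the counts.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.tika.Tika;

public class RedetectJxl {
    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();              // uses the mime magic shipped with this Tika version
        Path root = Paths.get(args[0]);      // local copy of the files to re-check
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    // "image/jxl" is assumed to be the registered name for JPEG XL.
                    if ("image/jxl".equals(tika.detect(p.toFile()))) {
                        System.out.println(p);
                    }
                } catch (IOException e) {
                    System.err.println("could not read " + p + ": " + e.getMessage());
                }
            });
        }
    }
}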
cessing or packaging, please let me know.
Cheers,
Tim
On Wed, Mar 17, 2021 at 4:21 PM Tim Allison wrote:
>
> > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd
> > be of any interest.
>
> If there were a commandline or a Ja
> Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd be
> of any interest.
If there were a commandline or a Java SDK, I could run that next if
that'd be of any interest. :D
On Wed, Mar 17, 2021 at 3:28 PM Tim Allison wrote:
>
> Ah, I wasn'
> formats. It's what all the Adobe apps use to handle XMP in any file format
> that we encounter.
>
> Leonard
>
> On 3/17/21, 2:48 PM, "Tim Allison" wrote:
>
> Wait...I'm sorry...I'm wrong on the first point.
>
> 1) in Tika genera
ther metadata to XMP in our tika-xmp module, and
xmpcore is a dependency of Drew Noakes' metadata-extractor which is
critical.
On Wed, Mar 17, 2021 at 2:43 PM Tim Allison wrote:
>
> > >Isn't that why you are using the XMP Toolkit???
>
> Sorry, we may be talking about two d
l
wrote:
>
> >The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >
> > Isn't that why you are using the XMP Toolkit???
>
> Leonard
>
> On 3/17/21, 2:10 PM, "Tim Allison" wrote:
>
> > ARGH Please don't
"http://ns.fotoware.com/iptcxmp-custom/1.0/";
xmlns:fwu="http://ns.fotoware.com/iptcxmp-user/1.0/";
xmlns:dc="http://purl.org/dc/elements/1.1/";
xmlns:Iptc4xmpExt="http://iptc.org/std/Iptc4xmpExt/2008-02-29/";
photoshop:City="London"
> > various
> > PDF objects you're interested in I could come up with a quick
> > sample
> > for Tim.
> >
> > BR
> > Maruan
> >
> > On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> > > Hi Leo
t you're looking for?
> >
> Just getting those elements would be a great start. If you could also
> include the rest of the dictionary in which it was found (or at least the
> /Type and /Subtype keys, if present), that would be great!
>
> Leonard
>
> On 3/17/21, 1:
r, etc.
>
> Leonard
>
> On 3/17/21, 11:37 AM, "Tim Allison" wrote:
>
> All,
>
> I'm scraping XMPs out of our corpus and placing them here as standalone
> files:
>
>
> https://nam04.safelinks.protection.outlook.com
All,
I'm scraping XMPs out of our corpus and placing them here as standalone files:
https://corpora.tika.apache.org/base/xmps/
I've binned the files roughly based on the container file's mime
type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
The process is still running, and I vie
shortly...
> unless there’s a multi threading issue
Narrator: There is.
On Fri, Mar 12, 2021 at 6:23 AM Tim Allison wrote:
>
> I have no explanation for that unless there’s a multi threading issue. I only
> changed files in libvips...and added some other jpeg trackers.
>
> On Thu, Mar
I have no explanation for that unless there’s a multi threading issue. I
only changed files in libvips...and added some other jpeg trackers.
On Thu, Mar 11, 2021 at 11:10 PM Tilman Hausherr
wrote:
> Maybe related to my previous mail:
> When doing the PDFBox regression tests, the files
>
> bug_tr
Argh...my fault sorry. I just overwrote the libvips subdirectory.
Sorry! Will be more careful next time.
On Thu, Mar 11, 2021 at 2:27 PM Tilman Hausherr wrote:
>
> The file
>
> /bug_trackers/libvips/libvips-LINK-459-0.pdf
>
> is empty and has a date of March 9, 2021 (a few other files there
Appears to be a success. I also updated datasette, and the csv/json
links still work.
On Tue, Jan 26, 2021 at 6:59 AM Tim Allison wrote:
>
> And rebooting shortly...
And rebooting shortly...
Thanks to a recommendation from a user and the developer of datasette, I
configured the proxy correctly so that this now works:
https://corpora.tika.apache.org/datasette/
Make sure to include the final /. https://corpora.tika.apache.org/datasette
does not work.
Cheers,
Tim
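For the archives, a sketch of the kind of configuration described above. The
127.0.0.1:8001 backend is assumed from the port mentioned elsewhere in this
thread, corpora.db is a placeholder filename, and the base_url spelling differs
by datasette release (--config base_url:/datasette/ on older releases,
--setting base_url /datasette/ on newer ones), so treat this as a starting
point rather than the exact vhost config.

ProxyPreserveHost On
ProxyPass        /datasette/ http://127.0.0.1:8001/
ProxyPassReverse /datasette/ http://127.0.0.1:8001/

# datasette itself must be told it is served under /datasette/, however it is
# launched (the thread mentions Docker), e.g.:
#   datasette serve corpora.db --config base_url:/datasette/   (older releases)
#   datasette serve corpora.db --setting base_url /datasette/  (newer releases)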
Looks like no one is on. Fingers crossed for success...
Successfully ran updates and rebooted.
I also pulled the latest datasette, and it still has the same issues.
On Wed, Dec 9, 2020 at 6:49 AM Tim Allison wrote:
> Will run updates and reboot shortly.
>
Will run updates and reboot shortly.
I updated the documentation for running datasette:
https://cwiki.apache.org/confluence/display/TIKA/VirtualMachine
On Tue, Dec 8, 2020 at 7:56 PM Tim Allison wrote:
> Nick,
> See:
> https://github.com/simonw/datasette/issues/1091
>
> If we’re doing something wrong on our end
Nick,
See:
https://github.com/simonw/datasette/issues/1091
If we’re doing something wrong on our end, please let us know!
On Tue, Dec 8, 2020 at 2:23 PM Nick Burch wrote:
> Hi All
>
> I'm having some issues with the datasette instance on the vm. The main
> table pages are working, but csv/js
All,
Some updates...please see Peter Wyatt's recent article on the refreshing
of the bug tracker corpus:
https://twitter.com/PDFAssociation/status/1327237439732260865?s=20
* successfully upgraded and rebooted the server.
* finished running tika-eval's new FileProfile on the full corpus, and I'
All,
It looks like datasette now works with the proxy! The backing db is out
of date, though. So, I'm going to run a first pass with tika-eval's
FileProfile that will allow users to search by container file
size/mime/hash. Once that completes I'll restart datasette with that data,
and then we c
Will upgrade datasette while I'm at it. I _think_ the url issue has been
fixed. :fingers-crossed:
Please do take a look at the new bugtracker data and let me know what you
think.
Cheers,
Tim
On Thu, Nov 12, 2020 at 8:18 AM Tim Allison wrote:
> All,
>
> I'v
All,
I've finished a repull of issue tracker data for now. I'm going to run
updates on the vm and reboot in the next few hours.
Best,
Tim
Will repackage PDFs as I did before for the PDF enthusiasts in the crowd. :D
Have a great weekend!
Cheers,
Tim
On Fri, Nov 6, 2020 at 4:39 PM Tim Allison wrote:
> Files are updated under
> https://corpora.tika.apache.org/base/docs/bug_trackers/
>
> I updated the REA
Files are updated under
https://corpora.tika.apache.org/base/docs/bug_trackers/
I updated the README:
https://corpora.tika.apache.org/base/docs/bug_trackers/README.txt
Let me know if you find any surprises.
On Fri, Nov 6, 2020 at 10:05 AM Tim Allison wrote:
> All,
> With many tha
All,
With many thanks to Apache's infra, I was unbanned after a few too many
requests to Apache's JIRA/bugzilla.
I'm currently doing some post processing cleanup on the refreshed
corpus. I'm planning to remove .diff files and zero-byte files. If there
are any objections, let me know soon.
T
hat is complete, I'll update the vm and restart.
Onwards!
Cheers,
Tim
On Mon, Oct 26, 2020 at 1:28 PM Tim Allison wrote:
> All,
>
> I plan on refreshing our bug tracker data over this week. I've added
> some new sources (thank you, Tilman, for the pointers!),
All,
I plan on refreshing our bug tracker data over this week. I've added
some new sources (thank you, Tilman, for the pointers!), and I'd like to
refresh the earlier sources.
If anyone has any bugzilla/jira/github/gitlab based crawlers we should
add, please let me know.
I plan to wipe ou
All,
Peter Wyatt recently wrote an article about our bug tracker corpus for
the PDF Association. W00t! Thank you, Peter!
https://twitter.com/PDFAssociation/status/1304395379488759819
Cheers,
Tim
Looks like no one is on. I'm going to run updates and reboot shortly.
ow how I can improve the documentation or anything else.
Cheers,
Tim
On Mon, Aug 10, 2020 at 9:47 AM Tim Allison wrote:
> Working on this now. Will post update when documentation is ready.
>
> On Wed, Aug 5, 2020 at 3:30 PM Andreas Lehmkuehler
> wrote:
>
>>
Working on this now. Will post update when documentation is ready.
On Wed, Aug 5, 2020 at 3:30 PM Andreas Lehmkuehler wrote:
> On 05.08.20 at 17:19, Tim Allison wrote:
> > Y, that's pretty close.
> >
> > Unfortunately, I'm away from my dev environment and can
Y, that's pretty close.
Unfortunately, I'm away from my dev environment and can't access the vm to
confirm. I don't think I got the list of pdf files into a location where
you can see it. With enough permissions, you should be able to see it in
/data/work (???)...argh.
I'm sorry for not getting
Thanks Tim.
> Will do. As for files, in my case they are from a customer and I don't want
> to share them.
> Thanks
>
> On Wed, Jul 29, 2020, 19:06 Tim Allison wrote:
>
>> Looks like I identified that one i
>> <https://bz.apache.org/bugzilla/show_bug.cg
, please
>
> Thanks in advance!
>
> On 28.07.20 at 12:45, Tim Allison wrote:
> > Y. I can run these today
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the
Y. I can run these today
On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler
wrote:
> Hi,
>
> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> SNAPSHOT) on
> our new box? Has anyone had the cycles to prepare something ready to
> start?
>
> If not, is there anything I can do
Upgraded, rebooted, restarted datasette. All looks good.
On Thu, Jul 23, 2020 at 7:34 AM Tim Allison wrote:
> If no one has any jobs running and no one objects, I’ll patch and reboot
> around 6pm UTC today.
>
> Cheers,
> Tim
>
If no one has any jobs running and no one objects, I’ll patch and reboot
around 6pm UTC today.
Cheers,
Tim
All,
I received a request to package some of the bugtracker data for easier
downloading. For this request, I've zipped the PDFs and FDFs from the
bugtrackers and made those zips available here:
https://corpora.tika.apache.org/base/packaged/pdfs/
I don't think we'll be inundated with one-of
Got distracted yesterday. Will reboot with updates in 30 minutes.
On Tue, Jun 23, 2020 at 12:45 PM Tim Allison wrote:
> I don't see anyone on or anything running. I'm going to run some upgrades
> and reboot in the next couple of hours.
>
> Let me know if you'd li
https://github.com/simonw/datasette/issues/865
On Wed, Jun 24, 2020 at 8:32 AM Tim Allison wrote:
> Y, that's what I'm doing.
>
> ProxyPreserveHost On
> ProxyPass /datasette http://x.y.z.q:
> ProxyPassReverse /datasette http://x.y.z.q:
>
> I star
;apply".
Given that full sql queries seem to work, let's leave this up as is to show
the datasette team what's going wrong?
On Wed, Jun 24, 2020 at 7:40 AM Nick Burch wrote:
> On Wed, 24 Jun 2020, Tim Allison wrote:
> > Thank you, Maruan!
> >
> > I’ll open a t
Thank you, Maruan!
I’ll open a ticket with datasette.
Is it OK if we roll back to opening a port for datasette so that it works
until we can get a fix?
On Wed, Jun 24, 2020 at 4:57 AM Maruan Sahyoun
wrote:
>
>
>
> > Done. Thank you!
> >
> > https://corpora.tika.apache.org/datasette
>
> thx - ther
Done. Thank you!
https://corpora.tika.apache.org/datasette
On Tue, Jun 23, 2020 at 4:09 PM Maruan Sahyoun
wrote:
>
> Could we possibly proxy everything using apache and not expose additional
> ports? In addition use HTTPS throughout?
>
> BR
> Maruan
>
> > All,
> >
> > I created one big sqlit
All,
I created one big sqlite db with the tika-eval profile tables and the mime
tables from both tika and 'file', and loaded it into datasette running via Docker.
http://corpora.tika.apache.org:8001/
How should I document example sql...on our wiki? What queries do we
want? What do people want to know?
The c
I don't see anyone on or anything running. I'm going to run some upgrades
and reboot in the next couple of hours.
Let me know if you'd like more of a heads up or if there is a better
notification/procedure for this.
Cheers,
Tim
I should clarify that I made the Profile db file searchable:
https://corpora.tika.apache.org/base/metadata/tika-eval/tika-eval-1.24.1.mv.db.gz.
I didn't load the mime csvs, but I can certainly do that as well.
On Wed, Jun 17, 2020 at 11:33 AM Tim Allison wrote:
> All,
>
> I
Are there any objections to opening a port and launching this on our
server? If no objections, any preference for port?
Cheers,
Tim
On Wed, Jun 17, 2020 at 9:04 AM Tim Allison wrote:
> Downloading the entire db and then running it locally with unfamiliar code
> i
Downloading the entire db and then running it locally with unfamiliar code
isn’t easy enough?!
But seriously, will look into Datasette. Thank you!
Happy to set up Postgres as well.
On Wed, Jun 17, 2020 at 8:19 AM Nick Burch wrote:
> Hi All
>
> As I understand it (which might be wrong!), Tim is
All,
I added minimal README.txt|html files under
https://corpora.tika.apache.org/base/metadata.
Please let me know what you think.
I'm currently running Tika for another mime table.
Cheers,
Tim
+1 and thank you!
On Tue, Jun 16, 2020 at 2:45 AM Maruan Sahyoun
wrote:
> Hi,
>
> there were several Linux updates installed to corpora.tika.apache.org. In
> order to finish the installation there needs to be a
> reboot of the server.
>
> I'm planning that for June 17, 2020 06:00 UTC.
>
> If tha
Or
> > > > > > for /usr/share/metadata too?
> > > > > >
> > > > > > BR
> > > > > > Maruan
> > > > > > > On Wed, Jun 10, 2020 at 2:44 PM Maruan Sahyoun <
> > > > > > > sahy...@filea
> > > > > > On Wed, Jun 10, 2020 at 2:44 PM Maruan Sahyoun <
> > > > > > sahy...@fileaffairs.de> wrote:
> > > > > >
> > > > > > > Separate question...
> > > > > > >
> > > > > > > The 6TB drive is mount
; the
> > > > > > data there even though that is, um, non-traditional.
> > > > > >
> > > > > > Should we move /home to / and mount the 6TB drive to, say, /data.
> > > > > > Then we
home, which is why I initially put the
> > > > data there even though that is, um, non-traditional.
> > > >
> > > > Should we move /home to / and mount the 6TB drive to, say, /data.
> > > > Then we could link the doc
> > could link the docs under /usr/share/corpora.
>
> feel free to move to what best suits your needs.
>
> BR
> Maruan
>
> >
> > On Wed, Jun 10, 2020 at 2:01 PM Tim Allison wrote:
> >
> > > Thank you, Maruan!
> > >
> > >
Tim Allison wrote:
> Thank you, Maruan!
>
> I'm moving the data over now.
>
> We should add some other folders, e.g. metadata/.
>
> Do we want
>
> /usr/share/corpora/docs
> /usr/share/corpora/metadata
>
> or
>
> /usr/share/corpora
> /usr/shar
Thank you, Maruan!
I'm moving the data over now.
We should add some other folders, e.g. metadata/.
Do we want
/usr/share/corpora/docs
/usr/share/corpora/metadata
or
/usr/share/corpora
/usr/share/metadata
On Wed, Jun 10, 2020 at 1:47 PM Maruan Sahyoun
wrote:
> Hi,
>
> I've done the follow
/home concerned me as well
+1 to /usr
On Fri, Jun 5, 2020 at 12:19 PM Maruan Sahyoun
wrote:
>
> > I think so? Unless there are any objections?
> >
> > On Fri, Jun 5, 2020 at 10:58 AM Maruan Sahyoun
> > wrote:
> >
> > > Hi,
> > >
> > > > All,
> > > >
> > > > I moved the data over over the la
I think so? Unless there are any objections?
On Fri, Jun 5, 2020 at 10:58 AM Maruan Sahyoun
wrote:
> Hi,
>
> > All,
> >
> > I moved the data over during the last few days. For now, it is in
> >
> > /home/work/data/docs
>
> will the files stay in that location? Can point the web browsing to tha
All,
I moved the data over during the last few days. For now, it is in
/home/work/data/docs
I ran 'file' on all the files and put a table of mime types in
/home/work/metadata.
Please don't tell anyone I didn't run Tika for file detection...just yet.
:D
I think I chgrp'd all the files to
@Compress devs,
We recently transitioned our vm to a new provider, and we're improving
the ASF-itude of this project. We recently started a new email list for
those interested in guiding and using the 2 TB of files that we've gathered
so far.
Please join corpora-dev@tika.apache.org if you ha
All,
If you have an interest in guiding the ongoing development of the
regression corpus vm, please join the new mailing list:
corpora-dev@tika.apache.org via the usual means:
corpora-dev-subscr...@tika.apache.org
Unless there are objections, we can continue to use the regular Tika JIRA
to tr
All,
With the transition from our former vm provider to a new vm, we have the
chance to start fresh. Adding a mailing list is one of the ways we’re
improving/ASF’ifying the processes around the management of our regression
testing vm and its corpora.
We welcome the world to participate, whether o