java installs

2025-05-08 Thread Tim Allison
All, I just added temurin-17 and temurin-21 to the vm. I'm going to set the default to temurin-17. Best, Tim

Status of the public facing webserver

2025-05-08 Thread Tim Allison
All, The lazy consensus on https://issues.apache.org/jira/browse/LEGAL-696 was to keep the vm running, but to limit access to committers. The public file server will remain shut down. If anyone has any other options, please let us know. Thank you, again, Maruan, for continuing to fund the vm!

Re: Where do we go from here? WAS: Turning off public access to the regression corpora?

2025-01-16 Thread Tim Allison
ilable at one time or another. > > I could be wrong, but Privacy knows GDPR while LEGAL knows the AL2, etc. > > > On Jan 16, 2025, at 8:26 AM, Tim Allison wrote: > > > > This is a really helpful delineation of the issues. Thank you, Maruan, > for > > thi

Re: Where do we go from here? WAS: Turning off public access to the regression corpora?

2025-01-16 Thread Tim Allison
ccess in a wider context > and even need to reconsider if we can use it at all although being very > beneficial. > > Maybe we can have a chat with legal about that. > > BR > Maruan > > > > > On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote: > >

Where do we go from here? WAS: Turning off public access to the regression corpora?

2025-01-14 Thread Tim Allison
s from > your server, as the official govdocs1 sources do not expose the single > PDF files directly. > > Thanks for hosting these files in the past. > > Best regards, > Stefan > > On 2025/01/09 01:36:59 Tim Allison wrote: > > All, > > We've gotten a handfu

Re: Turning off public access to the regression corpora?

2025-01-09 Thread Tim Allison
> > > this is unfortunate but as this is posing the risk of legal actions to > the ASF but also to me hosting the site I think we should stop that. > > > > BR > > Maruan > > > >> On 09.01.2025 at 02:37, Tim Allison wrote: > >> > >> All

Fwd: Turning off public access to the regression corpora?

2025-01-08 Thread Tim Allison
All, We've gotten a handful of takedown requests recently. I had initially envisioned public sharing of files as a key component of our server. We can still use the files and offer read access to fellow file researchers. I'm not sure I want to deal with further takedown requests. As an intermedi

Re: Refreshing the common crawl-derived PDFs in our regression corpus?

2023-07-14 Thread Tim Allison
ing up and funding the corpora server! Cheers, Tim On Sat, May 20, 2023 at 5:44 AM Andreas Lehmkuehler wrote: > Hi, > > On 19.05.23 at 17:25, Tim Allison wrote: > > All, > > > >Tilman Hausherr mentioned that we might want to update the > > common

Refreshing the common crawl-derived PDFs in our regression corpus?

2023-05-19 Thread Tim Allison
All, Tilman Hausherr mentioned that we might want to update the common-crawl pdfs in our regression corpus. This proposal leaves the bugtracker PDFs as they are. For the CC-based PDFs, we could: 1) remove existing truncated pdfs 2) fold in newer untruncated PDFs from: https://digitalcorpora.

Re: Datasets for testing large number of attachments

2022-07-26 Thread Tim Allison
As a warning, tho, Common Crawl truncates files at 1MB, so we have a bunch of truncated files. We refetched some and put those under commoncrawl3_refetched. On Tue, Jul 26, 2022 at 2:58 PM Tim Allison wrote: > We have ~1.9TB. But I'd skip cc_large because that's just a
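One rough way to spot the truncated Common Crawl files mentioned above is to look for files whose size sits exactly at the limit. This is only a sketch: it assumes the "1MB" limit means 1,048,576 bytes, and the function name is made up for illustration. A file of exactly that size is a strong hint, not proof, of truncation.

```python
from pathlib import Path

# Assumption: the 1MB truncation limit is 1 MiB (1,048,576 bytes).
TRUNCATION_LIMIT = 1024 * 1024


def likely_truncated(root, limit=TRUNCATION_LIMIT):
    """Return files under root whose size is exactly the truncation limit.

    Hypothetical helper, not part of any corpus tooling.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_size == limit
    )
```

Pointing this at a directory of Common Crawl files would give a candidate list for refetching.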

Re: Datasets for testing large number of attachments

2022-07-26 Thread Tim Allison
173 text/x-vcalendar1156 On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr wrote: > We were thinking something around 2TB of data with a good mix of excel, > images, pdfs, text and powerpoints. So I guess a mix of everything. > > > > *From: *Tim Allison > *Date: *Tuesday, Jul

Re: Datasets for testing large number of attachments

2022-07-26 Thread Tim Allison
What Nick said... cc_large is a sample of some of the larger documents from commoncrawl3_refetched. If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar. That allows you to instrument an OOM and timeouts and system exits

Re: regarding the data bank of test PDF files (pdfs_202011) . . .

2022-06-29 Thread Tim Allison
Hi Albretch, Thank you for the pointer. The PDFs that I packaged there were gathered from various bug trackers [1] and [2]. I'm not necessarily against gathering the Regents exams, but they would represent a different purpose. Best, Tim [1] https://www.pdfa.org/a-new-

Re: Apache Tika Meetup group

2021-10-19 Thread Tim Allison
And, yes... the link does not yet work unless you're me or have my login info... Should be up by tomorrow... On Tue, Oct 19, 2021 at 9:36 AM Tim Allison wrote: > > All, > I started an Apache Tika Community Meetup group: > https://www.meetup.com/apache-tika-community/ >

Apache Tika Meetup group

2021-10-19 Thread Tim Allison
All, I started an Apache Tika Community Meetup group: https://www.meetup.com/apache-tika-community/ Let's try this out for semi-regular meetups to talk about where we are and where we're headed. The scope of these meetups should be fairly broad to include all areas of file processing (including

JXLs from Common Crawl CC-MAIN-2021-31

2021-10-01 Thread Tim Allison
Possibly in response to a recent PDF Days talk (?)[0], Micky Lindlar asked on twitter if anyone had seen JPEG XL files in the wild[1]. I added jxl detection to Tika and re-detected all the files that had been previously identified as "application/octet-stream". I found ~462 likely jxl files. I h
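For anyone who wants to repeat this kind of re-detection outside Tika, a minimal magic-byte check is sketched below. It is not Tika's detector. It relies on the two signatures for JPEG XL: a bare codestream starts with FF 0A, and the ISOBMFF-style container starts with a 12-byte signature box. The two-byte codestream signature is short, so matches are only "likely" jxl files.

```python
# Signatures for JPEG XL files.
JXL_CODESTREAM_MAGIC = bytes([0xFF, 0x0A])
JXL_CONTAINER_MAGIC = bytes(
    [0x00, 0x00, 0x00, 0x0C, 0x4A, 0x58, 0x4C, 0x20, 0x0D, 0x0A, 0x87, 0x0A]
)


def looks_like_jxl(header: bytes) -> bool:
    """Return True if the leading bytes match either JPEG XL signature.

    Hypothetical helper for illustration; the 2-byte codestream magic
    can produce false positives on arbitrary binary data.
    """
    return header.startswith(JXL_CONTAINER_MAGIC) or header.startswith(
        JXL_CODESTREAM_MAGIC
    )
```

Reading the first 12 bytes of each "application/octet-stream" file and passing them to looks_like_jxl is enough to produce a candidate list.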

Ran upgrades and rebooted server

2021-09-14 Thread Tim Allison
I _think_ it all works as it should. I also updated datasette and relaunched. Onwards! Cheers, Tim

Re: XMPs...all you could possibly want...and more!

2021-03-19 Thread Tim Allison
cessing or packaging, please let me know. Cheers, Tim On Wed, Mar 17, 2021 at 4:21 PM Tim Allison wrote: > > > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd > > be of any interest. > > If there were a commandline or a Ja

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
> Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd be > of any interest. If there were a commandline or a Java SDK, I could run that next if that'd be of any interest. :D On Wed, Mar 17, 2021 at 3:28 PM Tim Allison wrote: > > Ah, I wasn'

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
> formats. It's what all the Adobe apps use to handle XMP in any file format > that we encounter. > > Leonard > > On 3/17/21, 2:48 PM, "Tim Allison" wrote: > > Wait...I'm sorry...I'm wrong on the first point. > > 1) in Tika genera

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
ther metadata to XMP in our tika-xmp module, and xmpcore is a dependency of Drew Noakes' metadata-extractor which is critical. On Wed, Mar 17, 2021 at 2:43 PM Tim Allison wrote: > > >Isn't that why are you using the XMP Toolkit??? > > Sorry, we may be talking about two d

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
l wrote: > > >The other thing is that I wanted to scrape xmp out of files beyond PDFs. > > > Isn't that why are you using the XMP Toolkit??? > > Leonard > > On 3/17/21, 2:10 PM, "Tim Allison" wrote: > > > ARGH Please don't

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
"http://ns.fotoware.com/iptcxmp-custom/1.0/"; xmlns:fwu="http://ns.fotoware.com/iptcxmp-user/1.0/"; xmlns:dc="http://purl.org/dc/elements/1.1/"; xmlns:Iptc4xmpExt="http://iptc.org/std/Iptc4xmpExt/2008-02-29/"; photoshop:City="London"

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
> > various > > PDF objects you're interested in I could come up with a quick > > sample > > for Tim. > > > > BR > > Maruan > > > > On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote: > > > Hi Leo

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
t you're looking for? > > > Just getting those elements would be a great start. If you could also > include the rest of the dictionary in which it was found (or at least the > /Type and /Subtype keys, if present) would be great! > > Leonard > > On 3/17/21, 1:

Re: XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
r, etc. > > Leonard > > On 3/17/21, 11:37 AM, "Tim Allison" wrote: > > All, > > I'm scraping XMPs out of our corpus and placing them here as standalone > files: > > > https://nam04.safelinks.protection.outlook.com

XMPs...all you could possibly want...and more!

2021-03-17 Thread Tim Allison
All, I'm scraping XMPs out of our corpus and placing them here as standalone files: https://corpora.tika.apache.org/base/xmps/ I've binned the files roughly based on the container file's mime type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/ The process is still running, and I vie

running updates and rebooting the vm

2021-03-12 Thread Tim Allison
shortly...

Re: DSS-2058-3.pdf and DSS-2058-4.pdf

2021-03-12 Thread Tim Allison
> unless there’s a multi threading issue Narrator: There is. On Fri, Mar 12, 2021 at 6:23 AM Tim Allison wrote: > > I have no explanation for that unless there’s a multi threading issue. I only > changed files in libvips...and added some other jpeg trackers. > > On Thu, Mar

Re: DSS-2058-3.pdf and DSS-2058-4.pdf

2021-03-12 Thread Tim Allison
I have no explanation for that unless there’s a multi threading issue. I only changed files in libvips...and added some other jpeg trackers. On Thu, Mar 11, 2021 at 11:10 PM Tilman Hausherr wrote: > Maybe related to my previous mail: > When doing the PDFBox regression tests, the files > > bug_tr

Re: files changed

2021-03-11 Thread Tim Allison
Argh...my fault sorry. I just overwrote the libvips subdirectory. Sorry! Will be more careful next time. On Thu, Mar 11, 2021 at 2:27 PM Tilman Hausherr wrote: > > The file > > /bug_trackers/libvips/libvips-LINK-459-0.pdf > > is empty and has a date of March 9, 2021 (a few other files there

Re: Updating server

2021-01-26 Thread Tim Allison
Appears to be a success. I also updated datasette, and the csv/json links still work. On Tue, Jan 26, 2021 at 6:59 AM Tim Allison wrote: > > And rebooting shortly...

Updating server

2021-01-26 Thread Tim Allison
And rebooting shortly...

datasette configuration fixed

2021-01-11 Thread Tim Allison
Thanks to a recommendation from a user and the developer of datasette, I configured the proxy correctly so that this now works: https://corpora.tika.apache.org/datasette/ Make sure to include the final /. https://corpora.tika.apache.org/datasette does not work. Cheers, Tim

updating and rebooting in a few minutes

2021-01-06 Thread Tim Allison
Looks like no one is on. Fingers crossed for success...

Re: reboot

2020-12-09 Thread Tim Allison
Successfully ran updates and rebooted. I also pulled the latest datasette, and it still has the same issues. On Wed, Dec 9, 2020 at 6:49 AM Tim Allison wrote: > Will run updates and reboot shortly. >

reboot

2020-12-09 Thread Tim Allison
Will run updates and reboot shortly.

Re: Datasette instance problems when proxied to main site?

2020-12-09 Thread Tim Allison
I updated the documentation for running datasette: https://cwiki.apache.org/confluence/display/TIKA/VirtualMachine On Tue, Dec 8, 2020 at 7:56 PM Tim Allison wrote: > Nick, > See: > https://github.com/simonw/datasette/issues/1091 > > If we’re doing something wrong on our end

Re: Datasette instance problems when proxied to main site?

2020-12-08 Thread Tim Allison
Nick, See: https://github.com/simonw/datasette/issues/1091 If we’re doing something wrong on our end, please let us know! On Tue, Dec 8, 2020 at 2:23 PM Nick Burch wrote: > Hi All > > I'm having some issues with the datasette instance on the vm. The main > table pages are working, but csv/js

Updates/status

2020-11-13 Thread Tim Allison
All, Some updates...please see Peter Wyatt's recent article on the refreshing of the bug tracker corpus: https://twitter.com/PDFAssociation/status/1327237439732260865?s=20 * successfully upgraded and rebooted the server. * finished running tika-eval's new FileProfile on the full corpus, and I'

datasette

2020-11-12 Thread Tim Allison
All, It looks like datasette now works with the proxy! The backing db is out of date, though. So, I'm going to run a first pass with tika-eval's FileProfile that will allow users to search by container file size/mime/hash. Once that completes I'll restart datasette with that data, and then we c

Re: server reboot in the next few hours

2020-11-12 Thread Tim Allison
Will upgrade datasette while I'm at it. I _think_ the url issue has been fixed. :fingers-crossed: Please do take a look at the new bugtracker data and let me know what you think. Cheers, Tim On Thu, Nov 12, 2020 at 8:18 AM Tim Allison wrote: > All, > > I'v

server reboot in the next few hours

2020-11-12 Thread Tim Allison
All, I've finished a repull of issue tracker data for now. I'm going to run updates on the vm and reboot in the next few hours. Best, Tim

Re: updating bug corpus

2020-11-06 Thread Tim Allison
Will repackage PDFs as I did before for the PDF enthusiasts in the crowd. :D Have a great weekend! Cheers, Tim On Fri, Nov 6, 2020 at 4:39 PM Tim Allison wrote: > Files are updated under > https://corpora.tika.apache.org/base/docs/bug_trackers/ > > I updated the REA

Re: updating bug corpus

2020-11-06 Thread Tim Allison
Files are updated under https://corpora.tika.apache.org/base/docs/bug_trackers/ I updated the README: https://corpora.tika.apache.org/base/docs/bug_trackers/README.txt Let me know if you find any surprises. On Fri, Nov 6, 2020 at 10:05 AM Tim Allison wrote: > All, > With many tha

updating bug corpus

2020-11-06 Thread Tim Allison
All, With many thanks to Apache's infra, I was unbanned after a few too many requests to Apache's JIRA/bugzilla. I'm currently doing some post processing cleanup on the refreshed corpus. I'm planning to remove .diff files and zero-byte files. If there are any objections, let me know soon. T
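A dry-run listing of what that cleanup would touch could look like the sketch below. It only reports candidates and deletes nothing. The function name is hypothetical, and matching ".diff" case-insensitively is an assumption.

```python
from pathlib import Path


def cleanup_candidates(root):
    """List .diff files and zero-byte files under root without deleting them.

    Hypothetical helper; the case-insensitive suffix match is an assumption.
    """
    root = Path(root)
    candidates = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".diff" or path.stat().st_size == 0:
            candidates.append(path)
    return candidates
```

Reviewing the returned list before removing anything leaves room for objections.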

Re: updating bug tracker data

2020-10-26 Thread Tim Allison
hat is complete, I'll update the vm and restart. Onwards! Cheers, Tim On Mon, Oct 26, 2020 at 1:28 PM Tim Allison wrote: > All, > > I plan on refreshing our bug tracker data over this week. I've added > some new sources (thank you, Tilman, for the pointers!),

updating bug tracker data

2020-10-26 Thread Tim Allison
All, I plan on refreshing our bug tracker data over this week. I've added some new sources (thank you, Tilman, for the pointers!), and I'd like to refresh the earlier sources. If anyone has any bugzilla/jira/github/gitlab based crawlers we should add, please let me know. I plan to wipe ou

Bug-tracker corpus in the news

2020-09-11 Thread Tim Allison
All, Peter Wyatt recently wrote an article about our bug tracker corpus for the PDF Association. W00t! Thank you, Peter! https://twitter.com/PDFAssociation/status/1304395379488759819 Cheers, Tim

updating and rebooting shortly

2020-09-02 Thread Tim Allison
Looks like no one is on. I'm going to run updates and reboot shortly.

Re: Access to corpora server to run regression tests

2020-08-10 Thread Tim Allison
ow how I can improve the documentation or anything else. Cheers, Tim On Mon, Aug 10, 2020 at 9:47 AM Tim Allison wrote: > Working on this now. Will post update when documentation is ready. > > On Wed, Aug 5, 2020 at 3:30 PM Andreas Lehmkuehler > wrote: > >>

Re: Access to corpora server to run regression tests

2020-08-10 Thread Tim Allison
Working on this now. Will post update when documentation is ready. On Wed, Aug 5, 2020 at 3:30 PM Andreas Lehmkuehler wrote: > On 05.08.20 at 17:19, Tim Allison wrote: > > Y, that's pretty close. > > > > Unfortunately, I'm away from my dev environment and can't

Re: Access to corpora server to run regression tests

2020-08-05 Thread Tim Allison
Y, that's pretty close. Unfortunately, I'm away from my dev environment and can't access the vm to confirm. I don't think I got the list of pdf files into a location where you can see it. With enough permissions, you should be able to see it in /data/work (???)...argh. I'm sorry for not getting

Re: Error when parsing of Excel files

2020-07-29 Thread Tim Allison
Thanks Tim. > Will do, as for files, in my case they are from customer and I don't want > to share them. > Thanks > > On Wed, Jul 29, 2020, 19:06 Tim Allison wrote: > >> Looks like I identified that one i >> <https://bz.apache.org/bugzilla/show_bug.cg

Re: PDFBox regression tests?

2020-07-28 Thread Tim Allison
, please > > Thanks in advance! > > On 28.07.20 at 12:45, Tim Allison wrote: > > Y. I can run these today > > > > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler > > wrote: > > > >> Hi, > >> > >> is there any chance to run the

Re: PDFBox regression tests?

2020-07-28 Thread Tim Allison
Y. I can run these today On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler wrote: > Hi, > > is there any chance to run the PDFBox regression tests (2.0.20 vs. > SNAPSHOT) on > our new box? Does anyone have the cycles to prepare something ready to > start? > > If not, is there anything I can do

Re: Reboot

2020-07-24 Thread Tim Allison
Upgraded, rebooted, restarted datasette. All looks good. On Thu, Jul 23, 2020 at 7:34 AM Tim Allison wrote: > If no one has any jobs running and no one objects, I’ll patch and reboot > around 6pm UTC today. > > Cheers, > Tim >

Reboot

2020-07-23 Thread Tim Allison
If no one has any jobs running and no one objects, I’ll patch and reboot around 6pm UTC today. Cheers, Tim

packaged subsets

2020-06-26 Thread Tim Allison
All, I received a request to package some of the bugtracker data for easier downloading. For this request, I've zipped the PDFs and FDFs from the bugtrackers and made those zips available here: https://corpora.tika.apache.org/base/packaged/pdfs/ I don't think we'll be inundated with one-of

Re: software upgrades/restart

2020-06-24 Thread Tim Allison
Got distracted yesterday. Will reboot with updates in 30 minutes. On Tue, Jun 23, 2020 at 12:45 PM Tim Allison wrote: > I don't see anyone on or anything running. I'm going to run some upgrades > and reboot in the next couple of hours. > > Let me know if you'd li

Re: datasette is live

2020-06-24 Thread Tim Allison
https://github.com/simonw/datasette/issues/865 On Wed, Jun 24, 2020 at 8:32 AM Tim Allison wrote: > Y, that's what I'm doing. > > ProxyPreserveHost On > ProxyPass /datasette http://x.y.z.q: > ProxyPassReverse /datasette http://x.y.z.q: > > I star

Re: datasette is live

2020-06-24 Thread Tim Allison
"apply". Given that full sql queries seem to work, let's leave this up as is to show the datasette team what's going wrong? On Wed, Jun 24, 2020 at 7:40 AM Nick Burch wrote: > On Wed, 24 Jun 2020, Tim Allison wrote: > > Thank you, Maruan! > > > > I’ll open a t

Re: datasette is live

2020-06-24 Thread Tim Allison
Thank you, Maruan! I’ll open a ticket w datasette. Is it ok if we rollback to opening a port for datasette so that it works until we can get a fix? On Wed, Jun 24, 2020 at 4:57 AM Maruan Sahyoun wrote: > > > > > Done. Thank you! > > > > https://corpora.tika.apache.org/datasette > > thx - ther

Re: datasette is live

2020-06-23 Thread Tim Allison
Done. Thank you! https://corpora.tika.apache.org/datasette On Tue, Jun 23, 2020 at 4:09 PM Maruan Sahyoun wrote: > > Could we probably proxy everything using apache and not expose additional > ports? In addition use HTTPS throughout? > > BR > Maruan > > > All, > > > > I created one big sqlit

datasette is live

2020-06-23 Thread Tim Allison
All, I created one big sqlite db with tika-eval profile tables and mimes from tika and file and loaded them into datasette run via Docker. http://corpora.tika.apache.org:8001/ How should I document example sql...on our wiki? What queries do we want? What do people want to know? The c
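As one possible starting point for the "example sql" question, here is a sketch of a query that counts where the two mime detectors disagree. The table name mimes and the columns tika_mime and file_mime are assumptions made for illustration; the real schema in the datasette db may differ.

```python
import sqlite3


def mime_disagreements(conn):
    """Count files where the Tika mime and the `file` mime differ.

    Assumes a hypothetical table mimes(path, tika_mime, file_mime);
    the actual table and column names on the server may be different.
    """
    sql = """
        SELECT tika_mime, file_mime, COUNT(*) AS n
        FROM mimes
        WHERE tika_mime <> file_mime
        GROUP BY tika_mime, file_mime
        ORDER BY n DESC, tika_mime, file_mime
    """
    return conn.execute(sql).fetchall()
```

The same SELECT statement could be pasted into datasette's SQL box once the names are adjusted to the real tables.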

software upgrades/restart

2020-06-23 Thread Tim Allison
I don't see anyone on or anything running. I'm going to run some upgrades and reboot in the next couple of hours. Let me know if you'd like more of a heads up or if there is a better notification/procedure for this. Cheers, Tim

Re: Try datasette for browsing corpa sql reports?

2020-06-17 Thread Tim Allison
I should clarify that I made the Profile db file searchable: https://corpora.tika.apache.org/base/metadata/tika-eval/tika-eval-1.24.1.mv.db.gz. I didn't load the mimes csvs, but I can certainly do that as well. On Wed, Jun 17, 2020 at 11:33 AM Tim Allison wrote: > All, > > I

Re: Try datasette for browsing corpa sql reports?

2020-06-17 Thread Tim Allison
Are there any objections to opening a port and launching this on our server? If no objections, any preference for port? Cheers, Tim On Wed, Jun 17, 2020 at 9:04 AM Tim Allison wrote: > Downloading the entire db and then running it locally with unfamiliar code > i

Re: Try datasette for browsing corpa sql reports?

2020-06-17 Thread Tim Allison
Downloading the entire db and then running it locally with unfamiliar code isn’t easy enough?! But seriously, will look into Datasette. Thank you! Happy to set up Postgres as well. On Wed, Jun 17, 2020 at 8:19 AM Nick Burch wrote: > Hi All > > As I understand it (which might be wrong!), Tim is

minimal metadata

2020-06-16 Thread Tim Allison
All, I added minimal README.txt|html files under https://corpora.tika.apache.org/base/metadata. Please let me know what you think. I'm currently running Tika for another mime table. Cheers, Tim

Re: server reboot planned for June 17th, 2020 06:00 UTC

2020-06-16 Thread Tim Allison
+1 and thank you! On Tue, Jun 16, 2020 at 2:45 AM Maruan Sahyoun wrote: > Hi, > > there were several Linux updates installed to corpora.tika.apache.org. In > order to finish the installation there needs to be a > reboot of the server. > > I'm planning that for June 17, 2020 06:00 UTC. > > If tha

Re: Corpora server setup

2020-06-15 Thread Tim Allison
Or > > > for > > > > > > /usr/share/metadata too? > > > > > > > > > > > > BR > > > > > > Maruan > > > > > > > On Wed, Jun 10, 2020 at 2:44 PM Maruan Sahyoun < > > > > sahy...@filea

Re: Corpora server setup

2020-06-11 Thread Tim Allison
> > On Wed, Jun 10, 2020 at 2:44 PM Maruan Sahyoun < > > sahy...@fileaffairs.de> > > > > > wrote: > > > > > > > > > > > > Separate question... > > > > > > > > > > > > > > The 6TB drive is mount

Re: Corpora server setup

2020-06-11 Thread Tim Allison
> the > > > data > > > > > > there even though that is, um, non-traditional. > > > > > > > > > > > > Should we move /home to / and mount the 6TB drive to, say, /data. > > > Then > > > > > we > > >

Re: Corpora server setup

2020-06-10 Thread Tim Allison
home, which is why I initially put the > data > > > > there even though that is, um, non-traditional. > > > > > > > > Should we move /home to / and mount the 6TB drive to, say, /data. > Then > > > we > > > > could link the doc

Re: Corpora server setup

2020-06-10 Thread Tim Allison
> > could link the docs under /usr/share/corpora. > > feel free to move to what best suits your needs. > > BR > Maruan > > > > > On Wed, Jun 10, 2020 at 2:01 PM Tim Allison wrote: > > > > > Thank you, Maruan! > > > > > >

Re: Corpora server setup

2020-06-10 Thread Tim Allison
Tim Allison wrote: > Thank you, Maruan! > > I'm moving the data over now. > > We should add some other folders, e.g. metadata/. > > Do we want > > /usr/share/corpora/docs > /usr/share/corpora/metadata > > or > > /usr/share/corpora > /usr/shar

Re: Corpora server setup

2020-06-10 Thread Tim Allison
Thank you, Maruan! I'm moving the data over now. We should add some other folders, e.g. metadata/. Do we want /usr/share/corpora/docs /usr/share/corpora/metadata or /usr/share/corpora /usr/share/metadata On Wed, Jun 10, 2020 at 1:47 PM Maruan Sahyoun wrote: > Hi, > > I've done the follow

Re: data staging/rsync

2020-06-05 Thread Tim Allison
/home concerned me as well +1 to /usr On Fri, Jun 5, 2020 at 12:19 PM Maruan Sahyoun wrote: > > > I think so? Unless there are any objections? > > > > On Fri, Jun 5, 2020 at 10:58 AM Maruan Sahyoun > > wrote: > > > > > Hi, > > > > > > > All, > > > > > > > > I moved the data over over the la

Re: data staging/rsync

2020-06-05 Thread Tim Allison
I think so? Unless there are any objections? On Fri, Jun 5, 2020 at 10:58 AM Maruan Sahyoun wrote: > Hi, > > > All, > > > > I moved the data over over the last few days. For now, it is in > > > > /home/work/data/docs > > will the files stay in that location? Can point the web browsing to tha

data staging/rsync

2020-06-05 Thread Tim Allison
All, I moved the data over over the last few days. For now, it is in /home/work/data/docs I ran 'file' on all the files and put a table of mime types in /home/work/metadata. Please don't tell anyone I didn't run Tika for file detection...just yet. :D I think I chgrp'd all the files to
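The mime table step could be reproduced roughly as follows. This is a sketch, not the script that was actually run: it assumes the file command is on the PATH and uses its --mime-type and -b options; the function names are invented here. The detector is passed in as a parameter so that a Tika-based one could be swapped in later.

```python
import subprocess
from collections import Counter


def detect_with_file(path):
    """Ask file(1) for a MIME type; requires the `file` command on PATH."""
    result = subprocess.run(
        ["file", "--mime-type", "-b", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def tally_mimes(paths, detect=detect_with_file):
    """Return (mime, count) pairs, most frequent first.

    `detect` is any callable mapping a path to a mime string, so the
    detector can be replaced without changing the tally logic.
    """
    return Counter(detect(p) for p in paths).most_common()
```

Writing the resulting pairs out as a two-column file gives a table like the one placed in /home/work/metadata.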

[COMPRESS] Tika's regression corpus

2020-06-05 Thread Tim Allison
@Compress devs, We recently transitioned our vm to a new provider, and we're improving the ASF-itude of this project. We recently started a new email list for those interested in guiding and using the 2 TB of files that we've gathered so far. Please join corpora-dev@tika.apache.org if you ha

new mailing list for corpora vm

2020-06-05 Thread Tim Allison
All, If you have an interest in guiding the ongoing development of the regression corpus vm, please join the new mailing list: corpora-dev@tika.apache.org via the usual means: corpora-dev-subscr...@tika.apache.org Unless there are objections, we can continue to use the regular Tika JIRA to tr

Hello world!

2020-06-05 Thread Tim Allison
All, With the transition from our former vm provider to a new vm, we have the chance to start fresh. Adding a mailing list is one of the ways we’re improving/ASF’ifying the processes around the management of our regression testing vm and its corpora. We welcome the world to participate, whether o