Re: [Offline-l] [NEW] first releases of openZIM tool warc2zim

Emmanuel Engelhart Tue, 03 Nov 2020 00:31:49 -0800

Hi Amirouche

On 01.11.20 08:52, Amirouche Boubekki wrote:
> I am working on a search engine (unlike sphinx or elastic search, more
> like bing or google), I was planning to use .zim files to feed the
> index, the problem is there is no systematic way to find the original
> URL of the documents.


Yes, with MediaWiki-based ZIM files, to a very large extent it
should be the same articleId.

> I am wondering whether one of the following will be possible for kiwix
> project to do:
> 
> A) Add a <meta url="https://foobar"> in the html inside the .zim files,

Yes, but such a link is not always possible: many ZIMs are mash-ups.
That said, it is trivial to add such a meta node to the HTML in
MWoffliner.
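
On the indexer side, reading such a node back could look roughly like
the sketch below (Python, standard library only). It assumes the node
would be emitted exactly as proposed in A), with a `url` attribute; the
class and function names are only illustrative.

```python
from html.parser import HTMLParser


class MetaUrlParser(HTMLParser):
    """Collect the value of a hypothetical <meta url="..."> node."""

    def __init__(self):
        super().__init__()
        self.original_url = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep the first url seen.
        if tag == "meta" and self.original_url is None:
            url = dict(attrs).get("url")
            if url:
                self.original_url = url


def find_original_url(html):
    """Return the original URL stored in the page, or None if absent."""
    parser = MetaUrlParser()
    parser.feed(html)
    return parser.original_url
```

Pages from mash-up ZIMs would simply yield None here, so the indexer
still needs a fallback for documents without an original URL.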

> A bis) Add a metadata field per document with the original url inside
> the .zim files,

Yes, I was sure I had such a ticket open, at least for MWoffliner, but I
cannot find it anymore. Both approaches are probably doable; you should
open a ticket in the MWoffliner repository.

> B) Publish .warc files of wikipedia, stackoverflow dumps etc... so
> that people like myself can re-use those. WARC files are more useful
> than .zim files but still less user friendly than the following
> proposal...

Why are WARC files more useful, besides the fact that they hold an exact
copy of the original Web page?

> C) ... One last alternative, is to pivot the custom .zim file storage
> to an okvs [0] like rocksdb or sqlite lsm extension [1]. The idea is
> to make it very easy to access the kiwix dumps from many programming
> languages unlike the current approach that is limited to C++ and
> Python. Also, it will be easier to extend a given dump with custom
> fields, unlike the current .zim which seems to be read-only.

There are bindings for Go and JavaScript as well. Which kind of
additional binding do you need? Creating a binding for libzim takes from
a few days to 2 weeks (depending on how you want to make it). This is
easy. If you want to read the content of Wikipedia, this is the easiest
solution (if the current tools are not enough). zimdump also lets you
extract the content extremely efficiently from the command line.

I personally think we should create a fuse driver.

Yes, the ZIM format is read-only. If you want to write content, then you
definitely need a rw DB, whatever its name. That does not mean you have
to replace the ZIM; the two can be complementary. Another point is that
if you need a DB, that basically means you have an additional source of
information to deal with, something you have not talked about.
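
To illustrate the complementary approach, here is a minimal sketch
(Python, standard library sqlite3) of a small rw database kept next to
the read-only ZIM and keyed by the path of each ZIM entry. The table
layout, the function names and the example paths are assumptions made
for the illustration, not anything libzim provides.

```python
import sqlite3


def open_sidecar(db_path=":memory:"):
    """Open (or create) the rw database holding the custom fields."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extra ("
        " zim_path TEXT PRIMARY KEY,"
        " original_url TEXT)"
    )
    return conn


def set_original_url(conn, zim_path, original_url):
    """Store or overwrite the original URL recorded for one ZIM entry."""
    conn.execute(
        "INSERT OR REPLACE INTO extra (zim_path, original_url) VALUES (?, ?)",
        (zim_path, original_url),
    )
    conn.commit()


def get_original_url(conn, zim_path):
    """Return the recorded URL for a ZIM entry, or None if unknown."""
    row = conn.execute(
        "SELECT original_url FROM extra WHERE zim_path = ?", (zim_path,)
    ).fetchone()
    return row[0] if row else None
```

The ZIM file itself stays untouched; only the side database grows as
custom fields are added.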

Please open a ticket in the relevant repo if you need anything from
libzim or MWoffliner.

Regards
Emmanuel

-- 
Kiwix - Wikipedia Offline & more
* Web: https://kiwix.org/
* Twitter: https://twitter.com/KiwixOffline
* Wiki: https://wiki.kiwix.org/

_______________________________________________
Offline-l mailing list
Offline-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/offline-l
