Hi Amirouche -- is this for an offline search?  Would love to read more
about it.

On Sun, Nov 1, 2020 at 6:36 AM Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> Hello,
>
>
> I am working on a search engine (unlike Sphinx or Elasticsearch, more
> like Bing or Google). I was planning to use .zim files to feed the
> index; the problem is that there is no systematic way to find the
> original URL of the documents.
>
> I am wondering whether the Kiwix project could do one of the
> following:
>
> A) Add a <meta url="https://foobar"> in the html inside the .zim files,
>
> A bis) Add a metadata field per document with the original url inside
> the .zim files,
>
> B) Publish .warc files of wikipedia, stackoverflow dumps etc... so
> that people like myself can re-use those. WARC files are more useful
> than .zim files but still less user friendly than the following
> proposal...
>
> C) ... One last alternative is to move the custom .zim file storage
> to an OKVS [0] such as RocksDB or the SQLite LSM extension [1]. The
> idea is to make it very easy to access the Kiwix dumps from many
> programming languages, unlike the current approach, which is limited
> to C++ and Python. It would also be easier to extend a given dump with
> custom fields, unlike the current .zim, which seems to be read-only.
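
For illustration only, here is a minimal sketch of how an indexer might read the tag proposed in option A. The `<meta url="...">` tag is the hypothetical format from the proposal above, not something ZIM files are known to contain, and `MetaUrlParser` and `extract_original_url` are names made up for this example. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class MetaUrlParser(HTMLParser):
    """Collect the value of a hypothetical <meta url="..."> tag."""

    def __init__(self):
        super().__init__()
        self.original_url = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep the first url seen.
        if tag == "meta" and self.original_url is None:
            url = dict(attrs).get("url")
            if url:
                self.original_url = url


def extract_original_url(html):
    """Return the original URL recorded in the document, or None."""
    parser = MetaUrlParser()
    parser.feed(html)
    return parser.original_url
```

An indexer could call `extract_original_url(html)` on each article body and fall back to some other source when it returns `None`.
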
>
> Let me know what you think :-)
>
> Thanks in advance!
>
> [0] https://en.wikipedia.org/wiki/Ordered_Key-Value_Store
> [1] https://github.com/sqlite/sqlite/tree/master/ext/lsm1
>
> On Thu, Oct 29, 2020 at 12:14, Emmanuel Engelhart <kel...@kiwix.org>
> wrote:
> >
> > Hi
> >
> > I'm very proud to announce the release of our new tool: warc2zim.
> >
> > Warc2zim is a command line tool for GNU/Linux and macOS that converts
> > a WARC file to a ZIM file. WARC being a widely used storage format in
> > the archiving world, warc2zim offers new opportunities to reuse
> > WARC-stored data and benefit from the whole feature set of the ZIM file
> > format and readers like Kiwix.
> >
> > The tool was built in close collaboration with the
> > Webrecorder team. It is one milestone of a bigger project called Zimit,
> > a project we run with the sponsorship of the Mozilla Foundation.
> >
> > ZIM files created using that process work slightly differently from
> > the traditional ones (though the ZIM specification is formally
> > respected). We are currently working to update all the Kiwix readers,
> > but it already works well with Kiwix Serve.
> >
> > The tool is distributed at:
> > https://pypi.org/project/warc2zim/
> >
> > More news to come about warc2zim and Zimit in January 2020.
> >
> > Happy scraping!
> > Happy coding!
> > Happy offline reading!
> >
> > Emmanuel
> >
> > --
> > Kiwix - Wikipedia Offline & more
> > * Web: https://kiwix.org/
> > * Twitter: https://twitter.com/KiwixOffline
> > * Wiki: https://wiki.kiwix.org/
> >
> > _______________________________________________
> > Offline-l mailing list
> > Offline-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/offline-l
>
>
>
> --
> Amirouche ~ https://hyper.dev
>
>


-- 
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266
