Hi Amirouche -- is this for an offline search? Would love to read more about it.

On Sun, Nov 1, 2020 at 6:36 AM Amirouche Boubekki <amirouche.boube...@gmail.com> wrote:

> Hello,
>
> I am working on a search engine (unlike Sphinx or Elasticsearch, more
> like Bing or Google). I was planning to use .zim files to feed the
> index; the problem is that there is no systematic way to find the
> original URL of the documents.
>
> I am wondering whether one of the following would be possible for the
> Kiwix project to do:
>
> A) Add a <meta url="https://foobar"> in the HTML inside the .zim files,
>
> A bis) Add a metadata field per document with the original URL inside
> the .zim files,
>
> B) Publish .warc files of Wikipedia, Stack Overflow dumps, etc., so
> that people like myself can reuse those. WARC files are more useful
> than .zim files but still less user-friendly than the following
> proposal...
>
> C) ... One last alternative is to pivot the custom .zim file storage
> to an OKVS [0] like RocksDB or the SQLite LSM extension [1]. The idea is
> to make it very easy to access the Kiwix dumps from many programming
> languages, unlike the current approach, which is limited to C++ and
> Python. Also, it would be easier to extend a given dump with custom
> fields, unlike the current .zim, which seems to be read-only.
>
> Let me know what you think :-)
>
> Thanks in advance!
>
> [0] https://en.wikipedia.org/wiki/Ordered_Key-Value_Store
> [1] https://github.com/sqlite/sqlite/tree/master/ext/lsm1
>
> On Thu, Oct 29, 2020 at 12:14, Emmanuel Engelhart <kel...@kiwix.org>
> wrote:
> >
> > Hi
> >
> > I'm very proud to announce the release of our new tool: warc2zim.
> >
> > Warc2zim is a command-line tool for GNU/Linux and macOS that converts
> > a WARC file to a ZIM file. WARC being a widely used storage format in
> > the archiving world, warc2zim offers new opportunities to reuse
> > WARC-stored data and benefit from the whole feature set of the ZIM
> > file format and readers like Kiwix.
> >
> > The tool was built in close collaboration with the Webrecorder team.
> > It is one milestone of a bigger project called Zimit, a project we
> > run with the sponsorship of the Mozilla Foundation.
> >
> > ZIM files created this way work slightly differently from the
> > traditional ones (the ZIM specification is formally respected). We
> > are currently running an effort to update all the Kiwix readers, but
> > it already works well with Kiwix Serve.
> >
> > The tool is distributed at:
> > https://pypi.org/project/warc2zim/
> >
> > More news to come about warc2zim and Zimit in January 2020.
> >
> > Happy scraping!
> > Happy coding!
> > Happy offline reading!
> >
> > Emmanuel
> >
> > --
> > Kiwix - Wikipedia Offline & more
> > * Web: https://kiwix.org/
> > * Twitter: https://twitter.com/KiwixOffline
> > * Wiki: https://wiki.kiwix.org/
> >
> > _______________________________________________
> > Offline-l mailing list
> > Offline-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/offline-l
>
> --
> Amirouche ~ https://hyper.dev
>
> _______________________________________________
> Offline-l mailing list
> Offline-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/offline-l

--
Samuel Klein  @metasj  w:user:sj  +1 617 529 4266
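
For illustration, here is how an indexer might consume proposal A from the quoted message, i.e. a <meta url="..."> tag embedded in each page. This is only a sketch under that assumption: the tag is the one suggested in the thread, not something the thread says .zim files contain today, and the names MetaURLParser and extract_original_url are made up for this example. It uses only the Python standard library and works on an HTML string, however that string was read out of the archive.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaURLParser(HTMLParser):
    """Remember the `url` attribute of the first <meta url="..."> tag seen.

    Hypothetical helper: the <meta url=...> convention is proposal A from
    the thread, not an existing ZIM or Kiwix feature.
    """

    def __init__(self) -> None:
        super().__init__()
        self.original_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta" or self.original_url is not None:
            return
        for name, value in attrs:
            if name == "url" and value:
                self.original_url = value
                return


def extract_original_url(html: str) -> Optional[str]:
    """Return the original URL recorded in the page, or None if absent."""
    parser = MetaURLParser()
    parser.feed(html)
    return parser.original_url


if __name__ == "__main__":
    page = '<html><head><meta url="https://foobar"></head><body>Hi</body></html>'
    print(extract_original_url(page))
```

A per-document metadata field (proposal A bis) would make the parsing step unnecessary, since the URL could then be read without touching the HTML at all.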