Interesting to see that there is now a plan for a phased import and the
first 250 pages have been created.
https://meta.wikimedia.org/wiki/China_Judgments_Online_Preservation_Program#Stage_1:_Micro-pilot_(50-200_pages;_fully_reviewed)
I've only checked a couple of pages and they look neat. Some of the layout
templates work better than I would have expected. I've not tested the
search functionality.
I've left a comment on the privacy complaint process on Meta.
Federico
On 18/12/25 18:34, Federico Leva (Nemo) wrote:
On 06/12/25 13:11, [email protected] wrote:
I would like to begin with some background, because many non-Chinese
Wikimedia contributors may not be aware of how significant CJO has
been for judicial transparency in China and how sharply access to it
has been reduced in recent years.
Thanks for this context, it's super interesting!
For our purposes, the important point is this: CJO has removed or
restricted access to large portions of its historical archive,
including documents that were originally public, legally non-
copyrightable under Chinese law, and crucial for understanding the
functioning of China’s legal system. Many judgments that were once
easily verifiable on the official site can no longer be checked
against their original source. These documents are at risk of
disappearing entirely from public access.
How strong is the presumption of copyright-ineligibility? What's the
legal source for it and could it change in the future? (I'm clueless
about the hierarchy of sources of law in China, sorry.)
Have other Wikisources hosted similarly massive, uniform corpora of
government or legal documents? How did you determine whether they fit
the mission of Wikisource? Were there concerns about overwhelming the
project or changing its character?
Nothing as massive, but Italian Wikisource hosts court rulings, usually
when they are especially newsworthy. In those cases (think powerful
politicians) there was always someone interested in getting them
removed, but I don't recall whether there were official requests for
redactions. However, we very intentionally do not copy all court rulings
from official court databases, because they are known to be riddled with
personal data. JurisWiki, a project from an experienced lawyer and free
knowledge advocate of Italy (Simone Aliprandi), had to shut down for
such issues after importing "just" 400k court rulings.
In our case, the source is an independent mirror of a government
website that is now selectively removing documents. While Wikimedia
projects have long preserved public domain government documents after
originals were taken down or censored, I am unsure how Wikisource
communities have handled this scenario in practice. Are mirrored
datasets acceptable when the original public source has been altered
or removed? How should we document provenance and authenticity for
future readers?
I would say that relying on a mirror is *better* than using an official
source, because you can have an additional layer of vetting, just like
we do with PGDP.
Are you in contact with the people who maintain that database? Are they
going to be responsive when you find personal data that was not
redacted? (This is a "when", not an "if". It's certain to happen.)
What's the added benefit that a Wikisource copy would bring to that
project? Find out, and focus on that. (Does it really need a
comprehensive copy?)
If we proceed, how should we structure this corpus so the project
remains usable? Are there recommended practices for:
– titling, metadata, and Wikidata integration for legal documents,
Wikidata should be ruled out immediately, as it cannot handle this volume
of documents.
As for titles, categories etc., you should probably talk with Chinese
practitioners who can tell you how people usually search these documents.
Say the rulings are organised in tidy partitions of 100 different
provinces (I'm inventing) and people usually search within each of them,
then you can use those as prefixes and it will be easy to disambiguate.
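To make the prefix idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the province names, the case-number format, the "/" separator and the make_title helper are invented for illustration and are not taken from CJO or from any existing Wikisource convention.

```python
def make_title(province: str, case_number: str) -> str:
    """Build a page title of the form '<province>/<case number>'.

    Both the partition (province) and the '/' separator are assumptions
    made for this example, not an established naming scheme.
    """
    province = province.strip()
    case_number = case_number.strip()
    if not province or not case_number:
        raise ValueError("province and case number are both required")
    return f"{province}/{case_number}"


# Two rulings with the same (invented) case number in different provinces
# end up with distinct titles, so no extra disambiguation is needed.
titles = [
    make_title("Province A", "2019-0001"),
    make_title("Province B", "2019-0001"),
]
```

The point is only that the partition people already search by carries the disambiguation, so the rest of the title can stay short.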
– organizing millions of pages so they do not overwhelm categories and
search,
– mitigating strain on job queues, dumps, and indexing,
This part I would say don't worry too much about, as WMF will let you
know if it becomes a problem. Maybe don't come up with exceedingly
esoteric templates and don't rely on DynamicPageList or other extensions
known to be slow.
4. Political and archival importance
Wikisource has historically preserved documents at risk of censorship
or disappearance, whether due to authoritarian restrictions or
institutional neglect. Do other communities have experience with
politically sensitive archival projects where the preservation value
itself was a central motivation?
Yes, see above, but not at this scale.
Best,
Federico
_______________________________________________
Wikisource-l mailing list -- [email protected]