Hello everyone,

I am Quinn (User:SuperGrey) from Chinese Wikisource (zh.wikisource.org). I am 
writing to request advice and precedent from the wider Wikisource community and 
the Wikimedia Foundation regarding a proposed large-scale import of Chinese 
court judgments from the national database known as China Judgments Online 
(中国裁判文书网, often abbreviated as CJO).

I would like to begin with some background, because many non-Chinese Wikimedia 
contributors may not be aware of how significant CJO has been for judicial 
transparency in China and how sharply access to it has been reduced in recent 
years.

China Judgments Online was launched in 2014 by the Supreme People’s Court (SPC) 
as a major transparency initiative. For nearly a decade, courts across the 
country uploaded tens of millions of decisions, creating what was widely 
regarded as one of the world’s largest publicly accessible judicial databases. 
At its peak, CJO hosted over 140 million documents and received tens of 
billions of page views. Researchers inside and outside China used the site 
extensively to study judicial behavior, local governance, criminal justice, and 
institutional changes.

However, since around 2021, and especially in 2023–2024, the Chinese government 
has significantly reversed this openness. Multiple independent investigations 
and media reports have documented the systematic removal of previously public 
judgments, particularly those that reflect poorly on local authorities, expose 
procedural misconduct, involve politically sensitive issues, or contradict 
preferred political narratives. In late 2023, leaked SPC documents revealed 
instructions to migrate judgments into a new internal-only database accessible 
solely within the court system, while sharply reducing what remains publicly 
visible. Studies have shown that vast numbers of cases have already disappeared 
from public view. Major news organizations such as MIT Technology Review, Radio 
Free Asia, the South China Morning Post, and Reuters have all reported on this 
rollback of judicial transparency:
– https://www.technologyreview.com/2023/12/20/1085741/china-judgements-online-transparency-government/
– https://www.rfa.org/english/news/china/china-court-records-12142023132626.html
– https://www.scmp.com/news/china/politics/article/3246067/china-cut-back-access-court-rulings-sparking-concerns-about-judicial-transparency
– https://www.reuters.com/world/china/china-vows-judicial-disclosure-after-outcry-over-plan-curb-access-rulings-2024-01-22/

For our purposes, the important point is this: CJO has removed or restricted 
access to large portions of its historical archive, including documents that 
were originally public, legally non-copyrightable under Chinese law, and 
crucial for understanding the functioning of China’s legal system. Many 
judgments that were once easily verifiable on the official site can no longer 
be checked against their original source. These documents are at risk of 
disappearing entirely from public access.

An independent archiving project, caseopen.org, has preserved a large HTML 
snapshot of CJO’s judgments spanning 2013 to October 2024. The maintainers of 
caseopen.org have donated this dataset to Chinese Wikisource. The files capture 
the “online version” as it originally appeared on CJO, including formatting and 
errors, and therefore represent a unique opportunity to preserve a historical 
record of China’s legal system prior to this wave of censorship and delisting. 
In practical terms, this may be the last comprehensive public snapshot that 
will ever exist.

On Chinese Wikisource, I have proposed importing this dataset through a bot 
(User:SuperGrey-bot). The local discussion, including technical details and 
code links, is here (in Chinese):
https://zh.wikisource.org/wiki/Wikisource:机器人#User:SuperGrey-bot

The scale of the corpus is extremely large: tens of millions of judgments, 
potentially more if we include non-judgment document types such as 裁定书 (ruling 
document) and 通知书 (notification document). We are planning a staged import, 
beginning with small test batches, then individual months, and only later the 
full corpus, once the community settles questions about formatting, titling, 
metadata, and scope.

Because this project involves politically sensitive material with unusual 
archival value, and because the scale is unprecedented for our language 
Wikisource, I would greatly appreciate advice and precedent from the 
international community. This is not only a technical or organizational task; 
it is also a preservation effort. We are attempting to safeguard public domain 
legal documents that have been systematically removed from public access. 
Wikisource may be one of the last neutral, open, global platforms capable of 
preserving this historical record.

Given the potential size of the import, I would also appreciate input from the 
Wikimedia Foundation on any operational considerations. A multi-million-page 
import may affect storage, dumps, CirrusSearch indexing, and overall site 
performance. Before proceeding beyond small test batches, I would like to 
understand whether such an import is feasible within the current technical 
limits of Chinese Wikisource, and whether coordination with SRE or Cloud 
Services is recommended.

Specifically, I would like to ask for input on the following areas:

1. Scope and suitability  
Have other Wikisources hosted similarly massive, uniform corpora of government 
or legal documents? How did you determine whether they fit the mission of 
Wikisource? Were there concerns about overwhelming the project or changing its 
character?

2. Verifiability and provenance  
In our case, the source is an independent mirror of a government website that 
is now selectively removing documents. While Wikimedia projects have long 
preserved public domain government documents after originals were taken down or 
censored, I am unsure how Wikisource communities have handled this scenario in 
practice. Are mirrored datasets acceptable when the original public source has 
been altered or removed? How should we document provenance and authenticity for 
future readers?

3. Organizational and technical considerations  
If we proceed, how should we structure this corpus so the project remains 
usable? Are there recommended practices for:  
– titling, metadata, and Wikidata integration for legal documents,  
– organizing millions of pages so they do not overwhelm categories and search,  
– mitigating strain on job queues, dumps, and indexing,  
– making future partial deletions or corrections feasible if political pressure 
or legal demands (e.g., DMCA takedown notices) ever arise?

4. Political and archival importance  
Wikisource has historically preserved documents at risk of censorship or 
disappearance, whether due to authoritarian restrictions or institutional 
neglect. Do other communities have experience with politically sensitive 
archival projects where the preservation value itself was a central motivation?

At present, Chinese Wikisource is still deliberating basic formatting and 
policy questions. No large imports will be performed until a local consensus is 
clear. Although we are working from the independent caseopen.org snapshot 
rather than relying on ongoing availability of the official CJO site, the 
broader context is that public access to Chinese judicial decisions has already 
been substantially reduced in recent years. Because our dataset preserves a 
historical record that may not remain accessible through official channels, we 
believe this is an appropriate moment to seek broader input and learn from 
other Wikisource communities with similar archival experiences.

Thank you very much for your time, advice, and any examples or concerns you can 
share. Even understanding which questions we should be asking would be 
extremely helpful.

Best regards,
Quinn Gao (User:SuperGrey)
https://meta.wikimedia.org/wiki/User:SuperGrey
_______________________________________________
Wikisource-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
