Thanks, Barton. As Jonathan pointed out, other people noticed this far before 
me, but I’m not sure it’s really been discussed much. 

 

I’ve filed a bug at 
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=18336. 

 

I haven’t checked Zebra, but I assume that it’s OK. That’s a good question 
though. 

 

It seems that most of CJK is handled by 3 byte characters, but I was able to 
find some in the 4 byte range. Unfortunately, my knowledge of CJK languages is 
very limited. While I’ve been teaching myself a bit of Chinese for fun in my 
non-existent spare time, I haven’t really gotten to the written stage. But an 
interesting case is emoji. Nick from Bywater posted about adding support for 
emoji to Koha: https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=15794. 

 

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St

Ultimo, NSW 2007

Australia

 

Office: 02 9212 0899

Direct: 02 8005 0595

 

From: Barton Chittenden [mailto:bar...@bywatersolutions.com] 
Sent: Saturday, 25 March 2017 12:05 AM
To: David Cook <dc...@prosentient.com.au>
Cc: Koha-devel <koha-devel@lists.koha-community.org>
Subject: Re: [Koha-devel] MySQL silently truncating strings due to 3 byte limit 
in 'utf8' charset

 

Excellent catch, David!

 

Please file a bug for that, if you haven't already, the use of MARC 880 for CJK 
languages is quite common in US libraries.

 

Does Zebra handle 4 byte characters correctly?

 

--Barton

 

On Fri, Mar 24, 2017 at 2:08 AM, David Cook <dc...@prosentient.com.au 
<mailto:dc...@prosentient.com.au> > wrote:

Hi all,

 

I ran into a problem this afternoon where MySQL was silently truncating my UTF8 
encoded string in biblio_metadata.metadata and biblio.notes. Admittedly, the 
string had some funny looking data*, but it was all valid UTF8. I had zero 
problems with XML::LibXML or MARC::Record. However, whenever I would use DBI or 
the mysql command line client, I’d get a truncated string**. This meant a 
completely invalid XML record which is irretrievable from the database using 
Koha’s XML handling methods. 

 

However, I eventually found https://bugs.mysql.com/bug.php?id=53005 where they 
state that the MySQL ‘utf8’ charset has a 3 byte limit 
(https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html). That tends 
to be fine in most cases, but 𝔤 is 4 bytes (f0 9d 94 a4 
http://dev.networkerror.org/utf8/?start=120100 
<http://dev.networkerror.org/utf8/?start=120100&end=120200&cols=4&show_uni_int=on&show_uni_hex=on&show_html_ent=on&show_raw_hex=on&show_raw_bin=on>
 
&end=120200&cols=4&show_uni_int=on&show_uni_hex=on&show_html_ent=on&show_raw_hex=on&show_raw_bin=on).
 It was on this character that the truncation happened. Apparently, there is a 
4 byte UTF-8 charset called utf8mb4: 
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html. If you 
“SET NAMES = ‘utf8mb4’” for your client and set the charset for your database 
column, that will allow you to correctly insert the data. 

 

So the truncation doesn’t generate an error but it does generate a warning, 
which is much more obvious in the mysql CLI client than when using DBI:

MariaDB [devkohadcook]> update biblio set abstract = 'We introduce a new 
perspective and a generalization of spectral networks for 4d N=2" 
style="position: relative;" tabindex="0" id="MathJax-Element-1-Frame" 
class="MathJax"&gt;=2 theories of class S" style="position: relative;" 
tabindex="0" id="MathJax-Element-2-Frame" class="MathJax"&gt; associated to 
Lie algebras g=An" style="position: relative;" tabindex="0" 
id="MathJax-Element-3-Frame" class="MathJax"&gt;𝔤=An, Dn" style="position: 
relative;" tabindex="0" id="MathJax-Element-4-Frame" class="MathJax"&gt;Dn, E6" 
style="position: relative;" tabindex="0" id="MathJax-Element-5-Frame" 
class="MathJax"&gt;E6, and E7" style="position: relative;" tabindex="0" 
id="MathJax-Element-6-Frame" class="MathJax"&gt;E7. Spectral networks directly 
compute the BPS spectra of 2d theories on surface defects coupled to the 4d 
theories. A Lie algebraic interpretation of these spectra emerges naturally 
from our construction, leading to a new description of 2d-4d wall-crossing 
phenomena. Our construction also provides an efficient framework for the study 
of BPS spectra of the 4d theories. In addition, we consider novel types of 
surface defects associated with minuscule representations of g" 
style="position: relative;" tabindex="0" id="MathJax-Element-7-Frame" 
class="MathJax"&gt;𝔤.' where biblionumber = 50314;

Query OK, 0 rows affected, 1 warning (0.00 sec)

Rows matched: 1  Changed: 0  Warnings: 1

 

Because it doesn’t raise an exception, I’d say it’s silent. You have to check 
for it. Using mysql, you can send “SHOW WARNINGS” after the insert query and 
you can see something like:

+---------+------+--------------------------------------------------------------------------------+

| Level   | Code | Message                                                      
                  |

+---------+------+--------------------------------------------------------------------------------+

| Warning | 1366 | Incorrect string value: '\xF0\x9D\x94\xA4=A...' for column 
'abstract' at row 1 |

+---------+------+--------------------------------------------------------------------------------+

1 row in set (0.01 sec)

 

With DBI, it’s harder: 
http://stackoverflow.com/questions/7946655/dbi-perl-logging-mysql-warnings. 
Apparently you can use $dbh->{mysql_warning_count} to check for warnings, and 
then send “SHOW WARNINGS” via the same database handle and that should give you 
your warning result. Or you can run MySQL in “traditional” mode which treats 
warnings as errors. 

 

So this seems like an edgecase… it’s some wonky data (albeit valid UTF8 data) 
in an otherwise English record. 

 

However, according to MySQL (I haven’t verified for myself), Chinese, Japanese, 
and Korean sometimes use 4 bytes for characters. So people using these 
languages with Koha could be getting truncated data without realising that’s 
what is happening or why.

 

I’ve tried out utf8mb4 and it works great. I think the downside is that it uses 
more space for certain types of columns because it needs to allocate more space 
for that 4th byte. I don’t know whether we should consider switching from utf8 
to utf8mb4, so that we’re inclusive of CJK and comply with RFC3629 which states 
that there’s a 4 byte maximum.

 

I think it’s worth discussion though. This wouldn’t just affect libraries in 
Asia. It would affect any library which stores records with CJK characters in a 
MARC 880 “Alternate Graphic Representation” field. 

 

I’m at the end of my Friday work day, so I’m going to leave it there, but food 
for thought!

 

 

 

 

*Aforementioned strange data:

We introduce a new perspective and a generalization of spectral networks for 4d 
N=2" style="position: relative;" tabindex="0" id="MathJax-Element-1-Frame" 
class="MathJax"&gt;=2 theories of class S" style="position: relative;" 
tabindex="0" id="MathJax-Element-2-Frame" class="MathJax"&gt; associated to 
Lie algebras g=An" style="position: relative;" tabindex="0" 
id="MathJax-Element-3-Frame" class="MathJax"&gt;𝔤=An, Dn" style="position: 
relative;" tabindex="0" id="MathJax-Element-4-Frame" class="MathJax"&gt;Dn, E6" 
style="position: relative;" tabindex="0" id="MathJax-Element-5-Frame" 
class="MathJax"&gt;E6, and E7" style="position: relative;" tabindex="0" 
id="MathJax-Element-6-Frame" class="MathJax"&gt;E7. Spectral networks directly 
compute the BPS spectra of 2d theories on surface defects coupled to the 4d 
theories. A Lie algebraic interpretation of these spectra emerges naturally 
from our construction, leading to a new description of 2d-4d wall-crossing 
phenomena. Our construction also provides an efficient framework for the study 
of BPS spectra of the 4d theories. In addition, we consider novel types of 
surface defects associated with minuscule representations of g" 
style="position: relative;" tabindex="0" id="MathJax-Element-7-Frame" 
class="MathJax"&gt;𝔤.

 

** Truncated string:

We introduce a new perspective and a generalization of spectral networks for 4d 
N=2" style="position: relative;" tabindex="0" id="MathJax-Element-1-Frame" 
class="MathJax"&gt;=2 theories of class S" style="position: relative;" 
tabindex="0" id="MathJax-Element-2-Frame" class="MathJax"&gt; associated to 
Lie algebras g=An" style="position: relative;" tabindex="0" 
id="MathJax-Element-3-Frame" class="MathJax"&gt;𝔤

 

David Cook

Systems Librarian

Prosentient Systems

72/330 Wattle St

Ultimo, NSW 2007

Australia

 

Office: 02 9212 0899

Direct: 02 8005 0595

 


_______________________________________________
Koha-devel mailing list
Koha-devel@lists.koha-community.org 
<mailto:Koha-devel@lists.koha-community.org> 
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/

 

_______________________________________________
Koha-devel mailing list
Koha-devel@lists.koha-community.org
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/

Reply via email to