Many of us have run into character encoding issues because we don't control 
our input sources and assumed, as is common, that UTF-8 covers everything.

In my lab, for example, some of our social media posts used the Zawgyi 
Burmese character set rather than Unicode Burmese. (Because Myanmar's 
technology developed in an environment largely closed off from the world, a 
non-standard character set emerged that is still very common on mobile 
phones.) We had fully tested the app with Unicode Burmese, but honestly 
didn't know Zawgyi was even a thing we would see in our dataset. We've also 
had problems with non-Unicode word separators in Arabic.
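For what it's worth, here is a crude sketch of the kind of check that can flag suspect text (this is a hypothetical heuristic, not what we actually ran; real Zawgyi detection is harder, and dedicated tools such as Google's myanmar-tools exist for it). The idea is that Zawgyi repurposes codepoints in U+1060–U+1097, letters for other languages of Myanmar that rarely appear in genuine Unicode Burmese, so their presence alongside Burmese letters is a red flag:

```python
def maybe_zawgyi(text: str) -> bool:
    """Crude heuristic: Zawgyi repurposes U+1060-U+1097 (Mon, Karen, and
    related letters that are rare in genuine Unicode Burmese) for Burmese
    glyph forms, so seeing them mixed with Burmese letters is suspicious.
    A real detector scores many more signals than this."""
    has_burmese = any("\u1000" <= ch <= "\u109f" for ch in text)
    has_suspect = any("\u1060" <= ch <= "\u1097" for ch in text)
    return has_burmese and has_suspect
```

Plain ASCII, or Burmese text that stays out of the repurposed range, would pass this check; text mixing Burmese letters with those codepoints would be flagged for closer inspection.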

What we've found helpful is to view the offending text in a hex editor and 
work out which non-standard characters may be causing the problem.
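The same inspection can be scripted. A small sketch that prints each character's codepoint and Unicode name, which makes unexpected characters jump out much faster than scanning raw hex:

```python
import unicodedata

def dump_codepoints(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per character: the scripted
    equivalent of eyeballing the bytes in a hex editor."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# U+200B is invisible in most editors but shows up plainly here:
for line in dump_codepoints("a\u200bb"):
    print(line)
# U+0061 LATIN SMALL LETTER A
# U+200B ZERO WIDTH SPACE
# U+0062 LATIN SMALL LETTER B
```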

Some data conversion may be necessary before insertion. But the first step 
is knowing WHICH characters are causing the issue.
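As a concrete example of that conversion step (the replacement table below is purely hypothetical; you would fill it with whatever characters your own codepoint dump turns up), once the offending characters are identified they can be mapped to standard equivalents before insertion:

```python
# Hypothetical cleanup table: map the non-standard separators the
# inspection revealed to their ordinary Unicode equivalents.
REPLACEMENTS = {
    "\u200b": " ",   # zero-width space -> ordinary space (example only)
    "\u00a0": " ",   # no-break space   -> ordinary space (example only)
}

def sanitize(text: str) -> str:
    """Apply the replacement table before inserting into the database."""
    return text.translate(str.maketrans(REPLACEMENTS))
```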
