Edit report at https://bugs.php.net/bug.php?id=64667&edit=1
ID: 64667
Comment by: mail+php at requinix dot net
Reported by: jisgro at teliae dot fr
Summary: mb_detect_encoding problem
Status: Open
Type: Bug
Package: *General Issues
Operating System: Debian Edge
PHP Version: 5.3.24
New Comment:
In general, and in your circumstance, mb_detect_encoding() will return the first
encoding that matches. The requirements for "ASCII" are that the bytes are all
<0x80 and your file won't match that. Next is "ISO-8859-1" which doesn't really
have any requirements at all. And that's the problem: it will always succeed.
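A minimal sketch of that difference, using mb_check_encoding() to test one encoding at a time. The sample strings are made up for illustration and are not taken from your file:

```php
<?php
// "été" encoded as ISO-8859-1: the single byte 0xE9 stands for "é".
$latin1 = "\xE9t\xE9";
// The same word encoded as UTF-8: "é" becomes the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9t\xC3\xA9";

// ASCII requires every byte to be below 0x80, so this fails.
var_dump(mb_check_encoding($latin1, 'ASCII'));      // bool(false)

// 0xE9 followed by "t" is not a valid UTF-8 sequence, so this fails too.
var_dump(mb_check_encoding($latin1, 'UTF-8'));      // bool(false)

// ISO-8859-1 assigns a character to every byte value, so it accepts anything,
// including text that is really UTF-8.
var_dump(mb_check_encoding($latin1, 'ISO-8859-1')); // bool(true)
var_dump(mb_check_encoding($utf8, 'ISO-8859-1'));   // bool(true)
```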
The array should be arranged with most exclusive (harder to validate) encodings
first and most permissive (easier to validate) last. I would start testing with:
[UTF-32, UTF-16, UTF-8, ASCII, ???]
The problem is what to fall back to. ISO 8859-1, -2, and -15 will always
succeed, and Windows-1251 and -1252 will only succeed if the entire string
consists of high-byte characters in a certain range. (Why do those two work
like that? No clue.)
So make a choice: if a string is neither UTF-* nor simple ASCII what do you
think it probably is? You're writing code in French so I'm going to guess ISO
8859-15.
[UTF-32, UTF-16, UTF-8, ASCII, ISO-8859-15]
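To see why the order matters, here is a rough sketch of "first match wins". firstMatchingEncoding() is a hypothetical helper written for this comment; it only imitates the behaviour described above with mb_check_encoding() and is not how mbstring is implemented internally. The UTF-32 and UTF-16 entries are left out to keep the demonstration small.

```php
<?php
/**
 * Hypothetical helper: return the first candidate encoding in which
 * $bytes is valid, or false when none of them matches.
 */
function firstMatchingEncoding($bytes, array $candidates)
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return false;
}

$utf8  = "\xC3\xA9t\xC3\xA9"; // "été" in UTF-8
$latin = "\xE9t\xE9";         // "été" in ISO-8859-1/-15

// Permissive encoding listed early: it matches before UTF-8 is even tried.
$badOrder = array('ASCII', 'ISO-8859-1', 'UTF-8');
// Strict encodings first, permissive fallback last.
$goodOrder = array('UTF-8', 'ASCII', 'ISO-8859-15');

echo firstMatchingEncoding($utf8, $badOrder), "\n";   // ISO-8859-1
echo firstMatchingEncoding($utf8, $goodOrder), "\n";  // UTF-8
echo firstMatchingEncoding($latin, $goodOrder), "\n"; // ISO-8859-15
```

With the permissive entry last, it only ever acts as the fallback for strings that failed every stricter check.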
If you want to go beyond that then you can do some rudimentary character
analysis: some byte combinations may make sense in one encoding but not in
another. Example: about half of the \xA0-\xAF bytes are symbols in ISO
8859-1/-15 but letters in ISO 8859-2.
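A starting point for that kind of analysis could be a simple byte count. countBytesInRange() is a hypothetical helper, and how to interpret the count (what threshold suggests ISO 8859-2 rather than -1/-15) is left open here; that depends on the text you expect.

```php
<?php
/**
 * Hypothetical helper: count the bytes of $bytes whose value lies
 * between $low and $high (inclusive).
 */
function countBytesInRange($bytes, $low, $high)
{
    $count = 0;
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $value = ord($bytes[$i]);
        if ($value >= $low && $value <= $high) {
            $count++;
        }
    }
    return $count;
}

// Reads as "Łódź" in ISO 8859-2; in ISO 8859-1 the first byte would be "£".
$sample = "\xA3\xF3d\xBC";

// Only 0xA3 falls inside \xA0-\xAF.
echo countBytesInRange($sample, 0xA0, 0xAF), "\n"; // 1
```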
Previous Comments:
------------------------------------------------------------------------
[2013-04-18 14:53:50] jisgro at teliae dot fr
Sorry, the test script is not very simple; here is a "light version":
function convertFileInUTF8($sFileName){
$sFileContent = file_get_contents($sFileName);
$tabKnownEncoding = array(
'ASCII'
,'ISO-8859-1'
,'ISO-8859-2'
,'ISO-8859-15'
,'UTF-8'
,'UTF-16'
,'UTF-32'
,'Windows-1251'
,'Windows-1252'
);
$sFormat = mb_detect_encoding($sFileContent, $tabKnownEncoding, true);
echo "Format : ".$sFormat."\n";
iconv_set_encoding("internal_encoding", "UTF-8");
$sNewContent = iconv($sFormat, 'UTF-8', $sFileContent);
//Save
file_put_contents($sFileName, $sNewContent);
ttt($sFileContent,'<->',$sNewContent);
$sFormat = mb_detect_encoding($sNewContent, $tabKnownEncoding, true);
echo "Format : ".$sFormat."\n";
}
------------------------------------------------------------------------
[2013-04-18 14:08:56] jisgro at teliae dot fr
Description:
------------
php 5.3.3
We open a file with ANSI encoding and set the encoding to UTF-8 with the
iconv_set_encoding("internal_encoding", "UTF-8") function.
mb_detect_encoding() returns the same result before and after the conversion:
Format : ISO-8859-1
The function is in the test script, it returns :
Format : ISO-8859-1 mystere
ééé ééé é éé à à à à à , <-> , ��� ��� � �� �
� � � � ,
Format : ISO-8859-1
Test script:
---------------
function convertirFichierEnUTF8($sNomFichier){
$sContenuFichier = file_get_contents($sNomFichier);
if($sContenuFichier == ''){//empty file or read error
return;
}
$tabFormatsReconnus = array(
'ASCII'
,'ISO-8859-1'
,'ISO-8859-2'
,'ISO-8859-15'
,'UTF-8'
,'UTF-16'
,'UTF-32'
,'Windows-1251'
,'Windows-1252'
);
$sFormat = mb_detect_encoding($sContenuFichier, $tabFormatsReconnus, true);
//echo $sNomFichier."\n";
echo "Format : ".$sFormat."\n";
if($sFormat === false){
CLog::trace('Erreur encodage du fichier '.$sNomFichier.' inconnu',
'Conversion fichier', 'Erreur détection encodage', 0,
CLog::INIVEAU_ERREUR_CRITIQUE, CConfig::$sEmail_Trace_Erreur);
return;
}
//The following formats do not need conversion
if(in_array($sFormat, array('UTF-8', 'ASCII'))){
return;
}
iconv_set_encoding("internal_encoding", "UTF-8");
//iconv_set_encoding("output_encoding", "UTF-8");
$sNouveauContenu = iconv($sFormat, 'UTF-8', $sContenuFichier);
//If the conversion had a problem
if($sNouveauContenu === false || $sNouveauContenu === ''){//iconv() returns false on failure
CLog::trace('Erreur à la conversion en UTF8 du fichier '.$sNomFichier,
'Conversion fichier', 'Erreur conversion UTF8', 0,
CLog::INIVEAU_ERREUR);
$sNouveauContenu = iconv($sFormat, 'UTF-8//IGNORE', $sContenuFichier);
CreeRepSiNonExiste(CConfig::$sRepertoire_log, 'erreursConversionFichiers');
file_put_contents(CConfig::$sRepertoire_log.'erreursConversionFichiers/'.basename($sNomFichier),
$sContenuFichier);
}
//Save the result of the conversion
file_put_contents($sNomFichier, $sNouveauContenu);
echo ($sContenuFichier === $sNouveauContenu ? 'aie aie aie c pareil':'mystere' );
ttt($sContenuFichier,'<->',$sNouveauContenu);
$sFormat = mb_detect_encoding($sNouveauContenu, $tabFormatsReconnus, true);
//echo $sNomFichier."\n";
echo "Format : ".$sFormat."\n";
}
Expected result:
----------------
return format in UTF8
Actual result:
--------------
Format : ISO-8859-1
------------------------------------------------------------------------