On Tue, Aug 22, 2017 at 5:15 PM, Gregory Ewing wrote:
> Chris Angelico wrote:
>>
>> a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> this one's still a mystery.
>
>
> It's unlikely that even a naive
Chris Angelico wrote:
a naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
this one's still a mystery.
It's unlikely that even a naive ASCII upper/lower casing algorithm
would be *that* naive; it would have
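To make the point concrete, here is a hypothetical sketch (not code from any message in the thread) of the two flavours of "naive" ASCII upper-casing being contrasted: one that blindly clears bit 0x20 of every byte, which would indeed turn 0xA1 into 0x81 but would also turn '!' (0x21) into SOH (0x01), and a range-checked one that cannot produce 0x81 at all.

def brutal_upper(data: bytes) -> bytes:
    # Clear bit 0x20 unconditionally: 0xA1 -> 0x81, but also 0x21 ('!') -> 0x01 (SOH).
    return bytes(b & ~0x20 for b in data)

def naive_ascii_upper(data: bytes) -> bytes:
    # Range-checked version: only bytes in a-z (0x61-0x7A) are touched,
    # so neither 0xA1 nor 0x21 is ever altered.
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)

print(brutal_upper(b'\xa1!'))       # b'\x81\x01'
print(naive_ascii_upper(b'\xa1!'))  # b'\xa1!' (unchanged)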
Ian Kelly wrote:
One possibility is that it's the same two bytes. That would make it
0xE2 0x80 0x9D which is "right double quotation mark". Since it keeps
appearing after ending double quotes that seems plausible, although
one has to wonder why it appears *in addition to* the ASCII double
quotes.
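A quick check of that theory (illustrative snippet, not from the thread): 0x9D is the final byte of the UTF-8 encoding of U+201D, and it is one of the code points left undefined in Windows-1252, which is why it survives as a bare byte instead of mapping to a printable character.

import unicodedata

ch = b'\xe2\x80\x9d'.decode('utf-8')
print(ch, unicodedata.name(ch))   # RIGHT DOUBLE QUOTATION MARK

# 0x9D is undefined in Windows-1252, so decoding the stray byte as cp1252 fails:
try:
    b'\x9d'.decode('cp1252')
except UnicodeDecodeError as e:
    print(e)   # 'charmap' codec can't decode byte 0x9d ... character maps to <undefined>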
On 08/17/2017 05:53 PM, Chris Angelico wrote:
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
I'm cleaning up some data which has text description fields from
multiple sources.
A few more cases:
bytearray(b'\xe5\x81ukasz zmywaczyk')
This
Marko Rauhamaa writes:
> Chris Angelico :
>
>> Ohh. We have no evidence that uppercasing is going on here, and a
>> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> this one's still a mystery.
>
> BT
On Fri, Aug 18, 2017, at 03:39, Marko Rauhamaa wrote:
> BTW, I was reading up on the history of ASCII control characters. Quite
> fascinating.
>
> For example, have you ever wondered why DEL is the odd control character
> out at the code point 127? The reason turns out to be paper punch tape.
> By
On 2017-08-18 04:46, John Nagle wrote:
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few
On Fri, Aug 18, 2017 at 5:39 PM, Marko Rauhamaa wrote:
> Chris Angelico :
>
>> Ohh. We have no evidence that uppercasing is going on here, and a
>> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
>> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
>> thi
Chris Angelico :
> Ohh. We have no evidence that uppercasing is going on here, and a
> naive ASCII upper-casing wouldn't produce 0x81 either - if it did, it
> would also convert 0x21 ("!") into 0x01 (SOH, a control character). So
> this one's still a mystery.
BTW, I was reading up on the history
On Fri, Aug 18, 2017 at 5:11 PM, Marko Rauhamaa wrote:
> Chris Angelico :
>
>> On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote:
>>> Chris Angelico :
>>>
On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
> John Nagle writes:
>> Since, as someone pointed out, there was UTF-8
Chris Angelico :
> On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote:
>> Chris Angelico :
>>
>>> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
John Nagle writes:
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm
On Fri, Aug 18, 2017 at 4:57 PM, Marko Rauhamaa wrote:
> Chris Angelico :
>
>> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
>>> John Nagle writes:
Since, as someone pointed out, there was UTF-8 which had been
run through an ASCII-type lower casing algorithm
>>>
>>> I spent a few
Chris Angelico :
> On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
>> John Nagle writes:
>>> Since, as someone pointed out, there was UTF-8 which had been
>>> run through an ASCII-type lower casing algorithm
>>
>> I spent a few minutes figuring out if some of the mysterious 0x81's
>> could be
On Fri, Aug 18, 2017 at 4:38 PM, Paul Rubin wrote:
> John Nagle writes:
>> Since, as someone pointed out, there was UTF-8 which had been
>> run through an ASCII-type lower casing algorithm
>
> I spent a few minutes figuring out if some of the mysterious 0x81's
> could be from ASCII-lower-casing some Unicode combining characters
John Nagle writes:
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm
I spent a few minutes figuring out if some of the mysterious 0x81's
could be from ASCII-lower-casing some Unicode combining characters, but
the numbers didn't seem
On Fri, Aug 18, 2017 at 4:24 PM, John Nagle wrote:
> I'm coming around to the idea that some of these snippets
> have been previously mis-converted, which is why they make no sense.
> Since, as someone pointed out, there was UTF-8 which had been
> run through an ASCII-type lower casing algorithm
On 08/17/2017 10:12 PM, Ian Kelly wrote:
Here's some more 0x9d usage, each from a different data item:
Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The
Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"
This one seems like a good hint since \x99 here looks
;))
'LATIN SMALL LETTER U WITH GRAVE'
Doesn't seem too likely.
This may help:
http://i18nqa.com/debug/bug-double-conversion.html
There's always the possibility that it's just junk, or mojibake from some other
source, so it might not be anything sensible in any extended ASCII
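The linked page describes the classic "double conversion" problem (UTF-8 bytes decoded as a single-byte charset and then re-encoded). A minimal sketch of the repair it suggests, assuming that failure mode; this is illustrative only, not code from the thread:

# UTF-8 bytes of 'Münster' mistakenly read as Latin-1 produce mojibake;
# round-tripping through Latin-1 recovers the original text.
garbled = 'MÃ¼nster'
fixed = garbled.encode('latin-1').decode('utf-8')
print(fixed)   # Münster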
On Thu, Aug 17, 2017 at 9:46 PM, John Nagle wrote:
> The 0x9d thing seems unrelated to the Polish names thing. 0x9d
> shows up in the middle of English text that's otherwise ASCII.
> Is this something that can appear as a result of cutting and
> pasting from Microsoft Word?
>
> I'd like to
On 08/17/2017 05:53 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
>> On 08/17/2017 05:14 PM, John Nagle wrote:
>>> I'm cleaning up some data which has text description fields from
>>> multiple sources.
>> A few more cases:
>>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
On Thu, Aug 17, 2017 at 8:15 PM, MRAB wrote:
> On 2017-08-18 01:53, Chris Angelico wrote:
>> So here's an insane theory: something attempted to lower-case the byte
>> stream as if it were ASCII. If you ignore the high bit, 0xC5 looks
>> like 0x45 or "E", which lower-cases by having 32 added to it,
On 2017-08-18 01:30, John Nagle wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:
bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
On 2017-08-18 01:53, Chris Angelico wrote:
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
On 08/17/2017 05:14 PM, John Nagle wrote:
I'm cleaning up some data which has text description fields from
multiple sources.
A few more cases:
bytearray(b'\xe5\x81ukasz zmywaczyk')
This one
On 2017-08-18 01:14, John Nagle wrote:
I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding
John Nagle writes:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a database dump, deciding which character
On Thu, Aug 17, 2017 at 6:53 PM, Chris Angelico wrote:
> That doesn't work for everything, though. The 0x81 0x81 and 0x9d ones
> are still a puzzle.
I'm fairly sure that b'M\x81\x81\xfcnster' is 'Münster'. It decodes to
that in Latin-1 if you remove the \x81 bytes. The question then is
what those
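A short check of that reading (illustrative only): drop the 0x81 bytes and the rest decodes cleanly as Latin-1.

raw = bytearray(b'M\x81\x81\xfcnster')
cleaned = bytes(b for b in raw if b != 0x81)   # remove the mystery 0x81 bytes
print(cleaned.decode('latin-1'))               # Münster (0xFC is ü in Latin-1/cp1252)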
On Fri, Aug 18, 2017 at 10:54 AM, Ian Kelly wrote:
> On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly wrote:
>> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote:
>>> A few more cases:
>>>
>>> bytearray(b'miguel \xe3\x81ngel santos')
>>
>> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the rest of the name.
On Thu, Aug 17, 2017 at 6:52 PM, Ian Kelly wrote:
> On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote:
>> A few more cases:
>>
>> bytearray(b'miguel \xe3\x81ngel santos')
>
> If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
> rest of the name.
>
>> bytearray(b'\xe5\x81ukasz zmywaczyk')
On Thu, Aug 17, 2017 at 6:30 PM, John Nagle wrote:
> A few more cases:
>
> bytearray(b'miguel \xe3\x81ngel santos')
If that were b'\xc3\x81' it would be Á in UTF-8 which would fit the
rest of the name.
> bytearray(b'\xe5\x81ukasz zmywaczyk')
If that were b'\xc5\x81' it would be Ł in UTF-8 which
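Decoding the conjectured originals (the 0xC3/0xC5 variants Ian suggests, not the damaged bytes actually observed) confirms they would read sensibly as UTF-8:

print(b'miguel \xc3\x81ngel santos'.decode('utf-8'))   # miguel Ángel santos
print(b'\xc5\x81ukasz zmywaczyk'.decode('utf-8'))       # Łukasz zmywaczyk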
On Fri, Aug 18, 2017 at 10:30 AM, John Nagle wrote:
> On 08/17/2017 05:14 PM, John Nagle wrote:
>> I'm cleaning up some data which has text description fields from
>> multiple sources.
> A few more cases:
>
> bytearray(b'\xe5\x81ukasz zmywaczyk')
This one has to be Polish, and the first char
On Thu, Aug 17, 2017 at 6:27 PM, Chris Angelico wrote:
> On Fri, Aug 18, 2017 at 10:14 AM, John Nagle wrote:
>> I'm cleaning up some data which has text description fields from
>> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
>> And some are in some other character set. S
On 08/17/2017 05:14 PM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources.
A few more cases:
bytearray(b'miguel \xe3\x81ngel santos')
bytearray(b'lidija kmeti\xe4\x8d')
bytearray(b'\xe5\x81ukasz zmywaczyk')
bytearray(b'M\x81\x81\xfcnster')
On Fri, Aug 18, 2017 at 10:14 AM, John Nagle wrote:
> I'm cleaning up some data which has text description fields from
> multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
> And some are in some other character set. So I have to examine and
> sanity check each field in a databa
I'm cleaning up some data which has text description fields from
multiple sources. Some are in UTF-8. Some are in WINDOWS-1252.
And some are in some other character set. So I have to examine and
sanity check each field in a database dump, deciding which character
set best represents what's
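For what it's worth, the usual brute-force approach to that kind of triage, sketched here as illustrative code rather than anything from the thread, is to attempt a strict UTF-8 decode first and only fall back to a single-byte charset when that fails:

def best_effort_decode(raw: bytes) -> str:
    # Illustrative heuristic only: strict UTF-8 first (it rejects most
    # single-byte text containing high bytes), then Windows-1252, then
    # Latin-1, which accepts every byte and therefore always succeeds.
    for enc in ('utf-8', 'windows-1252'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return raw.decode('latin-1')

print(best_effort_decode(b'lidija kmeti\xc4\x8d'))  # valid UTF-8: lidija kmetič
print(best_effort_decode(b'M\xfcnster'))            # not UTF-8: cp1252 gives Münster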
On 2017-01-13 05:44 PM, Grant Edwards wrote:
On 2017-01-13, D'Arcy Cain wrote:
Here is the failing code:
with open(sys.argv[1], encoding="latin-1") as fp:
    for ln in fp:
        print(ln)

Traceback (most recent call last):
  File "./load_iff", line 11, in <module>
    print(ln)
UnicodeEncodeError:
On 2017-01-13, D'Arcy Cain wrote:
> I thought I was done with this crap once I moved to 3.x but some
> Winblows machines are still sending what some circles call "Extended
> ASCII". I have a file that I am trying to read and it is barfing on
> some characters. For
On 2017-01-13, D'Arcy Cain wrote:
> Here is the failing code:
>
> with open(sys.argv[1], encoding="latin-1") as fp:
>     for ln in fp:
>         print(ln)
>
> Traceback (most recent call last):
>   File "./load_iff", line 11, in <module>
>     print(ln)
> UnicodeEncodeError: 'ascii' codec can't encode character
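Note that the latin-1 decode has already succeeded by this point; the traceback is an encode error raised by print() because stdout's encoding is ASCII. A minimal workaround sketch (illustrative, not the poster's code) is to bypass the text layer, or alternatively run with PYTHONIOENCODING=utf-8 set in the environment:

import sys

with open(sys.argv[1], encoding="latin-1") as fp:
    for ln in fp:
        # print() re-encodes to sys.stdout.encoding (ASCII here) and fails on é;
        # writing UTF-8 bytes to the underlying buffer sidesteps that.
        sys.stdout.buffer.write(ln.encode('utf-8'))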
On Fri, Jan 13, 2017, at 17:24, D'Arcy Cain wrote:
> I thought I was done with this crap once I moved to 3.x but some
> Winblows machines are still sending what some circles call "Extended
> ASCII". I have a file that I am trying to read and it is barfing on
> so
I thought I was done with this crap once I moved to 3.x but some
Winblows machines are still sending what some circles call "Extended
ASCII". I have a file that I am trying to read and it is barfing on
some characters. For example:
due to the Qu\xe9bec government
Obviously should
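A one-line check of that example (illustrative only): 0xE9 is é in both Windows-1252 and Latin-1, so decoding with either recovers the intended text.

print(b'due to the Qu\xe9bec government'.decode('cp1252'))
# due to the Québec government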
even says
"There are *several* different variations of the 8-bit ASCII table."
(emphasis added), which is an understatement and a half. Wikipedia claims over
220 different "extended ASCII" encodings:
https://en.wikipedia.org/wiki/Extended_ASCII
That's more than the nu