On 2 Apr, 01:43, Neil Hodgson <nhodg...@iinet.net.au> wrote:
> Mark Lawrence:
> > You've given many examples of the same type of micro benchmark, not many
> > examples of different types of benchmark.
>     Trying to work out what jmfauth is on about I found what appears to
> be a performance regression with '<' string comparisons on Windows
> 64-bit. It's around 30% slower on a 25-character string that differs in
> the last character and 70-100% on a 100-character string that differs at
> the end.
>     Can someone else please try this to see if it's reproducible? Linux
> doesn't show this problem.
>  >c:\python32\python -u "charwidth.py"
> 3.2 (r32:88445, Feb 20 2011, 21:30:00) [MSC v.1500 64 bit (AMD64)]
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']176
> [0.7116295577956576, 0.7055591343157613, 0.7203483026429418]
> a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']176
> [0.7664397841378787, 0.7199902325464409, 0.713719289812504]
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']176
> [0.7341851791817691, 0.6994205901833599, 0.7106807593741005]
> a=['C:/Users/Neil/Documents/𠀀','C:/Users/Neil/Documents/𠀁']180
> [0.7346812372666784, 0.6995411113377914, 0.7064768417728411]
>  >c:\python33\python -u "charwidth.py"
> 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit
> (AMD64)]
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/z']108
> [0.9913326076446045, 0.9455845241056282, 0.9459076605341776]
> a=['C:/Users/Neil/Documents/λ','C:/Users/Neil/Documents/η']192
> [1.0472289217234318, 1.0362342484091207, 1.0197109728048384]
> a=['C:/Users/Neil/Documents/b','C:/Users/Neil/Documents/η']192
> [1.0439643704533834, 0.9878581050301687, 0.9949265834034335]
> a=['C:/Users/Neil/Documents/𠀀','C:/Users/Neil/Documents/𠀁']312
> [1.0987483965446412, 1.0130257167690004, 1.024832248526499]
>     Here is the code:
> # encoding:utf-8
> import os, sys, timeit
> print(sys.version)
> examples = [
> "a=['$b','$z']",
> "a=['$λ','$η']",
> "a=['$b','$η']",
> "a=['$\U00020000','$\U00020001']"]
> baseDir = "C:/Users/Neil/Documents/"
> #~ baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"
> for t in examples:
>      t = t.replace("$", baseDir)
>      # Using os.write as a simple way to get UTF-8 to stdout
>      os.write(sys.stdout.fileno(), t.encode("utf-8"))
>      print(sys.getsizeof(t))
>      print(timeit.repeat("a[0] < a[1]",t,number=5000000))
>      print()
>     For a more significant performance difference try replacing the
> baseDir setting with (may be wrapped):
> baseDir = "C:/Users/Neil/Documents/Visual Studio 2012/Projects/Sigma/QtReimplementation/HLFKBase/Win32/x64/Debug"
>     Neil


>c:\python32\pythonw -u "charwidth.py"
3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
[0.8343414906182101, 0.8336184057396241, 0.8330473419738562]

[0.818378092261062, 0.8180854713107406, 0.8192279926793571]

[0.8131353330542339, 0.8126985677326912, 0.8122744051977042]

a=['D:\jm\jmpy\py3app\stringbenchð €€','D:\jm\jmpy\py3app
[0.8271094603211102, 0.82704053883214, 0.8265781741004083]

>Exit code: 0
>c:\Python33\pythonw -u "charwidth.py"
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit
[1.3840254166697845, 1.3933888932429768, 1.391664674507438]

[1.6217970707185678, 1.6279369907932706, 1.6207041728220117]

[1.5150522562729396, 1.5130369919353992, 1.5121890607025037]

a=['D:\jm\jmpy\py3app\stringbenchð €€','D:\jm\jmpy\py3app
[1.6135375194801664, 1.6117739170366434, 1.6134331526540109]

>Exit code: 0

- win7 32-bits
- The file is in utf-8
- Do not be alarmed by this output; it is just a copy/paste from your
excellent editor, the coding output pane is configured to use the
- Of course, and as expected, similar behaviour from a console. (Which
shows how good your application is.)


Something different.

From a previous msg, on this thread.


> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.

    I think you mean writing a lot of Latin-1 characters outside ASCII.
However, even people writing texts in, say, French will find that only a
small proportion of their text is outside ASCII and so the cost of UTF-8
is correspondingly small.
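
To put a rough number on that point, here is a minimal sketch. The sample sentence and the helper name utf8_overhead are made up for illustration; they are not from the thread. It compares the number of characters in a French sentence with the number of bytes in its UTF-8 encoding:

```python
def utf8_overhead(text):
    """Return (characters, UTF-8 bytes, bytes per character) for text."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars

# Hypothetical sample; the accented letters are the only non-ASCII characters.
sample = "Les élèves français ont été très étonnés par la réponse."
chars, size, ratio = utf8_overhead(sample)
print(chars, size, round(ratio, 2))
```

Each accented letter costs one extra byte in UTF-8, so the overhead tracks the share of non-ASCII characters in the text.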

    The counter-problem is that a French document that needs to include
one mathematical symbol (or emoji) outside Latin-1 will double in size
as a Python string.
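
A small sketch of that doubling, assuming CPython 3.3 or later (where PEP 393 applies). The document content is invented for illustration, and sys.getsizeof reports implementation-specific sizes, so the exact numbers will vary between builds:

```python
import sys

# Hypothetical document: 1000 copies of a French word that fits in Latin-1.
latin1_doc = "été " * 1000
# The same document with a single character outside Latin-1 appended.
mixed_doc = latin1_doc + "\u2211"  # U+2211 N-ARY SUMMATION

small = sys.getsizeof(latin1_doc)
large = sys.getsizeof(mixed_doc)
print(small, large, round(large / small, 2))
```

On such a build the second figure comes out at roughly twice the first, because every character of mixed_doc is stored in two bytes instead of one.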


I already explained this.
It is, how to say, a misunderstanding of Unicode. What counts
is not the number of non-ASCII chars you have in a stream.
What is relevant is the fact that every char is handled with the "same
algorithm", in that case UTF-8.
Unicode takes you from the "char" up to the Unicode transformation
format. Then it is a question of implementation.

This is exactly what you are doing in Scintilla (maybe without
fully realizing it).

An editor illustrates the example I gave very well. You enter a
thousand ASCII chars, then - boom - as you enter a non-ASCII
char, your editor (assuming it uses a mechanism like the FSR)
has to internally re-encode everything!
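
One way to observe this widening from Python itself is sketched below. It assumes CPython 3.3 or later, and the helper name bytes_per_char is hypothetical. Since str objects are immutable, the wider storage shows up in the new string produced by the concatenation:

```python
import sys

def bytes_per_char(s):
    """Estimate CPython's per-character storage for s (assumes PEP 393)."""
    # Doubling the string doubles the character payload but not the fixed
    # header, so the size difference divided by the length is the width.
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)

text = "a" * 1000                                # a thousand ASCII chars
print(bytes_per_char(text))                      # 1 on CPython 3.3+
print(bytes_per_char(text + "\u20ac"))           # 2 after one euro sign
print(bytes_per_char(text + "\U00020000"))       # 4 after one astral char
```

The estimate only says how wide the stored characters are; it does not measure how long the conversion takes.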


