Re: multiprocessing shows no benefit

2017-10-20 Thread Thomas Nyberg
Correct me if I'm wrong, but at a high level you appear to basically
just have a mapping of strings to values and you are then shifting all
of those values by a fixed constant (in this case, `z = 5`). Why are you
using a dict at all? It would be better to use something like a numpy
array or a series from pandas. E.g. something like this without
multiprocessing:

-
import pandas as pd
from timeit import default_timer as timer

s = pd.Series(
    xrange(10),
    index=[str(val) for val in xrange(10)])

z = 5
start = timer()
x = s - z
duration = float(timer() - start)
print duration, len(x), len(x) / duration
-

Then if you wanted to multiprocess it, you could basically just split
the series into num_cpu pieces and then concatenate results afterwards.
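
A minimal sketch of that split-and-concatenate idea, written for Python 3 with only the standard library. A list of (key, value) pairs stands in for the pandas Series, and the names split_items, shift_chunk and parallel_shift are made up for illustration; none of them come from the original code:

```python
from multiprocessing import Pool


def split_items(items, pieces):
    """Cut a list into at most `pieces` contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // pieces))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def shift_chunk(task):
    """Worker: subtract z from every value in one chunk of (key, value) pairs."""
    chunk, z = task
    return {key: value - z for key, value in chunk}


def parallel_shift(data, z, num_cpu):
    """Split `data` (a dict), shift each piece in a worker, merge the results."""
    chunks = split_items(list(data.items()), num_cpu)
    with Pool(num_cpu) as pool:
        parts = pool.map(shift_chunk, [(chunk, z) for chunk in chunks])
    merged = {}
    for part in parts:
        merged.update(part)
    return merged
```

On platforms that spawn rather than fork worker processes, the call to parallel_shift should sit under an `if __name__ == "__main__":` guard. Every chunk is pickled on the way to a worker and again on the way back, so for an operation this cheap the transfer can easily cost more than the subtraction itself.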

Though I do agree with others here that the operation itself is so
simple that IPC might be a drag no matter what.

Cheers,
Thomas
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: efficient way to get a sufficient set of identifying attributes

2017-10-20 Thread Robin Becker

On 19/10/2017 17:50, Stefan Ram wrote:

Robin Becker  writes:

...
this sort of makes sense for single attributes, but ignores the possibility of
combining the attributes to make the checks more discerning.


   What I wrote also applies to compound attributes
   (sets of base attributes).

   When there are n base attributes, one can form 2^n-1
   compound attributes from them, or 2^n-1-n proper compound
   attributes. Therefore, a combinatoric explosion might impede
   the brute-force approach. A heuristics might start to
   explore combinations of keys with the best s/l ratio first
   and/or use preferences for certain fields set by a human.



all good


   In database design, the keys are usually chosen by a
   human database designer using world knowledge. It sounds
   as if you want to have the computer make such a choice
   using only the information in the table as knowledge.


I think I am tending towards the chosen by real world knowledge approach :(



   Your "identifying attributes" are called "super keys"
   in database science. You probably want minimal
   identifying attribute sets (without unneeded attributes),
   which are called "candidate keys".
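
A brute-force sketch of the search for candidate keys described above, assuming the table is a list of dicts that all share the same attribute names. The function name candidate_keys and the sample rows are invented for illustration. Combinations are tried smallest first, and any superset of a key already found is skipped, so only minimal sets are reported:

```python
from itertools import combinations


def candidate_keys(rows, attributes):
    """Return every minimal attribute set whose values are unique per row."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # A superset of a known key identifies rows too, but is not minimal.
            if any(set(key) <= set(combo) for key in keys):
                continue
            seen = {tuple(row[name] for name in combo) for row in rows}
            if len(seen) == len(rows):
                keys.append(combo)
    return keys


rows = [
    {"first": "Ann", "last": "Lee", "city": "Oslo"},
    {"first": "Ann", "last": "Kim", "city": "Oslo"},
    {"first": "Bob", "last": "Lee", "city": "Rome"},
]
keys = candidate_keys(rows, ["first", "last", "city"])
```

With n attributes this still visits up to 2^n-1 combinations in the worst case, so it only suits small attribute counts; the heuristics mentioned earlier would be needed beyond that.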



thanks for this and the reference below.



   So, now you can find and read literature, such as:

Journal of Al-Nahrain University
Vol.13 (2), June, 2010, pp.247-255
Science
247
Automatic Discovery Of Candidate In The Relational
 Databases Keys By Using Attributes Sets Closure
Yasmeen F. Al-ward
Department of Computer Science, College of Science,
Al-Nahrain University.

   (The title was copied by me as found, the contents is
   in the web and makes more sense than the title.)




--
Robin Becker



integer copy

2017-10-20 Thread ast

Hello, I tried the following:

import copy

a = 5
b = copy.copy(a)

a is b
True

I was expecting False

I am aware that it is useless to copy an integer
(or any immutable type). 


I know that for small integers, there is always a
single integer object in memory, and that for larger
ones there may be many.

a = 7
b = 7
a is b
True

a = 56543
b = 56543
a is b
False

But it seems that Python refuses to create two different
small integer objects with the same value, even if you
request it with:

a = 5
b = copy.copy(a)

any comments ?




Re: integer copy

2017-10-20 Thread ast


"ast" wrote in message
news:59e9b419$0$3602$426a7...@news.free.fr...

It doesn't work for large integers either, which is
even more disturbing

a = 6555443
b = copy.copy(a)
a is b

True 




Re: integer copy

2017-10-20 Thread Alain Ketterlin
"ast"  writes:

> "ast" wrote in message
> news:59e9b419$0$3602$426a7...@news.free.fr...
>
> Neither works for large integers which is
> even more disturbing
>
> a = 6555443
> b = copy.copy(a)
> a is b
>
> True 

In copy.py:

| [...]
| def _copy_immutable(x):
|     return x
| for t in (type(None), int, long, float, bool, str, tuple,
|           frozenset, type, xrange, types.ClassType,
|           types.BuiltinFunctionType, type(Ellipsis),
|           types.FunctionType, weakref.ref):
|     d[t] = _copy_immutable
| [...]

and d[t] is what decides how to copy, via _copy_dispatch.

So none of the types listed above will be actually copied.
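
A quick way to see this, sketched for Python 3 (where long and xrange no longer exist but the dispatch works the same way). The variable names are only for illustration:

```python
import copy

big = 6555443
pair = (1, "a")
items = [1, 2, 3]

# Immutable types: copy.copy hands back the very same object.
same_int = copy.copy(big) is big
same_tuple = copy.copy(pair) is pair

# Mutable types: a new object with equal contents is built.
new_list = copy.copy(items) is not items
equal_list = copy.copy(items) == items
```

All four names end up True, which matches the `a is b` result reported earlier in the thread.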

It should be documented more precisely. You should file a
documentation-improvement request.

Also, the doc says:

| This version does not copy types like module, class, function, method,
| nor stack trace, stack frame, nor file, socket, window, nor array, nor
| any similar types.

But who knows what "similar types" are...

-- Alain.


Re: integer copy

2017-10-20 Thread Thomas Nyberg
On 10/20/2017 10:30 AM, ast wrote:
> I am aware that it is useless to copy an integer
> (or any immutable type).
> 
>   ...
> 
> any comments ?
> 
>
Why is this a problem for you?

Cheers,
Thomas


Re: integer copy

2017-10-20 Thread ast


"Thomas Nyberg" wrote in message
news:mailman.378.1508491267.12137.python-l...@python.org...

On 10/20/2017 10:30 AM, ast wrote:

I am aware that it is useless to copy an integer
(or any immutable type).

...

any comments ?



Why is this a problem for you?

Cheers,
Thomas


It is not. It was a test. 




Re: integer copy

2017-10-20 Thread Stephen Tucker
ast,

For what it's worth,

After

a = 5
b = 5
afloat = float(a)
bfloat = float(b)

afloat is bfloat

returns False.

Stephen Tucker.


On Fri, Oct 20, 2017 at 9:58 AM, ast  wrote:

>
> "ast" wrote in message
> news:59e9b419$0$3602$426a7...@news.free.fr...
>
> Neither works for large integers which is
> even more disturbing
>
> a = 6555443
>
> b = copy.copy(a)
> a is b
>
> True


Re: integer copy

2017-10-20 Thread Thomas Jollans
On 2017-10-20 10:58, ast wrote:
> 
> "ast" wrote in message
> news:59e9b419$0$3602$426a7...@news.free.fr...
> 
> Neither works for large integers which is
> even more disturbing
> 
> a = 6555443
> b = copy.copy(a)
> a is b
> 
> True


Why is this disturbing? As you said, it'd be completely pointless.

As to what's going on: copy.copy does not make any attempt to copy
immutable types. That's all there is to it.

Read the source if you want to know how this is done.
https://github.com/python/cpython/blob/master/Lib/copy.py#L111



-- 
Thomas Jollans


Re: Re: Problem with StreamReaderWriter on 3.6.3? SOLVED

2017-10-20 Thread Peter via Python-list

Thanks MRAB, your solution works a treat.

I'm replying to the list so others can know that this solution works. 
Note that sys.stderr.detach() is only available in Python >= 3.1, so one might 
need to do some version checking to get it to work properly under both 
Python 2 and Python 3. It can also interfere with buffering, and therefore 
with the order of stderr output relative to stdout.


Thanks again.

Peter


On 20/10/2017 10:19 AM, MRAB wrote:

On 2017-10-19 22:46, Peter via Python-list wrote:

I came across this code in Google cpplint.py, a Python script for
linting C++ code. I was getting funny results with Python 3.6.3, but it
worked fine under 2.7.13

I've tracked the problem to an issue with StreamReaderWriter; the
traceback and error never shows under 3. The _cause_ of the error is
clear (xrange not in Py3), but I need the raised exception to show.

I'm running Python 3.6.3 32bit on Windows 10. I also get the same
results on Python 3.5.2 on Ubuntu (under WSL)

I'm not super familiar with rebinding stderr with codecs, but I'm
guessing they are doing it because of some Unicode issue they may have
been having.

I have a workaround - drop the rebinding - but it seems like there might
be an error in StreamReaderWriter.
Do other people see the same behaviour?
Is there something I'm not seeing or understanding?
Would I raise it on issue tracker?

Peter

--

import sys
import codecs

sys.stderr = codecs.StreamReaderWriter(
    sys.stderr, codecs.getreader('utf8'), codecs.getwriter('utf8'),
    'replace')

# This should work fine in Py 2, but raise an exception in Py3
# But instead Py3 "swallows" the exception and it is never seen
print(xrange(1, 10))

# Although this line doesn't show in Py 3 (as the script has silently crashed)
print("This line doesn't show in Py 3")

--

StreamReaderWriter is being passed an encoder which returns bytes 
(UTF-8), but the output stream that is being passed, to which it will 
be writing those bytes, i.e. the original sys.stderr, expects str.


I'd get the underlying byte stream of stderr using .detach():

sys.stderr = codecs.StreamReaderWriter(sys.stderr.detach(), 
codecs.getreader('utf8'), codecs.getwriter('utf8'), 'replace')
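
A hedged sketch of how the same idea might be made to work under both Python 2 and Python 3 without an explicit version check. The helper name wrap_utf8 is hypothetical; it detaches only when it is handed a text stream, which is the Python 3 case, and otherwise uses the stream as it is:

```python
import codecs
import io


def wrap_utf8(stream, errors="replace"):
    """Wrap the byte layer of `stream` in a UTF-8 StreamReaderWriter."""
    if isinstance(stream, io.TextIOBase):
        # Python 3 text stream: take the underlying byte buffer.
        # After detach() the original text wrapper must not be used again.
        raw = stream.detach()
    else:
        # Python 2 file object or any byte stream: write to it directly.
        raw = stream
    return codecs.StreamReaderWriter(
        raw, codecs.getreader("utf8"), codecs.getwriter("utf8"), errors)
```

It would be used as `sys.stderr = wrap_utf8(sys.stderr)`. Replacing sys.stderr this way bypasses the text layer's line buffering, which is one reason the ordering of stderr and stdout output can change.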







Re: integer copy

2017-10-20 Thread bartc

On 20/10/2017 10:09, Thomas Jollans wrote:


Read the source if you want to know how this is done.
https://github.com/python/cpython/blob/master/Lib/copy.py#L111


Good, informative comment block at the top of the type that you don't 
see often. Usually they concern themselves with licensing or with 
apportioning credits.


Still, copying looks pretty complicated...

--
bartc


Re: integer copy

2017-10-20 Thread Thomas Jollans
On 2017-10-20 13:17, bartc wrote:
> On 20/10/2017 10:09, Thomas Jollans wrote:
> 
>> Read the source if you want to know how this is done.
>> https://github.com/python/cpython/blob/master/Lib/copy.py#L111
> 
> Good, informative comment block at the top of the type that you don't
> see often. Usually they concern themselves with licensing or with
> apportioning credits.
> 
> Still, copying looks pretty complicated...
> 

It is usual for Python modules, especially in the stdlib, to start with
an informative docstring. But you're right, this one is more detailed
and informative than most.


-- 
Thomas Jollans


Re: Efficient counting of results

2017-10-20 Thread Israel Brewster
On Oct 19, 2017, at 5:18 PM, Steve D'Aprano  wrote:
> What t1 and t2 are, I have no idea. Your code there suggests that they are
> fields in your data records, but the contents of the fields, who knows?

t1 and t2 are *independent* timestamp fields. My apologies - I made the 
obviously false assumption that it was clear they were timestamps, or at least 
times based on the fact I was calculating "minutes late" based on them.

> 
> 
>> d10, w10, m10, y10, d25, w25, m25 AND y25
> 
> Try using descriptive variable names rather than these cryptic codes.

I did. In my original post I showed the table with names like "t1 1-5min". 
Granted, that's for illustration purposes, not actual code, but still, more 
descriptive. These codes were just to keep consistent with the alternative data 
format suggested :-)

> 
> I don't understand what is *actually* being computed here -- you say that t1
> is "on time" and t2 is "5 minutes late", but that's a contradiction: how can
> a single record be both on time and 5 minutes late?

Easily: because the record contains two DIFFERENT times. Since you want something 
more concrete, we're talking departure and arrival times here. Quite easy to depart 
on-time, but arrive late, or depart late but arrive on-time.

> It also contradicts your statement that it is *date* and *key* that determines
> which late bin to use.

I never made such a statement. I said they are used to determine "WHAT on-time 
IS for the record", not WHETHER the record is on-time or not, and certainly not 
which late bin to use. To put it a different way, those are the key to a lookup 
table that tells me what T1 and T2 are *supposed* to be in order for *each one* 
to be on time.

So, for example, to completely make up some data (since it doesn't matter in 
the slightest), date could be 10/5/17 with a key of 42 (Let's say that is a 
driver ID to keep things concrete for you), and using those values tells me 
(via the lookup table) that on 10/5/17, 42 should have a T1 of 10:15 and a T2 
of 11:30. As we said, those would be departure and arrival times, so what we're 
saying is that on 10/5, driver #42 was *scheduled* to depart at 10:15 and 
arrive at their destination at 11:30. So if T1 was *actually* 10:14, and T2 
was, say 11:35, then I could say that T1 was on-time (actually, a minute early, 
but that doesn't matter), while T2 was 5 minutes late. Maybe traffic was 
horrible, or he had a flat. 
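
The driver example can be sketched in code as follows. This is only an illustration of the if-statement approach under stated assumptions: the function name lateness_bin, the bin labels, and the catch-all "16+ min late" bin are invented here, and sub-minute differences are dropped:

```python
from collections import Counter
from datetime import datetime


def lateness_bin(scheduled, actual):
    """Classify one timestamp against its scheduled value, in whole minutes."""
    minutes = int((actual - scheduled).total_seconds() // 60)
    if minutes <= 0:
        return "on time"  # early counts as on time
    if minutes <= 5:
        return "1-5 min late"
    if minutes <= 15:
        return "6-15 min late"
    return "16+ min late"


# Driver #42 on 10/5/17: scheduled 10:15 -> 11:30, actual 10:14 -> 11:35.
scheduled = {"departure": datetime(2017, 10, 5, 10, 15),
             "arrival": datetime(2017, 10, 5, 11, 30)}
actual = {"departure": datetime(2017, 10, 5, 10, 14),
          "arrival": datetime(2017, 10, 5, 11, 35)}

counts = Counter()
for event in ("departure", "arrival"):
    counts[(event, lateness_bin(scheduled[event], actual[event]))] += 1
```

Here the scheduled values are written inline; in the real report they would come from the lookup keyed on date and key.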

However, if the date changed to 9/1/17 (with the key still being 42), there 
could be a completely different schedule, with completely different "late" 
results, even if the *actual* values of t1 and t2 don't change. Maybe he was 
supposed to make the run early on that day, say 10:00-11:15, but forgot and 
left at the same time as he was used to, thereby making him 14 minutes late 
departing and really late arriving, or something.

> Rather, it seems that date and key are irrelevant and
> can be ignored, it is only t1 and t2 which determine which late bins to
> update.

Except that then we have no way to know what t1 and t2 *should* be. You 
apparently made the assumption that t1 and t2 should always be some fixed 
value. In fact, what t1 and t2 should be varies based on date and key (see the 
driver example above, or Chris Angelico's pizza example also works well). For 
any given date, there are dozens of different keys with different expected 
values of t1 and t2 (in the pizza example Chris gave the key might be order 
number), and for any given key, the expected value of t1 and t2 could vary 
based on what date it is (say we restart order numbers from 1 each day to make 
it easy to know how many orders we've done that day, or, of course, same driver 
different day, depending on which example you prefer).

> 
> Another question: you're talking about *dates*, which implies a resolution of
> 1 day, but then you talk about records being "five minutes late" which
> implies a resolution of at least five minutes and probably one minute, if not
> seconds or milliseconds. Which is it? My guess is that you're probably
> talking about *timestamps* (datetimes) rather than *dates*.

As stated, the data has two timestamp fields T1 and T2. So yes, the resolution 
of the data is "one minute" (we ignore sub-minute timings). However (and this 
addresses your understanding below as well), I am trying to get data for the 
date, week-to-date, month-to-date, and year-to-date. So there are four different 
"date" resolution bins in addition to the "minute" resolution bins.

Perhaps a better approach to explaining is to pose the question the report is 
trying to answer:

For the given date, how many departures were on time? How many were 1-5 minutes 
late? 6-15 minutes late? What about this week: how many on-time, 1-5 minutes 
late, etc? What about this entire month (including the given date)? What about 
this year (again, including the given date and month)? How about arrivals - 
same questions. 

As you can hopefully see now, if a departure happened

Save non-pickleable variable?

2017-10-20 Thread Israel Brewster
tldr: I have an object that can't be pickled. Is there any way to do a "raw" 
dump of the binary data to a file, and re-load it later?

Details: I am using a java (I know, I know - this is a python list. I'm not 
asking about the java - honest!) library (Jasper Reports) that I access from 
python using py4j (www.py4j.org). At one point in my 
code I call a java function which, after churning on some data in a database, 
returns an object (a jasper report object populated with the final report data) 
that I can use (via another java call) to display the results in a variety of 
formats (HTML, PDF, XLS, etc). At the time I get the object back, I use it to 
display the results in HTML format for quick display, but the user may or may 
not also want to get a PDF copy in the near future. 

Since it can take some time to generate this object, and also since the data 
may change between when I do the HTML display and when the user requests a PDF 
(if they do at all), I would like to save this object for potential future 
re-use. Because it might be large, and there is actually a fairly good chance 
the user won't need it again, I'd like to save it in a temp file (that would be 
deleted when the user logs out) rather than in memory. Unfortunately, since 
this is an object created by and returned from a java function, not a native 
python object, it is not able to be pickled (as the suggestion typically is), 
at least to my knowledge.

Given that, is there any way I can write out the "raw" binary data to a file, 
and read it back in later? Or some other way to be able to save this object? It 
is theoretically possible that I could do it on the java side, i.e. the library 
may have some way of writing out the file, but obviously I wouldn't expect 
anyone here to know anything about that - I'm just asking about the python side 
:-)

---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---






Re: Efficient counting of results

2017-10-20 Thread MRAB

On 2017-10-20 18:05, Israel Brewster wrote:[snip]

In a sense, in that it supports my initial approach.

As Stefan Ram pointed out, there is nothing wrong with the solution I have: simply using 
if statements around the calculated lateness of t1 and t2 to increment the appropriate 
counters. I was just thinking there might be tools to make the job easier/cleaner/more 
efficient. From the responses I have gotten, it would seem that that is likely not the 
case, so I'll just say "thank you all for your time", and let the matter rest.

It occurred to me that it might be more efficient to start with the 
year-to-date first.


The reasoning is that over time the number of old entries will increase, 
so if you see that a timestamp isn't in this period_of_time, then it's 
not in any smaller_period_of_time either, so you can short-circuit.


Compare doing these for an old entry:

Day first: this day? no; this week? no; this month? no; this year? no.

Year first: this year? no.
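
A sketch of that year-first short-circuit, under a few assumptions that are not in the original posts: weeks start on Monday, "to-date" means no later than the report date, and a week that straddles New Year is ignored. The function name periods_containing is made up. Week-to-date is checked on its own because a week can straddle a month boundary:

```python
from datetime import date, timedelta


def periods_containing(record_date, today):
    """Return which to-date periods of `today` include `record_date`."""
    # Largest period first: anything outside year-to-date is in no smaller one.
    if record_date.year != today.year or record_date > today:
        return []
    hits = ["year"]
    week_start = today - timedelta(days=today.weekday())  # Monday of this week
    if record_date >= week_start:
        hits.append("week")
    if record_date.month == today.month:
        hits.append("month")
        if record_date == today:
            hits.append("day")
    return hits
```

An old record costs a single comparison, while a record from the report date itself passes every check.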


Re: Save non-pickleable variable?

2017-10-20 Thread MRAB

On 2017-10-20 18:19, Israel Brewster wrote:

tldr: I have an object that can't be pickled. Is there any way to do a "raw" 
dump of the binary data to a file, and re-load it later?

Details: I am using a java (I know, I know - this is a python list. I'm not asking 
about the java - honest!) library (Jasper Reports) that I access from python using 
py4j (www.py4j.org). At one point in my code I call a 
java function which, after churning on some data in a database, returns an object (a 
jasper report object populated with the final report data) that I can use (via 
another java call) to display the results in a variety of formats (HTML, PDF, XLS, 
etc). At the time I get the object back, I use it to display the results in HTML 
format for quick display, but the user may or may not also want to get a PDF copy in 
the near future.

Since it can take some time to generate this object, and also since the data 
may change between when I do the HTML display and when the user requests a PDF 
(if they do at all), I would like to save this object for potential future 
re-use. Because it might be large, and there is actually a fairly good chance 
the user won't need it again, I'd like to save it in a temp file (that would be 
deleted when the user logs out) rather than in memory. Unfortunately, since 
this is an object created by and returned from a java function, not a native 
python object, it is not able to be pickled (as the suggestion typically is), 
at least to my knowledge.

Given that, is there any way I can write out the "raw" binary data to a file, 
and read it back in later? Or some other way to be able to save this object? It is 
theoretically possible that I could do it on the java side, i.e. the library may have 
some way of writing out the file, but obviously I wouldn't expect anyone here to know 
anything about that - I'm just asking about the python side :-)

As far as I can tell, what you're getting is a Python object that's a 
proxy to the actual Java object in the Java Virtual Machine. The Python 
side might be talking to the Java side via a socket or a pipe, and not 
know anything about, or have access to, the internals of the Java object.


In fact, you can't even be sure how a particular Python object is laid 
out in memory without reading the source code.



Re: multiprocessing shows no benefit

2017-10-20 Thread Jason
Yes, it is a simplification and I am using numpy at lower layers. You correctly 
observe that it's a simple operation, but it's not a shift; it's actually 
multidimensional vector algebra in numpy. So the - is more conceptual and takes 
the place of hundreds of subtractions. But the example does demonstrate the 
complexity and how I can divide the problem up conceptually. 



Re: Save non-pickleable variable?

2017-10-20 Thread Israel Brewster
On Oct 20, 2017, at 11:09 AM, Stefan Ram  wrote:
> 
> Israel Brewster  writes:
>> Given that, is there any way I can write out the "raw" binary
>> data to a file
> 
>  If you can call into the Java SE library, you can try
> 
> docs.oracle.com/javase/9/docs/api/java/io/ObjectOutputStream.html#writeObject-java.lang.Object-
> 
>  , e.g.:
> 
> public static void save
> ( final java.lang.String path, final java.lang.Object object )
> { try
>   { final java.io.FileOutputStream fileOutputStream
>     = new java.io.FileOutputStream( path );
> 
>     final java.io.ObjectOutputStream objectOutputStream
>     = new java.io.ObjectOutputStream( fileOutputStream );
> 
>     objectOutputStream.writeObject( object );
> 
>     objectOutputStream.close(); }
> 
>   catch( final java.io.IOException iOException )
>   { /* application-specific code */ }}
> 
> 
>> , and read it back in later?
> 
>  There's a corresponding »readObject« method in
>  »java.io.ObjectInputStream«. E.g.,
> 
> public static java.lang.Object load( final java.lang.String path )
> {
>   java.io.FileInputStream fileInputStream = null;
> 
>   java.io.ObjectInputStream objectInputStream = null;
> 
>   java.lang.Object object = null;
> 
>   try
>   { fileInputStream = new java.io.FileInputStream( path );
> 
>     objectInputStream = new java.io.ObjectInputStream
>     ( fileInputStream );
> 
>     object = objectInputStream.readObject();
> 
>     objectInputStream.close(); }
> 
>   catch( final java.io.IOException iOException )
>   { java.lang.System.out.println( iOException ); }
> 
>   catch
>   ( final java.lang.ClassNotFoundException classNotFoundException )
>   { java.lang.System.out.println( classNotFoundException ); }
> 
>   return object; }
> 
>  However, it is possible that not all objects can be
>  meaningfully saved and restored in that way.

Thanks for the information. In addition to what you suggested, it may be 
possible that the Java library itself has methods for saving this object - I 
seem to recall the methods for displaying the data having options to read from 
files (rather than from the Java object directly like I'm doing), and it 
wouldn't make sense to load from a file unless you could first create said file 
by some method. I'll investigate solutions java-side.

---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---



Re: Efficient counting of results

2017-10-20 Thread Steven D'Aprano
On Fri, 20 Oct 2017 09:05:15 -0800, Israel Brewster wrote:

> On Oct 19, 2017, at 5:18 PM, Steve D'Aprano 
> wrote:
>> What t1 and t2 are, I have no idea. Your code there suggests that they
>> are fields in your data records, but the contents of the fields, who
>> knows?
> 
> t1 and t2 are *independent* timestamp fields. My apologies - I made the
> obviously false assumption that it was clear they were timestamps, or at
> least times based on the fact I was calculating "minutes late" based on
> them.

It wasn't clear to me whether they were timestamps, flags, or something 
else. For example, you said:

"if the date of the first record was today, t1 was on-time,
and t2 was 5 minutes late"

which suggested to me that the first record is a timestamp, and t1 and t2 
were possibly enums or flags:

(date=Date(2017, 10, 21), key=key, t1=ON_TIME, t2=FIVE_MINUTES_LATE)

or possibly:

(date=Date(2017, 10, 21), key=key, t1=True, t2=False)

for example.


[...]
> Easily: because the record contains two DIFFERENT times. Since you want
> more concrete, we're talking departure and arrival times here. Quite
> easy to depart on-time, but arrive late, or depart late but arrive
> on-time.

Ah, the penny drops!

If you had called them "arrival" and "departure" instead of "t1" and 
"t2", it would have been significantly less mysterious.

Sometimes a well-chosen variable name is worth a thousand words of 
explanation.


>> It also contradicts your statement that it is *date* and *key* that
>> determines which late bin to use.
> 
> I never made such a statement. I said they are used to determine "WHAT
> on-time IS for the record", not WHETHER the record is on-time or not,
> and certainly not which late bin to use. To put it a different way,
> those are the key to a lookup table that tells me what T1 and T2 are
> *supposed* to be in order for *each one* to be on time.

Ah, that makes sense now. Thank you for explaining.


[...]
>> Rather, it seems that date and key are irrelevant and can be ignored,
>> it is only t1 and t2 which determine which late bins to update.
> 
> Except that then we have no way to know what t1 and t2 *should* be.

Yes, that makes sense now. Your example of the driver runs really helped 
clarify what you are computing.


> You
> apparently made the assumption that t1 and t2 should always be some
> fixed value.

I tried to interpret your requirements as best I could from your 
description. Sorry that I failed so badly.



[...]
> Perhaps a better approach to explaining is to pose the question the
> report is trying to answer:

That would have been helpful.

[...]
> As Stefan Ram pointed out, there is nothing wrong with the solution I
> have: simply using if statements around the calculated lateness of t1
> and t2 to increment the appropriate counters. I was just thinking there
> might be tools to make the job easier/cleaner/more efficient. From the
> responses I have gotten, it would seem that that is likely not the case,
> so I'll just say "thank you all for your time", and let the matter rest.

No problem. Sorry I couldn't be more helpful and glad you have a working 
solution.



-- 
Steven D'Aprano


grapheme cluster library

2017-10-20 Thread Rustom Mody
Is there a recommended library for manipulating grapheme clusters?

In particular, in devanagari
क् + ि = कि 
in (pseudo)unicode names
KA-letter + I-sign = KI-composite-letter

I would like to be able to handle KI as a letter rather than two code-points.
I can of course write an automaton to do the grouping, but I'm guessing that it's already
available some place…


Re: multiprocessing shows no benefit

2017-10-20 Thread Michele Simionato
There is a trick that I use when data transfer is the performance killer. Just 
save your big array first (for instance in an .hdf5 file) and send the 
workers the indices needed to retrieve the portion of the array they are 
interested in, instead of the actual subarray.
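
A standard-library sketch of that trick, with a raw binary file of doubles standing in for the .hdf5 file (an assumption made to avoid extra dependencies). The names save_values, load_slice and work are illustrative only. Each task is just a path and two indices, so almost nothing is pickled on the way to a worker:

```python
import array

ITEMSIZE = array.array("d").itemsize  # bytes per stored double


def save_values(values, path):
    """Write the whole sequence to disk once, before any worker starts."""
    with open(path, "wb") as handle:
        array.array("d", values).tofile(handle)


def load_slice(path, start, stop):
    """Read only items start..stop-1 back from the file."""
    chunk = array.array("d")
    with open(path, "rb") as handle:
        handle.seek(start * ITEMSIZE)
        chunk.fromfile(handle, stop - start)
    return chunk


def work(task):
    """Worker: receives (path, start, stop), not the data itself."""
    path, start, stop = task
    return sum(load_slice(path, start, stop))
```

A pool would then be driven with something like `pool.map(work, [(path, 0, 500000), (path, 500000, 1000000)])`, with the index ranges chosen to suit the data.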

Anyway there are cases where multiprocessing will never help, since the 
operation is too fast with respect to the overhead involved in multiprocessing. 
In that case just give up and think about ways of changing the original problem.


Re: grapheme cluster library

2017-10-20 Thread Chris Angelico
On Sat, Oct 21, 2017 at 3:25 PM, Stefan Ram  wrote:
> Rustom Mody  writes:
>>Is there a recommended library for manipulating grapheme clusters?
>
>   The Python Library has a module "unicodedata", with functions like:
>
> |unicodedata.normalize( form, unistr )
> |
> |Returns the normal form »form« for the Unicode string »unistr«.
> |Valid values for »form« are »NFC«, »NFKC«, »NFD«, and »NFKD«.
>
>   . I don't know whether the transformation you are looking for
>   is one of those.

No, that's at a lower level than grapheme clusters.

Rustom, have you looked on PyPI? There are a couple of hits, including
one simply called "grapheme".
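
A small illustration of why normalization does not help here, plus a deliberately naive grouping pass. The function naive_clusters is a rough approximation written for this example: it glues any combining mark onto the preceding character and does not implement the full Unicode segmentation rules, so a dedicated library remains the better choice:

```python
import unicodedata


def naive_clusters(text):
    """Group each character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Unicode categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


ki = "\u0915\u093f"  # DEVANAGARI LETTER KA + VOWEL SIGN I
code_points = len(ki)
after_nfc = len(unicodedata.normalize("NFC", ki))
grouped = naive_clusters("\u0915\u093f\u0924\u093e\u092c")
```

Both code_points and after_nfc are 2, since there is no precomposed KI for NFC to produce, while the grouping pass treats KI as a single unit.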

ChrisA