How to remove subset from a file efficiently?
Hi all,

I have two files:

  - PSP320.dat (quite a large list of mobile numbers),
  - CBR319.dat (a subset of the above, a list of barred numbers)

    # head PSP320.dat CBR319.dat
    ==> PSP320.dat <==
    96653696338
    96653766996
    96654609431
    96654722608
    96654738074
    96655697044
    96655824738
    96656190117
    96656256762
    96656263751

    ==> CBR319.dat <==
    96651131135
    96651131135
    96651420412
    96651730095
    96652399117
    96652399142
    96652399142
    96652399142
    96652399160
    96652399271

Objective: to remove the numbers present in the barred list from the
PSP file.

    $ ls -lh PSP320.dat CBR319.dat
    ...  56M Dec 28 19:41 PSP320.dat
    ... 8.6M Dec 28 19:40 CBR319.dat

    $ wc -l PSP320.dat CBR319.dat
    4,462,603 PSP320.dat
      693,585 CBR319.dat

I wrote the following in python to do it:

    #: c01:rmcommon.py
    barredlist = open(r'/home/sjd/python/wip/CBR319.dat', 'r')
    postlist = open(r'/home/sjd/python/wip/PSP320.dat', 'r')
    outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')

    # reading it all in one go, so as to avoid frequent disk accesses (assume machine has plenty memory)
    barredlist.read()
    postlist.read()

    #
    for number in postlist:
        if number in barrlist:
            pass
        else:
            outfile.write(number)

    barredlist.close(); postlist.close(); outfile.close()
    #:~

The above code simply takes too long to complete. If I were to do a
diff -y PSP320.dat CBR319.dat, catch the '<' & clean it up with
sed -e 's/\([0-9]*\) * PSP-CBR.dat it takes <4 minutes to complete.

I wrote the following in bash to do the same:

    #!/bin/bash

    ARGS=2

    if [ $# -ne $ARGS ]   # takes two arguments
    then
        echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
        echo; echo "eg.: `basename $0` PSP320.dat CBR319.dat"; echo;
        echo "NOTE: first argument: PSP file, second: CBR file";
        echo "      this script _does_ _no_ input validation!"
        exit 1
    fi;

    # fix prefix; cost: 12.587 secs
    cat $1 | sed -e 's/^0*/966/' > $1.good
    cat $2 | sed -e 's/^0*/966/' > $2.good

    # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
    sort $1.good > $1.sorted
    sort $2.good > $2.sorted

    # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
    diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

    # remove trailing junk [spaces & <]; cost: 1 min 3 secs
    cat $1.filtered | sed -e 's/\([0-9]*\) * $1.cleaned

    # remove intermediate files, good, sorted, filtered
    rm -f *.good *.sorted *.filtered
    #:~

...but strangely, there's a discrepancy between the two results, the
reason for which I can't figure out!

Needless to say, I'm utterly new to python and my programming skills &
know-how are rudimentary. Any help will be genuinely appreciated.

--
fynali
--
http://mail.python.org/mailman/listinfo/python-list
Re: How to remove subset from a file efficiently?
The code is down to 5 lines!

    #!/usr/bin/python

    barred = set(open('/home/sajid/python/wip/CBR319.dat'))
    postpaid_file = open('/home/sajid/python/wip/PSP320.dat')
    outfile = open('/home/sajid/python/wip/PSP-CBR.dat', 'w')

    outfile.writelines(number for number in postpaid_file if number not in barred)

    postpaid_file.close(); outfile.close()

Awesome! (-: Thanks a ton Fredrik, Steve.

    $ time ./cleanup.py

    real    0m11.048s
    user    0m5.232s
    sys     0m0.584s

But there still seems to be that discrepancy; I will check and update
back here.

Thank you all once again.
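For anyone reading this on a current interpreter, here is a minimal sketch of the same set-based filter, assuming Python 3. The function name `remove_barred` and the tiny sample files are made up for illustration; they are not the real PSP/CBR data. Lines are compared with their trailing newline, so a final line without a newline would not match its counterpart in the other file.

```python
import os
import tempfile


def remove_barred(postpaid_path, barred_path, out_path):
    """Write the lines of postpaid_path that do not appear in barred_path.

    Returns the number of lines written. (Hypothetical helper, not from the thread.)
    """
    with open(barred_path) as barred_file:
        # A set gives average constant-time membership tests;
        # duplicate barred numbers simply collapse into one entry.
        barred = set(barred_file)
    kept = 0
    with open(postpaid_path) as postpaid_file, open(out_path, "w") as outfile:
        for number in postpaid_file:
            if number not in barred:
                outfile.write(number)
                kept += 1
    return kept


# Tiny made-up sample standing in for the real files.
workdir = tempfile.mkdtemp()
psp = os.path.join(workdir, "psp.dat")
cbr = os.path.join(workdir, "cbr.dat")
out = os.path.join(workdir, "psp-cbr.dat")
with open(psp, "w") as handle:
    handle.write("96653696338\n96653766996\n96654609431\n")
with open(cbr, "w") as handle:
    handle.write("96653766996\n96653766996\n")
print(remove_barred(psp, cbr, out))
```

The `with` blocks close the files even if an error occurs, which replaces the explicit `close()` calls in the script above.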
Re: How to remove subset from a file efficiently?
    $ time fgrep -x -v -f CBR333 PSP333 > PSP-CBR.dat.fgrep

    real    0m31.551s
    user    0m16.841s
    sys     0m0.912s

    --

    $ time ./cleanup.py

    real    0m6.080s
    user    0m4.836s
    sys     0m0.408s

    --

    $ wc -l PSP-CBR.dat.fgrep PSP-CBR.dat.python
    3872421 PSP-CBR.dat.fgrep
    3872421 PSP-CBR.dat.python

Fantastic, at any rate the time is down from my initial ~4 min.!

Thank you Chris. The fgrep approach is clean and to the point; and one
more reason to love the *nix approach to handling everyday problems.

Fredrik's set|dict approach in Python above gives me one more reason to
love Python. And it is indeed fast, 5x!

Thank you all for all your help.

--
fynali
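Matching `wc -l` counts show that both outputs have the same number of lines, not that the lines are the same. A minimal sketch of a content check, assuming Python 3 and the standard `filecmp` module; the helper name `same_contents` and the sample files are made up for illustration.

```python
import filecmp
import os
import tempfile


def same_contents(path_a, path_b):
    """Return True when both files are byte-for-byte identical."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting the os.stat() signatures (type, size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)


# Made-up sample outputs: two identical files, and one with the same
# line count but a different number on the second line.
workdir = tempfile.mkdtemp()
fgrep_out = os.path.join(workdir, "out.fgrep")
python_out = os.path.join(workdir, "out.python")
other_out = os.path.join(workdir, "out.other")
samples = [
    (fgrep_out, "96653696338\n96654609431\n"),
    (python_out, "96653696338\n96654609431\n"),
    (other_out, "96653696338\n96654722608\n"),
]
for path, text in samples:
    with open(path, "w") as handle:
        handle.write(text)

print(same_contents(fgrep_out, python_out))
print(same_contents(fgrep_out, other_out))
```

On the shell side, `cmp` or `diff -q` on the two output files answers the same question.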
Re: How to remove subset from a file efficiently?
    $ cat cleanup_ray.py
    #!/usr/bin/python
    import itertools

    b = set(file('/home/sajid/python/wip/stc/2/CBR333'))

    file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfalse(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP333')))

    --

    $ time ./cleanup_ray.py

    real    0m5.451s
    user    0m4.496s
    sys     0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.
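A note for readers on Python 3: `itertools.ifilterfalse` was renamed to `itertools.filterfalse`, and the `file()` builtin is gone in favour of `open()`. Below is a minimal sketch of the same idea under that assumption, using in-memory `io.StringIO` objects and made-up numbers in place of the real files; the helper name `filter_barred` is hypothetical.

```python
import io
from itertools import filterfalse


def filter_barred(lines, barred):
    """Lazily yield the lines that are not members of the barred set."""
    # filterfalse keeps the items for which the predicate is false,
    # here: the items for which barred.__contains__(line) is False.
    return filterfalse(barred.__contains__, lines)


# In-memory stand-ins for the CBR and PSP files (made-up sample data).
barred = set(io.StringIO("96651131135\n96651420412\n"))
postpaid = io.StringIO("96651131135\n96653696338\n96651420412\n96653766996\n")
outfile = io.StringIO()

outfile.writelines(filter_barred(postpaid, barred))
print(outfile.getvalue(), end="")
```

Passing the bound method `barred.__contains__` avoids a Python-level function call per line, which is presumably where the small saving over the generator expression comes from.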
Re: How to remove subset from a file efficiently?
    $ ./cleanup.py
    Traceback (most recent call last):
      File "./cleanup.py", line 3, in ?
        import itertools
    ImportError: No module named itertools

    --

    $ time ./cleanup.py
      File "./cleanup.py", line 8
        outfile.writelines(number for number in postpaid_file if number not in barred)
                                    ^
    SyntaxError: invalid syntax

The earlier results I posted were run on my workstation, which has
Python 2.4.1,

    $ uname -a && python -V
    Linux sajid 2.6.13-15.7-smp #1 SMP Tue Nov 29 14:32:29 UTC 2005 i686 i686 i386 GNU/Linux
    Python 2.4.1

but the server on which the actual processing will be done has an older
version )-:

    $ uname -a && python -V
    Linux cactus 2.4.21-20.ELsmp #1 SMP Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
    Python 2.2.3

Is it possible to rewrite Raymond's or Fredrik's suggestions above so
that they run on Python 2.2.3 and still give me the time saving?

--
fynali
Re: How to remove subset from a file efficiently?
[bonono]
> Have you tried the explicit loop variant with psyco ?

Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?

[fynali]
> Needless to say, I'm utterly new to python and my programming
> skills & know-how are rudimentary.

(-:

--
fynali
Re: How to remove subset from a file efficiently?
    $ cat cleanup.py
    #!/usr/bin/python

    postpaid_file = open('/home/oracle/stc/test/PSP333')
    outfile = open('/home/oracle/stc/test/PSP-CBR.dat', 'w')

    barred = {}

    for number in open('/home/oracle/stc/test/CBR333'):
        barred[number] = None # just add it as a key

    outfile.writelines([number for number in postpaid_file if number not in barred])

    postpaid_file.close(); outfile.close()

    --

    $ time ./cleanup.py

    real    0m31.007s
    user    0m24.660s
    sys     0m3.550s

Can we say that using generators & newer Python _is_ faster?

--
fynali
Re: How to remove subset from a file efficiently?
    $ cat cleanup_use_psyco_and_list_compr.py
    #!/usr/bin/python

    import psyco
    psyco.full()

    postpaid_file = open('/home/sajid/python/wip/stc/2/PSP333')
    outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco', 'w')

    barred = {}

    for number in open('/home/sajid/python/wip/stc/2/CBR333'):
        barred[number] = None # just add it as a key

    outfile.writelines([number for number in postpaid_file if number not in barred])

    postpaid_file.close(); outfile.close()

    --

    $ time ./cleanup_use_psyco_and_list_compr.py

    real    0m39.638s
    user    0m5.532s
    sys     0m0.868s

This was run on my machine (w/ Python 2.4.1); I can't install psyco on
the actual server at the moment.

I guess using generators & newer Python is indeed faster|better.

--
fynali
Re: How to remove subset from a file efficiently?
    $ cat cleanup_use_psyco_and_list_compr.py
    #!/usr/bin/python

    import psyco
    psyco.full()

    postpaid_file = open('/home/sajid/python/wip/stc/2/PSP333')
    outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco', 'w')

    barred = {}

    for number in open('/home/sajid/python/wip/stc/2/CBR333'):
        barred[number] = None # just add it as a key

    for number in postpaid_file:
        if number not in barred:
            outfile.writelines(number)

    postpaid_file.close(); outfile.close()

    --

    $ time ./cleanup_use_psyco_and_list_compr.py

    real    0m24.293s
    user    0m22.633s
    sys     0m0.524s

Saves ~6 secs.

--
fynali
Re: How to remove subset from a file efficiently?
Sorry, please read that as ~15 secs.
Re: How to remove subset from a file efficiently?
    $ cat cleanup_use_psyco_and_list_compr.py
    #!/usr/bin/python

    #import psyco
    #psyco.full()

    postpaid_file = open('/home/sajid/python/wip/stc/2/PSP333')
    outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco', 'w')

    barred = {}

    for number in open('/home/sajid/python/wip/stc/2/CBR333'):
        barred[number] = None # just add it as a key

    for number in postpaid_file:
        if number not in barred:
            outfile.writelines(number)

    postpaid_file.close(); outfile.close()

    --

    $ time ./cleanup_use_psyco_and_list_compr.py

    real    0m22.587s
    user    0m21.653s
    sys     0m0.440s

Not using psyco is faster!

--
fynali
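The timings in this thread come from different machines and Python versions, so they are hard to compare directly. A rough sketch of how the three write strategies could be timed side by side on one interpreter, assuming Python 3 and the standard `timeit` module. The data is synthetic, the output goes to an in-memory `io.StringIO` rather than a disk file, and the explicit loop calls `write()` instead of the `writelines(number)` used above, so the results say nothing about the real PSP/CBR run.

```python
import io
import timeit

# Synthetic stand-ins: 20000 eleven-digit "numbers", every 7th one barred.
lines = ["%011d\n" % n for n in range(20000)]
barred = {"%011d\n" % n for n in range(0, 20000, 7)}


def with_list_comprehension():
    out = io.StringIO()
    # Builds the whole filtered list in memory before writing it.
    out.writelines([n for n in lines if n not in barred])
    return out.getvalue()


def with_generator():
    out = io.StringIO()
    # Feeds writelines() one line at a time, no intermediate list.
    out.writelines(n for n in lines if n not in barred)
    return out.getvalue()


def with_explicit_loop():
    out = io.StringIO()
    for n in lines:
        if n not in barred:
            out.write(n)
    return out.getvalue()


for variant in (with_list_comprehension, with_generator, with_explicit_loop):
    seconds = timeit.timeit(variant, number=5)
    print("%-26s %.4f s for 5 runs" % (variant.__name__, seconds))
```

All three variants produce the same output; only the way the filtered lines reach the file object differs.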
Re: how do "real" python programmers work?
Love it.

--
fynali
Re: Fredrik Lundh [was "Re: explicit self revisited"]
> You idiot.  Putting the word "official" in front of something doesn't
> mean it can't be FUD.  Especially when it is written by people such as
> yourself.  Have you not paid attention to anything happening in
> politics around the world during your lifetime?

Ridiculous boo-llshit!
Re: 2**2**2**2**2 wrong? Bug?
> > 19729 > > Did you count the 'L'? > (-:
Re: How do I remotely access Scheduled Tasks from Windows XP to Windows Server 2003?
On Jul 10, 4:51 am, kj7ny <[EMAIL PROTECTED]> wrote:
> On Jun 30, 10:55 am, "Roger Upole" <[EMAIL PROTECTED]> wrote:
> > "kj7ny" wrote:
> > > How can I access and manipulate Scheduled Tasks in Windows using
> > > Python?
> > >
> > > I have a Windows XP workstation running Python 2.4.4 using the
> > > win32all modules to control the windows services on multiple Windows
> > > 2003 servers.  It works great.
> > >
> > > However, I also need to remotely collect the settings for the
> > > scheduled tasks (on those same Windows 2003 servers) and then
> > > manipulate those task settings.
> > >
> > > At the very least, I need to find out which ones are enabled and then
> > > be able to disable and re-enable those tasks at will.  It would be
> > > better to be able to also detect the account each task runs as so that
> > > I could only disable selected tasks, but I'll take any help I can get.
> > >
> > > Thanks,
> >
> > Pywin32 comes with a module that lets you do this, win32com.taskscheduler.
> > You can use PyITaskScheduler.SetTargetComputer to access tasks on remote
> > machines.
> >
> > Roger
>
> I FINALLY found taskscheduler (with the help of your post).  I found
> it under
>
> ...\Python243\Lib\site-packages\win32comext\taskscheduler
>
> ... and, there seems to be a /test/ directory with some examples.
> Haven't tried them yet, but they should get me started.
>
> Thanks,

kj7ny, could you post what you find back here for the rest of us to
learn from?

Thanks.

s|a fynali
Re: win32com ppt embedded object
On Jul 10, 8:40 pm, Lance Hoffmeyer <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> I am trying to create some python code to edit embedded ppt slides and need
> some help.
>
> import win32com.client
> from win32com.client import constants
> import re
> import codecs,win32com.client
> import time
> import datetime
> import win32com.client.dynamic
> ##
> VARIOUS VARIABLES TO SET
> path = "C:\temp/"
> ##
> ##
>
> PPT=win32com.client.Dispatch("PowerPoint.Application")
> WB=PPT.Presentations.Open(path + "File.ppt")
> PPT.Visible=1
> PPTSLIDE= 29
>
> for Z in WB.Slides(29).Shapes:
>     if (Z.Type== 7):
>         ZZ=Z.OLEFormat.Object
>         WSHEET = ZZ.Worksheets(1)
>         WSHEET.Range("A1").Value = .50
>         WSHEET.Range("A1").NumberFormat="0%"
>
> Gives error:
>
> Traceback (most recent call last):
>   File "P:\Burke\TRACKERS\Ortho-McNeil\04 2007, 04-10 WAVE
> 4\Automation\Document1.py", line 23, in ?
>     WSHEET = ZZ.Worksheets(1)
>   File "C:\Program
> Files\Python\lib\site-packages\win32com\client\dynamic.py", line 489, in
> __getattr__
>     raise AttributeError, "%s.%s" % (self._username_, attr)
> AttributeError: .Worksheets
>
> Tool completed with exit code 1
>
> Why is ZZ unknown and how do I correct this?
>
> Thanks in advance,
>
> Lance

"""
How do I know which methods and properties are available?

Good question. This is hard! You need to use the documentation with the
products, or possibly a COM browser. Note however that COM browsers
typically rely on these objects registering themselves in certain ways,
and many objects do not do this. You are just expected to know.

The Python COM browser

PythonCOM comes with a basic COM browser that may show you the
information you need. Note that this package requires Pythonwin (ie, the
MFC GUI environment) to be installed for this to work.

There are far better COM browsers available - I tend to use the one that
comes with MSVC, or this one!

To run the browser, simply select it from the Pythonwin Tools menu, or
double-click on the file win32com\client\combrowse.py
"""

--
s|a fynali
How to programmatically insert pages into MDI.
Hi, this query is regarding automating page insertions in Microsoft
Document Imaging.

I have two sets of MDIs generated fortnightly: Invoices and their
corresponding Broadcast Certificates; about 150 of each.

My billing application can generate one big MDI with all 150 invoices
and another with all the Broadcast Certificates. At the moment I take
the pages of, say, invoice #1 and insert them into a new MDI (calling
it in001.mdi). Then I grab the corresponding Broadcast-Certificate
pages of invoice #1 and insert them into in001.mdi. I proceed to
complete the rest the same way (quite a pain).

Once done, I print each inx.mdi, setting various printer options
such as binding direction, staple & hole-punch etc.

What I would like to automate is the coupling of an Invoice & its
corresponding Broadcast-Certificate pages into a new, appropriately
named MDI & then printing it, all in one step; iterating over all the
x until done.

My billing app can be set to generate each invoice & broadcast
certificate separately with a convenient naming convention to aid in
program logic (e.g. inx.mdi & bcx.mdi), where x indicates the
corresponding invoice & respective broadcast certificates.

All help and advice will be most appreciated.

Thank you.

s|a fynali
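The pairing step on its own can be done with plain file-name matching, before any MDI handling comes into it. A minimal sketch in Python, assuming the files follow the inx.mdi / bcx.mdi convention with x being a number (in001.mdi, bc001.mdi, ...). The function name `pair_documents` and the sample names are hypothetical, and the actual page insertion and printing through Microsoft Document Imaging are not covered here.

```python
import re

# Assumed naming convention: in<digits>.mdi and bc<digits>.mdi.
INVOICE_RE = re.compile(r"^in(\d+)\.mdi$", re.IGNORECASE)
CERTIFICATE_RE = re.compile(r"^bc(\d+)\.mdi$", re.IGNORECASE)


def pair_documents(filenames):
    """Pair each invoice with the certificate that carries the same number.

    Returns (pairs, unmatched): a list of (invoice, certificate) tuples
    ordered by number, and a sorted list of invoices without a certificate.
    """
    invoices, certificates = {}, {}
    for name in filenames:
        match = INVOICE_RE.match(name)
        if match:
            invoices[match.group(1)] = name
            continue
        match = CERTIFICATE_RE.match(name)
        if match:
            certificates[match.group(1)] = name
    pairs = [(invoices[key], certificates[key])
             for key in sorted(invoices) if key in certificates]
    unmatched = sorted(name for key, name in invoices.items()
                       if key not in certificates)
    return pairs, unmatched


# Made-up directory listing; in practice this could come from os.listdir().
names = ["in001.mdi", "bc001.mdi", "in002.mdi", "bc002.mdi", "in003.mdi"]
pairs, unmatched = pair_documents(names)
print(pairs)
print(unmatched)
```

Reporting the unmatched invoices separately makes it easy to spot a missing certificate before anything is sent to the printer.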
Re: How to programmatically insert pages into MDI.
On Jul 24, 4:36 pm, fynali iladijas <[EMAIL PROTECTED]> wrote:
> Hi, this query is regarding automating page insertions in Microsoft
> Document Imaging.
>
> I have two sets of MDIs generated fortnightly: Invoices and their
> corresponding Broadcast Certificates; about 150 of each.
>
> My billing application can generate one big MDI with all 150 invoices
> and another with all the Broadcast Certificates.  At the moment I take
> the pages of, say invoice #1 and insert them into a new MDI (calling
> it in001.mdi).  Then I grab the corresponding Broadcast-Certificate-
> pages of invoice #1 and insert them into in001.mdi.  I proceed to
> complete the rest the same way (quite a pain).
>
> Once done, I print each inx.mdi, setting various printer options
> such as binding direction, staple & hole-punch etc.
>
> What I would like to automate is the coupling of an Invoice & its
> corresponding Broadcast-Certificate-pages into a new appropriately
> named MDI & then printing it into one step; iterating over all the
> x until done.
>
> My billing app can be set to generate each invoice & broadcast
> certificate separately with a convenient naming convention to aid in
> program logic (for eg. inx.mdi & bcx.mdi) where x
> indicates corresponding invoice & respective broadcast certificates.
>
> All help and advice will be most appreciated.
>
> Thank you.
>
> s|a fynali

)-:

--
s|a fynali