Re: how to optimize the below code with a helper function
    optype="set")
    return addLogFilename(d, LOG_DIR)

def run_tool(logfile, **kw):
    logger.info('%s would execute with %r', logfile, kw)

def addLogFilename(d, logdir):
    '''put the logfile name into the test case data dictionary'''
    for casename, args in d.items():
        args['logfile'] = os.path.join(logdir, casename + '.log')
    return d

def main():
    testcases = createTestCases(LOG_DIR)
    get_baddr = dict()
    for casename, kw in testcases.items():
        # -- yank the logfile name out of the dictionary, before calling func
        logfile = kw.pop('logfile')
        get_baddr[casename] = run_tool(logfile, **kw)

if __name__ == '__main__':
    main()

# -- end of file

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
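The pop-then-call pattern in main() above can be sketched as a complete, runnable miniature (createTestCases is replaced here by an inline dictionary, and run_tool is a hypothetical stand-in that just reports its arguments):

```python
import os

def add_log_filename(d, logdir):
    """Put a per-case logfile name into each test case's argument dict."""
    for casename, args in d.items():
        args['logfile'] = os.path.join(logdir, casename + '.log')
    return d

def run_tool(logfile, **kw):
    # Stand-in for the real tool; just report what would have run.
    return (logfile, sorted(kw.items()))

def run_all(testcases):
    results = {}
    for casename, kw in testcases.items():
        # -- yank the logfile name out of the dictionary, before calling
        logfile = kw.pop('logfile')
        results[casename] = run_tool(logfile, **kw)
    return results
```

Because run_tool takes **kw, each test case can carry a different number of arguments and the helper never needs to know about them.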
Re: how to optimize the below code with a helper function
Hello again,

>(1) Any tips how I can optimize this i.e test case, should have a
>helper function that all test cases call.
>
>(2) Also note that failure.run_tool function can have variable
>number of argments how to handle this in the helper function?

Here's a little example of how you could coerce your problem into a
ConfigParser-style configuration file.  With this example, I'd think
you could also see how to create a config section called [lin_02]
that contains the parameters you want for creating that object.
Then, it's a new problem to figure out how to refer to that object
for one of your tests.

Anyway, this is just another way of answering the question of "how
do I simplify this repetitive code".

Good luck and enjoy,

-Martin

#! /usr/bin/python

from __future__ import absolute_import, division, print_function

import os
import sys
import collections
from ConfigParser import SafeConfigParser as ConfigParser
import logging

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger(__name__)

LOG_DIR = '/var/log/frobnitz'

def readCfgTestCases(cfgfile):
    data = collections.defaultdict(dict)
    parser = ConfigParser()
    parser.read(cfgfile)
    for section in parser.sections():
        for name, value in parser.items(section):
            data[section][name] = value
    return data

def main(cfgfile):
    testdata = readCfgTestCases(cfgfile)
    for k, v in testdata.items():
        print(k, v)

if __name__ == '__main__':
    main(sys.argv[1])

# -- end of file

# -- config file
[test01]
offset = 18
size = 4
object = inode
optype = set

[test02]
# -- no way to capture lin=lin_02; must reproduce contents of lin_02
object = lin
offset = 100
size = 5
optype = set

[test100]
# -- no way to capture baddr=lin_02; must reproduce contents of lin_02
object = baddr
offset = 100
size = 5
optype = set

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
Re: Set type for datetime intervals
Greetings László,

>I need to compare sets of datetime intervals, and make set
>operations on them: intersect, union, difference etc.  One element
>of a set would be an interval like this:
>
>element ::= (start_point_in_time, end_point_in_time)
>intervalset ::= { element1, element2, }
>
>Operations on elements:
>
>element1.intersect(element2)
>element1.union(element2)
>element1.minus(element2)
>
>Operations on sets:
>
>intervalset1.intersect(intervalset2)
>intervalset1.union(intervalset2)
>intervalset1.minus(intervalset2)
>
>Does anyone know a library that already implements these functions?

Sorry to be late to the party--I applaud that you have already
crafted something to attack your problem.  When you first posted,
there was a library that was tickling my memory, but I could not
remember its (simple) name.  It occurred to me this morning, after
you posted your new library:

  https://pypi.python.org/pypi/intervaltree

This handles overlapping ranges nicely and provides some tools for
managing them.  Before posting this, I checked that it works with
datetime types, and, unsurprisingly, it does.

Happy trails!

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
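For a feel of the element-level operations László describes, without installing anything, here is a minimal stdlib-only sketch of intersect on (start, end) datetime pairs; the intervaltree package above does this and much more, including overlap queries across whole sets:

```python
from datetime import datetime

def intersect(a, b):
    """Return the overlap of two (start, end) intervals, or None if disjoint."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start < end else None

# Two overlapping example intervals
e1 = (datetime(2016, 1, 1), datetime(2016, 1, 10))
e2 = (datetime(2016, 1, 5), datetime(2016, 1, 20))
```

This works because datetime objects compare naturally with max() and min(); union and minus follow the same shape, with a little more case analysis.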
Re: Set type for datetime intervals
>> Sorry to be late to the party--I applaud that you have already
>> crafted something to attack your problem.  When you first posted,
>> there was a library that was tickling my memory, but I could not
>> remember its (simple) name.  It occurred to me this morning, after
>> you posted your new library:
>>
>> https://pypi.python.org/pypi/intervaltree
>>
>> This handles overlapping ranges nicely and provides some tools for
>> managing them.  Before posting this, I checked that it works with
>> datetime types, and, unsurprisingly, it does.
>
>Thank you! It is so much better than the one I have created.
>Possibly I'll delete my own module from pypi. :-)

I'm glad to have been able to help, László.  And, even if you don't
delete your new module, you have certainly stimulated quite a
discussion on the mailing list.

Best regards and have a good day!

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
Re: one-element tuples
Hello Fillmore,

> Here you go:
>
> >>> a = '"string1"'
> >>> b = '"string1","string2"'
> >>> c = '"string1","string2","string3"'
> >>> ea = eval(a)
> >>> eb = eval(b)
> >>> ec = eval(c)
> >>> type(ea)
> <class 'str'>        <--- HERE
> >>> type(eb)
> <class 'tuple'>
> >>> type(ec)
> <class 'tuple'>
>
> I can tell you that it exists because it bit me in the butt today...
>
> and mind you, I am not saying that this is wrong. I'm just saying
> that it surprised me.

Others in these two threads on your question have already identified
why the behaviour is as it is.  Below, I will add one question
(about eval) and one suggestion about how to circumvent the
behaviour you perceive as a language discontinuity.

#1: I would not choose eval() except when there is no other
solution.  If you don't need eval(), it may save you some headache
in the future, as well, to find an alternate way.  So, can we help
you choose something other than eval()?  What are you trying to do
with that usage?

#2: Yes, but, you can outsmart Python here!  Simply include a
terminal comma in each case, right?  In short, you can force the
consuming language (Python, because you are calling eval()) to
understand the string as a tuple of strings, rather than merely one
string.

  >>> a = '"string1",'
  >>> ea = eval(a)
  >>> len(ea), type(ea)
  (1, <class 'tuple'>)
  >>> b = '"string1","string2",'
  >>> eb = eval(b)
  >>> len(eb), type(eb)
  (2, <class 'tuple'>)
  >>> c = '"string1","string2","string3",'
  >>> ec = eval(c)
  >>> len(ec), type(ec)
  (3, <class 'tuple'>)

Good luck in your continuing Python explorations,

-Martin

P.S. Where do your double-quoted strings come from, anyway?

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
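One concrete alternative to eval() for point #1 above is ast.literal_eval from the standard library, which parses only literals (no arbitrary code execution) and exhibits the same trailing-comma behaviour:

```python
import ast

a = '"string1"'
a_tuple = '"string1",'          # the trailing comma forces a tuple
b = '"string1","string2"'

ea = ast.literal_eval(a)        # a plain string
et = ast.literal_eval(a_tuple)  # a one-element tuple
eb = ast.literal_eval(b)        # a two-element tuple
```

Unlike eval(), literal_eval raises ValueError on anything that is not a literal, so hostile input in those double-quoted strings cannot run code.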
Re: sys.exit(1) vs raise SystemExit vs raise
Hello all,

Apologies for this post, which is fundamentally a 'me too' post, but
I couldn't help but chime in here.

>This is good practice, putting the mainline code into a ‘main’
>function, and keeping the ‘if __name__ == '__main__'’ block small
>and obvious.
>
>What I prefer to do is to make the ‘main’ function accept the
>command-line arguments, and return the exit status for the program::
>
>    def main(argv):
>        exit_status = EXIT_STATUS_SUCCESS
>        try:
>            parse_command_line(argv)
>            setup_program()
>            run_program()
>        except SystemExit as exc:
>            exit_status = exc.code
>        except Exception as exc:
>            logging.exception(exc)
>            exit_status = EXIT_STATUS_ERROR
>        return exit_status
>
>    if __name__ == '__main__':
>        exit_status = main(sys.argv)
>        sys.exit(exit_status)
>
>That way, the ‘main’ function is testable like any other function:
>specify the command line arguments, and receive the exit status.
>But the rest of the code doesn't need to know that's happening.

This is only a riff or a variant of what Ben has written.  Here's
what I like to write:

    def run(argv):
        if program_runs_smoothly:
            return os.EX_OK
        else:
            # -- call logging, report to STDERR, or just raise an Exception
            return SOMETHING_ELSE

    def main():
        sys.exit(run(sys.argv[1:]))

    if __name__ == '__main__':
        main()

Why do I do this?

  * the Python program runs from CLI because [if __name__ == '__main__']
  * I can use main() as an entry point with setuptools
  * my unit testing code can pass any argv it wants to the function run()
  * the run() function never calls sys.exit(), so my tests can see
    what WOULD have been the process exit code

The only change from what Ben suggests is that, once I found
os.EX_OK, I just kept on using it, instead of defining my own
EXIT_SUCCESS in every program.

Clearly, in my above example the contents of the run() function look
strange.  Usually it contains considerably more.

Anyway, best of luck!

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
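A runnable miniature of the run()/main() split described above (the '--fail' flag is an invented placeholder for "the program did not run smoothly"):

```python
import os
import sys

def run(argv):
    """Do the real work; return what WOULD become the process exit code."""
    if argv and argv[0] == '--fail':   # hypothetical failure trigger
        return 1
    return os.EX_OK                    # 0 on POSIX systems

def main():
    # The only place sys.exit() ever appears.
    sys.exit(run(sys.argv[1:]))
```

Tests can call run() directly and inspect the would-be exit code, or call main() and catch the SystemExit it raises.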
Re: Looking for feedback on weighted voting algorithm
Greetings Justin,

>score = sum_of_votes/num_of_votes
>votes = [(72, 4), (96, 3), (48, 2), (53, 1), (26, 4), (31, 3),
>         (68, 2), (91, 1)]
>Specifically, I'm wondering if this is a good algorithm for
>weighted voting.  Essentially a vote is weighted by the number of
>votes it counts as.  I realize that this is an extremely simple
>algorithm, but I was wondering if anyone had suggestions on how to
>improve it.

I snipped most of your code.  I don't see anything wrong with your
overall approach.  I will make one suggestion: watch out for
ZeroDivisionError.

    try:
        score = sum_of_votes / num_of_votes
    except ZeroDivisionError:
        score = float('nan')

In your example data, all of the weights were integers, which means
that a simple mean function would work, as well, if you expanded the
votes to an alternate representation:

    votes = [72, 72, 72, 72, 96, 96, 96, 48, 48, 53,
             26, 26, 26, 26, 31, 31, 31, 68, 68, 91]

But, don't bother!  Your function can handle votes that have a float
weight:

    >>> weight([(4, 1.3), (1, 1),])
    2.695652173913044

Have fun!

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
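Putting the guard and the weighted sum together, a self-contained version of the weight() function used in the interactive example above might look like this:

```python
def weight(votes):
    """votes is a sequence of (value, weight) pairs; return the weighted mean."""
    sum_of_votes = sum(value * w for value, w in votes)
    num_of_votes = sum(w for value, w in votes)
    try:
        return sum_of_votes / num_of_votes
    except ZeroDivisionError:
        # No votes (or all weights zero): signal "no meaningful score".
        return float('nan')
```

The NaN return value makes "no votes at all" distinguishable from any legitimate score without raising.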
Re: How to track files processed
Greetings,

>If you are parsing files in a directory what is the best way to
>record which files were actioned?
>
>So that if i re-parse the directory i only parse the new files in
>the directory?

How will you know that the files are new?  If a file has exactly the
same content as another file, but a different name, is it new?
Often this depends on the characteristics of the system in which
your (planned) software is operating.

Peter Otten has also asked for some more context, which would help
us give you some tips that are more targeted to the problem you are
trying to solve.  But, I'll just forge ahead and make some
assumptions:

  * You are watching a directory for new/changed files.
  * New files are appearing regularly.
  * Contents of old files get updated and you want to know.

Have you ever seen an MD5SUMS file?  Do you know what a content hash
is?  You could find a place to store the content hash (a.k.a.
digest) of each file that you process.  Below is a program that
should work in Python2 and Python3.  You could use this sort of
approach as part of your solution.

In order to make sure you have handled a file before, you should
store and compare two things.

  1. The filename.
  2. The content hash.

Note: If you are sure the content is not going to change, then just
use the filename to track whether you have handled something or not.

How would you use this tracking info?

  * Create a dictionary (or a set), e.g.:

        handled = dict()
        handled[('410c35da37b9a25d9b5d701753b011e5', 'setup.py')] = time.time()

    Lasts only as long as the program runs.  But, you will know that
    you have handled any file by the tuple of its content hash and
    filename.

  * Store the filename (and/or digest) in a database.  So many
    options: sqlite, pickle, anydbm, text file of your own crafting,
    SQLAlchemy ...

  * Create a file, hardlink or symlink in the filesystem (in the
    same directory or another directory), e.g.:

        trackingfile = os.path.join('another-directory', 'setup.py')
        with open(trackingfile, 'w') as f:
            f.write('410c35da37b9a25d9b5d701753b011e5')

    OR

        os.symlink('setup.py', '410c35da37b9a25d9b5d701753b011e5-setup.py')

Now, you can also examine your little cache of handled files to
compare for when the content hash changes.

If the system is an automated system, then this can be perfectly
fine.  If humans create the files, I would suggest not doing this.
Humans tend to be easily confused by such things (and then want to
delete the files or just be intimidated by them; scary hashes!).

There are lots of options, but without some more context, we can
only make generic suggestions.  So, I'll stop with my generic
suggestions now.

Have fun and good luck!

-Martin

#! /usr/bin/python

from __future__ import print_function

import os
import sys
import logging
import hashlib

logformat = '%(levelname)-9s %(name)s %(filename)s#%(lineno)s ' \
    + '%(funcName)s %(message)s'
logging.basicConfig(stream=sys.stderr, format=logformat, level=logging.ERROR)
logger = logging.getLogger(__name__)

def hashthatfile(fname):
    contenthash = hashlib.md5()
    try:
        with open(fname, 'rb') as f:
            contenthash.update(f.read())
        return contenthash.hexdigest()
    except IOError as e:
        logger.warning("See exception below; skipping file %s", fname)
        logger.exception(e)
        return None

def main(dirname):
    for fname in os.listdir(dirname):
        if not os.path.isfile(fname):
            logger.debug("Skipping non-file %s", fname)
            continue
        logger.info("Found file %s", fname)
        digest = hashthatfile(fname)
        logger.info("Computed MD5 hash digest %s", digest)
        print('%s  %s' % (digest, fname,))
    return os.EX_OK

if __name__ == '__main__':
    if len(sys.argv) == 1:
        sys.exit(main(os.getcwd()))
    else:
        sys.exit(main(sys.argv[1]))

# -- end of file

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
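The dictionary-of-handled-files idea above can be sketched on its own, with byte strings standing in for file contents (seen_before is a hypothetical helper name, not part of any library):

```python
import hashlib
import time

def content_digest(data):
    """MD5 hex digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()

handled = {}

def seen_before(fname, data):
    """Record (digest, filename); report whether this exact pair was seen."""
    key = (content_digest(data), fname)
    if key in handled:
        return True
    handled[key] = time.time()   # remember when we first processed it
    return False
```

Because the key is the (digest, filename) tuple, a file whose content changes produces a new key and is treated as unhandled, which is exactly the "re-parse only what changed" behaviour asked about.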
manpage writing [rst, asciidoc, pod] was [Re: What should Python apps do when asked to show help?]
Hello,

>What is a good place where I can find out more about writing
>manpage?

I don't know of a single place where manpage authorship is
particularly documented.  This seems to be one of the common target
links.  In addition to introducing the breakdown of manpages by type
(section) and providing some suggestions for content, it introduces
the *roff markup:

  http://www.schweikhardt.net/man_page_howto.html

It's been many years since I have written that stuff directly.  I
prefer one of the lightweight, general documentation or markup
languages.  So, below, I'll mention and give examples for creating
manpages from reStructuredText, AsciiDoc and Plain Old
Documentation.

With the reStructuredText format [0] [1], you can convert an .rst
file to a manpage using two different document processors; you can
use sphinx-build from the sphinx project [2] or rst2man from the
docutils project.  The outputs are largely the same (as far as I
can tell).

There's also the AsciiDoc [3] format, which is near to text and
reads like text, but has a clear structure.  With the tooling
(written in Python), you can produce docbook, latex, html and a
bunch of other output formats.  Oh, and manpages [4], too.  There
is a tool called 'asciidoc' which processes AsciiDoc formats into a
variety of backend formats.  The 'a2x' tool converts AsciiDoc
sources into some other (x) desired output.

If you don't like .rst or AsciiDoc, there's also the Plain Old
Documentation (POD) format.  This is the oldest tool (of which I'm
aware) other than the direct *roff processing tools.  You run
'pod2man' (written in Perl) on your .pod file.  POD is another dead
simple documentation language, supported by the pod2man [5] tool.
For more on the format, read also 'man 1 perlpod'.

sphinx-build: the sphinx documentation system is geared for handling
project-scoped documentation and provides many additional features
to reStructuredText.  It can produce all kinds of output formats:
HTML single-page, help, multipage, texinfo, latex, text, epub and,
oh yeah, manpages.  It's a rich set of tools.  If you wish to use
sphinx, I can give you an example .rst file [6] which I recently
wrote.  When processing docs with sphinx, a 'conf.py' file is
required; it can be generated with an ancillary tool from the sphinx
suite.  I know that I always find an example helpful, so here are
some examples to help you launch.

    mkdir sampledir && cd sampledir
    sphinx-quickstart    # -- and answer a bunch of questions
    # -- examine conf.py and adjust to your heart's content
    #    confirm that master_doc is your single document for a manpage
    #    confirm that there's an entry for your document in man_pages
    sphinx-build -b man -d _build/doctrees . _build/man
    # -- or grab the files from my recent project [6] and try yourself

rst2man: even more simply, if you don't need the kitchen sink...

    wget https://gitlab.com/pdftools/pdfposter/raw/develop/pdfposter.rst
    rst2man < pdfposter.rst > pdfposter.1
    # -- will complain about this, but still produces a manpage
    #    <stdin>:10: (ERROR/3) Undefined substitution referenced: "VERSION".
    man ./pdfposter.1

asciidoc (a randomly selected example asciidoc file [7]):

    wget https://raw.githubusercontent.com/DavidGamba/grepp/master/grepp.adoc
    a2x -f manpage grepp.adoc
    man ./grepp.1

perlpod:

    wget https://api.metacpan.org/source/RJBS/perl-5.18.1/pod/perlrun.pod
    pod2man --section 1 < perlrun.pod > perlrun.1
    man ./perlrun.1

I know there are other tools for generating manpages: the original
*roff tools, visual manpage editors, DocBook, help2man, manpage
generators from argparse.ArgumentParser instances, and more.

And, of course, make sure to use version control for your
documentation.  These git manpages may be helpful for the
uninitiated (joke, joke):

  https://git-man-page-generator.lokaltog.net/   # -- humour!

Good luck,

-Martin

 [0] http://docutils.sourceforge.net/docs/user/rst/quickref.html
 [1] http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
 [2] http://www.sphinx-doc.org/en/stable/rest.html
 [3] http://www.methods.co.nz/asciidoc/
 [4] http://www.methods.co.nz/asciidoc/chunked/ch24.html
 [5] http://perldoc.perl.org/pod2man.html
 [6] https://raw.githubusercontent.com/tLDP/python-tldp/master/docs/ldptool-man.rst
 [7] https://raw.githubusercontent.com/DavidGamba/grepp/master/grepp.adoc

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
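To make the rst2man path concrete, here is a minimal, untested .rst skeleton of the shape rst2man accepts; the program name, description and author are invented placeholders:

```rst
frobnitz
========

----------------------------
frob the nitz from the shell
----------------------------

:Author: jane@example.org
:Date: 2016-05-01
:Manual section: 1

SYNOPSIS
========

``frobnitz [options] FILE``

DESCRIPTION
===========

``frobnitz`` does one small thing well.
```

The title becomes the manpage name, the subtitle becomes the one-line description, and the ``Manual section`` docinfo field selects which man section the page claims.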
Re: redirecting stdout and stderr to /dev/null
Hello there,

>I'm new to python but well versed on other languages such as C and
>Perl.
>
>I'm having problems redirecting stdout and stderr to /dev/null in a
>program that does a fork and exec.  I found this method googling
>around and it is quite elegant compared to the Perl version.
>
>So to isolate things I made a much shorter test program and it
>still is not redirecting.  What am I doing wrong?
>
>test program test.py
>- cut here ---
>import sys
>import os
>
>f = open(os.devnull, 'w')
>sys.stdout = f
>sys.stderr = f
>os.execl("/bin/ping", "", "-w", "20", "192.168.1.1");
>-- cut here ---

Think about the file descriptors.  Unix doesn't care what the name
is, rather that the process inherits the FDs from the parent.  So,
your solution might need to be a bit more complicated to achieve
what you desire.  Run the following to see what I mean.

    realstdout = sys.stdout
    realstderr = sys.stderr

    f = open(os.devnull, 'w')
    sys.stdout = f
    sys.stderr = f

    print("realstdout FD: %d" % (realstdout.fileno(),), file=realstdout)
    print("realstderr FD: %d" % (realstderr.fileno(),), file=realstdout)
    print("sys.stdout FD: %d" % (sys.stdout.fileno(),), file=realstdout)
    print("sys.stderr FD: %d" % (sys.stderr.fileno(),), file=realstdout)

That should produce output that looks like this:

    realstdout FD: 1
    realstderr FD: 2
    sys.stdout FD: 3
    sys.stderr FD: 3

I hope that's a good hint... I like the idea of simply calling the
next program using one of the exec() variants, but you'll have to
adjust the file descriptors, rather than just the names used by
Python.

If you don't need to exec(), but just run a child, then here's the
next hint (this is for Python 3.5):

    import subprocess
    cmd = ["ping", "-w", "20", "192.168.1.1"]
    devnull = subprocess.DEVNULL
    proc = subprocess.run(cmd, stdout=devnull, stderr=devnull)
    proc.check_returncode()

(By the way, your "ping" command looked like it had an empty token
in the second arg position.  Looked weird to me, so I removed it in
my examples.)

For subprocess.run, see:

  https://docs.python.org/3/library/subprocess.html#subprocess.run

For earlier Python versions without run(), you can use Popen():

    import subprocess
    cmd = ["/bin/ping", "-w", "20", "192.168.1.1"]
    devnull = subprocess.DEVNULL
    proc = subprocess.Popen(cmd, stdout=devnull, stderr=devnull)
    retcode = proc.wait()
    if retcode != 0:
        raise FlamingHorribleDeath

You will have to define FlamingHorribleDeath or figure out what you
want to do in the event of the various different types of failure.
If you don't, then you'll just see this:

    NameError: name 'FlamingHorribleDeath' is not defined

Good luck,

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
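For the exec() route mentioned above, the descriptor-level adjustment is os.dup2(); here is a sketch that redirects file descriptor 1 to a temporary file (instead of /dev/null, so the effect is visible) and then restores it:

```python
import os
import tempfile

def redirect_fd1_to(path):
    """Point FD 1 (stdout) at path; return a saved duplicate of the old FD 1."""
    saved = os.dup(1)  # keep the real stdout alive under another number
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.dup2(fd, 1)     # now FD 1 *is* the file; an exec()d child inherits this
    os.close(fd)
    return saved

def restore_fd1(saved):
    os.dup2(saved, 1)
    os.close(saved)

tmp = tempfile.NamedTemporaryFile(delete=False)
saved = redirect_fd1_to(tmp.name)
os.write(1, b'hello from fd 1\n')  # anything writing to FD 1 lands in the file
restore_fd1(saved)
```

With os.devnull in place of the temporary file, and the same dance for FD 2, the os.execl() in the original test program would genuinely run silently, because the kernel-level descriptors (not just Python's sys.stdout name) point at /dev/null.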
Re: Average calculation Program *need help*
Greetings kobsx4,

>Hello all, I have been struggling with this code for 3 hours now
>and I'm still stumped.  My problem is that when I run the following
>code:
>--
>    #this function will get the total scores
>    def getScores(totalScores, number):
>        for counter in range(0, number):
>            score = input('Enter their score: ')
>            totalScores = totalScores + score
>
>            while not (score >= 0 and score <= 100):
>                print "Your score must be between 0 and 100."
>                score = input('Enter their score: ')
>
>        return totalScores
>--
>the program is supposed to find the average of two test scores and
>if one of the scores is out of the score range (0-100), an error
>message is displayed.  The main problem with this is that when
>someone types in a number outside of the range, it'll ask them to
>enter two scores again, but ends up adding all of the scores
>together (including the invalid ones) and dividing by how many
>there are.  Please help.

Suggestion #1: When you are stuck on a small piece of code, set it
aside (stop looking at it) and start over again; sometimes rewriting
with different variable names and a clean slate helps to highlight
the problem.  Professional programmers will tell you that they are
quite accustomed to 'throwing away' code.  Don't be afraid to do it.
(While you are still learning, you might want to keep the old chunk
of code around to examine so that you can maybe figure out what you
did wrong.)

Suggestion #2: Put a print statement or two in the loop, so that you
see how your variables are changing.  For example, just before your
'while' line, maybe something like:

    print "score=%d totalScores=%d" % (score, totalScores,)

Suggestion #3: Break the problem down even smaller.  (Rustom Mody
appears to have beat me to the punch on that suggestion, so I'll
just point to his email.)

Hint #1: What is the value of your variable totalScores each time
through the loop?  Does it ever get reset?

Good luck with your debugging!

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
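Following the suggestions and hint above, one possible restructuring validates each score before it ever reaches the total; the scores are passed in as a sequence, so no input() calls are needed and the function is easy to test.  (This sketch skips invalid scores rather than re-prompting, which is one of several reasonable designs, and is not the original poster's exact assignment.)

```python
def average_scores(scores):
    """Average the valid scores, rejecting any outside 0..100."""
    total = 0
    counted = 0
    for score in scores:
        if 0 <= score <= 100:
            total = total + score   # only validated scores reach the total
            counted = counted + 1
        else:
            print("Your score must be between 0 and 100.")
    return total / counted if counted else None
```

The key difference from the original getScores is the ordering: validate first, accumulate second, so an out-of-range entry can never contaminate the sum.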
Re: raise None
Hi there,

>>> At worst it actively misleads the user into thinking that there
>>> is a bug in _validate.

Is this "user" a software user or another programmer?  If a software
user, then some hint about why _validate found unacceptable data
might benefit the user's ability to adjust inputs to the program.
If another programmer, then that person should be able to figure it
out with the full trace.

Probably it's not a bug in _validate, but it could be.  So, it could
be a disservice to the diagnostician to exempt the _validate
function from suspicion.  Thus, I'd want to see _validate in the
stack trace.

>Maybe.  As I have suggested a number of times now, I'm aware that
>this is just a marginal issue.
>
>But I think it is a real issue.  I believe in beautiful tracebacks
>that give you just the right amount of information, neither too
>little nor too much.  Debugging is hard enough without being given
>more information than you need and having to decide what bits to
>ignore and which are important.

I agree about tracebacks that provide the right amount of
information.  If I were a programmer working with the code you are
describing, I would like to know in any traceback that the failed
comparisons (which implement some sort of business logic or sanity
checking) occurred in the _validate function.  In any software
system beyond the simplest, code/data tracing would be required to
figure out where the bad data originated.

Since Python allows us to provide ancillary text to any exception,
you could always provide a fuller explanation of the validation
failure.  And, while you are at it, you could add the calling
function name to the text to point the programmer faster toward the
probable issue.  Adding one optional parameter to _validate
(defaulting to the caller's function name) would allow you to point
the way for a diagnostician.

Here's a _validate function I made up with two silly comparison
tests--where a must be smaller than b and both a and b must not be
convertible to integers.

    def _validate(a, b, func=None):
        if not func:
            func = sys._getframe(1).f_code.co_name
        if a >= b:
            raise ValueError("a cannot be larger than b in " + func)
        if a == int(a) or b == int(b):
            raise TypeError("a, b must not be convertible to int in " + func)

My main point is less about identifying the calling function or its
calling function, but rather to observe that arbitrary text can be
used.  This should help the poor sap (who is, invariably, diagnosing
the problem at 03:00) realize that the function _validate is not the
problem.

>The principle is that errors should be raised as close to their
>cause as possible.  If I call spam(a, b) and provide bad arguments,
>the earliest I can possibly detect that is in spam.  (Only spam
>knows what it accepts as arguments.)  Any additional levels beyond
>spam (like _validate) is moving further away:
>
>  File "spam", line 19, in this
>  File "spam", line 29, in that      <--- where the error really lies
>  File "spam", line 39, in other
>  File "spam", line 89, in spam      <--- the first place we could detect it
>  File "spam", line 5, in _validate  <--- where we actually detect it

Yes, indeed!  Our stock in trade.

I never liked function 'that'.  I much prefer function 'this'.

-Martin

Q: Who is Snow White's brother?
A: Egg white.  Get the yolk?

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
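Here is the same caller-naming idea in a fully runnable form; note that sys._getframe is a CPython implementation detail, and the spam wrapper is an invented caller added purely for demonstration:

```python
import sys

def _validate(a, b, func=None):
    """Raise with the *caller's* name in the message, so the traceback
    points suspicion at the right function."""
    if func is None:
        func = sys._getframe(1).f_code.co_name  # name of the calling function
    if a >= b:
        raise ValueError("a cannot be larger than b in " + func)

def spam(a, b):
    _validate(a, b)
    return b - a
```

When spam(5, 2) fails, the ValueError text names spam, not _validate, which is exactly the hint the 03:00 diagnostician needs.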
Re: Help on return type(?)
Hello there,

>> def make_cov(cov_type, n_comp, n_fea):
>>     mincv = 0.1
>>     rand = np.random.random
>>     return {
>>         'spherical': (mincv + mincv * np.dot(rand((n_components, 1)),
>>                                              np.ones((1, n_features)))) ** 2,
>>         'tied': (make_spd_matrix(n_features)
>>                  + mincv * np.eye(n_features)),
>>         'diag': (mincv + mincv * rand((n_components, n_features))) ** 2,
>>         'full': np.array([(make_spd_matrix(n_features)
>>                            + mincv * np.eye(n_features))
>>                           for x in range(n_components)])
>>     }[cov_type]
>>
>> Specifically, could you explain the meaning of
>>
>>     {
>>     ...}[cov_type]
>>
>> to me?
>
>It is a dictionary lookup.  { ... } sets up a dictionary with keys
>
>    'spherical'
>    'tied'
>    'diag'
>    'full'
>
>then { ... }[cov_type] extracts one of the values depending on
>whether cov_type is 'spherical', 'tied', 'diag', or 'full'.

You will see that Steven has answered your question.  I will add to
his answer.  Your original function could be improved many ways, but
especially in terms of readability.  Here's how I might go at
improving the readability, without understanding anything about the
actual computation.

    def make_cov_spherical(mincv, n_components, n_features):
        return (mincv + mincv * np.dot(np.random.random((n_components, 1)),
                                       np.ones((1, n_features)))) ** 2

    def make_cov_diag(mincv, n_components, n_features):
        return (mincv + mincv
                * np.random.random((n_components, n_features))) ** 2

    def make_cov_tied(mincv, n_components, n_features):
        return make_spd_matrix(n_features) + mincv * np.eye(n_features)

    def make_cov_full(mincv, n_components, n_features):
        return np.array([(make_spd_matrix(n_features)
                          + mincv * np.eye(n_features))
                         for x in range(n_components)])

    def make_cov(cov_type, n_comp, n_fea):
        mincv = 0.1
        dispatch_table = {
            'spherical': make_cov_spherical,
            'tied': make_cov_tied,
            'diag': make_cov_diag,
            'full': make_cov_full,
        }
        func = dispatch_table[cov_type]
        return func(mincv, n_comp, n_fea)

Some thoughts (and reaction to the prior code):

  * Your originally posted code referred to n_comp and n_fea in the
    signature, but then used n_components and n_features in the
    processing lines.  Did this function ever work?

  * Individual functions are easier to read and understand.  I would
    find it easier to write testing code (and docstrings) for these
    functions, also.

  * The assignment of a name (rand = np.random.random) can make
    sense, but I think it simply shows that the original function
    was trying to do too many things and was hoping to save space
    with this shorter name for np.random.random.  Not bad, but I
    dropped it anyway for the sake of clarity.

  * Each of the above functions (which I copied nearly verbatim)
    could probably now be broken into one or two lines.  That would
    make the computation even clearer.

  * There may be a better way to make a function dispatch table than
    the one I demonstrate above, but I think it makes the point
    nicely.

  * If you break the individual computations into functions, then
    you only run the specific computation when it's needed.  In the
    original example, all of the computations were run AND then one
    of the results was selected.  It may not matter, since computers
    are so fast, but practicing basic parsimony can avoid obvious
    little performance hazards like this.

  * In short: longer, but much, much clearer.

Good luck,

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
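The dispatch-table shape is useful well beyond numpy; stripped to its essentials (with toy operations invented purely for illustration), it looks like this:

```python
def make_spherical(n):
    return ['sph'] * n

def make_diag(n):
    return ['diag'] * n

# Map the selector string to the function -- functions are first-class
# values, so nothing runs until we call the one we picked.
DISPATCH = {
    'spherical': make_spherical,
    'diag': make_diag,
}

def make_cov(cov_type, n):
    try:
        func = DISPATCH[cov_type]   # only the chosen function ever runs
    except KeyError:
        raise ValueError('unknown cov_type: %r' % (cov_type,))
    return func(n)
```

Compare with the original dict-of-results: there, every value expression was evaluated before the lookup discarded all but one; here, the lookup happens first and only the selected computation runs.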
Re: ignoring or replacing white lines in a diff
Hello Adriaan,

>Maybe someone here has a clue what is going wrong here?  Any help is
>appreciated.

Have you tried out this tool, which seems to do precisely what you
are trying to do yourself?

  https://pypi.python.org/pypi/xmldiff

I can't vouch specifically for it, as I am simply a user, but I know
that I have used it happily in the past.  (Other CLI tools,
including non-Python tools such as xmllint, can produce a
predictable, reproducible XML formatting, too.)

>I'm writing a regression test for a module that generates XML.

Very good.  Good == Testing.

>I'm using diff to compare the results with a pregenerated one from
>an earlier version.

[ Interesting.  I can only speculate randomly about the whitespace
issue.  Have you examined (with the CLI tools hexdump, od or your
favorite byte dumper) the two different XML outputs? ]

Back to the lands of Python ...

> cmd = ["diff", "-w", "-I '^[[:space:]]*$'", "./xml/%s.xml" % name,
>        "test.xml"]

It looks like a quoting issue.  I think you are passing the
following tokens to your OS.  You should be able to run your Python
program under a system call tracer to see what is actually getting
exec()d.  I'm accustomed to using strace, but it seems that
Macintosh uses dtruss.  Anyway, I think your cmd is turning into
this (as far as your kernel is concerned):

    token 1: diff
    token 2: -w
    token 3: -I '^[[:space:]]*$'
    token 4: ./xml/name.xml
    token 5: test.xml

Try this (untested):

    cmd = ["diff", "-w", "-I", "^[[:space:]]*$", "./xml/%s.xml" % name,
           "test.xml"]

But, perhaps the xmldiff module will be what you want.

-Martin

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
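The token-splitting point can be made visible with the stdlib's shlex, which splits a command line the way a POSIX shell would; note how the single quotes are consumed and the pattern arrives as one clean token:

```python
import shlex

# What you would type at a shell prompt:
shell_style = "diff -w -I '^[[:space:]]*$' a.xml b.xml"

# shlex.split performs the shell's quoting rules, stripping the single
# quotes and yielding one argv entry per token.
tokens = shlex.split(shell_style)
```

Each element of a subprocess argument list reaches the kernel as exactly one argv entry, so "-I" and its pattern must be two separate list items; there is no shell between Python and exec() to peel the quotes off for you.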
Re: psss...I want to move from Perl to Python
Hello,

>http://www.barnesandnoble.com/w/perl-to-python-migration-martin-c-brown/1004847881?ean=9780201734881
>
>Given that this was published in 2001, surely it is time for a
>second edition.

How many times do you think somebody migrates from Perl to
Python?!  ;)

-Martin

P.S. I was amused when I first discovered (about 15 years ago)
Martin C. Brown, an author of Perl books.  I am also amused to
discover that he has written one on Python.  Too many of us chaps
named 'Martin Brown'.

  https://en.wikipedia.org/wiki/Radio_Active_(radio_series)
  "the incompetent hospital-radio trained Martin Brown (Stevens)"

P.P.S. In case it is not utterly clear, I am not the above author.

--
Martin A. Brown
http://linux-ip.net/
--
https://mail.python.org/mailman/listinfo/python-list
Re: Exception handling for socket.error in Python 3.5/RStudio
>except socket.error as e >line 53 except socket.error as e ^ SyntaxError: invalid syntax > >I tried changing socket.error to ConnectionRefusedError. and still >got the same error. >Please tell me if the problem is with Rstudio, Python version or >the syntax. Syntax. Your code has, unfortunately, suffered a colonectomy. When you transplant a colon, it is more likely to function properly again. For example: except socket.error as e: Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Exception handling for socket.error in Python 3.5/RStudio
Hi there Shaunak, I saw your few replies to my (and Nathan's) quick identification of the syntax error. More comments follow, here. >I am running this python script on R-studio. I have Python 3.5 installed on my >system. > >count = 10 >while (count > 0): >try : ># read line from file: >print(file.readline()) ># parse >parse_json(file.readline()) >count = count - 1 >except socket.error as e >print('Connection fail', e) >print(traceback.format_exc()) > ># wait for user input to end ># input("\n Press Enter to exit..."); ># close the SSLSocket, will also close the underlying socket >ssl_sock.close() > >The error I am getting is here: > >line 53 except socket.error as e ^ SyntaxError: invalid syntax > >I tried changing socket.error to ConnectionRefusedError. and still got the >same error. We were assuming that line 53 in your file is the part you pasted above. That clearly shows a syntax error (the missing colon). If, after fixing that error, you are still seeing errors, then the probable explanations are: * you are not executing the same file you are editing * there is a separate syntax error elsewhere in the file (you sent us only a fragment) Additional points: * While the word 'file' is not special in Python 3.x, it is a builtin name in Python 2.x, so just be careful when working with older Python versions. You could always change your variable name, but you do not need to. * When you catch the error in the above, you print the traceback information, but your loop will continue. Is that what you desired? I might suggest saving your work carefully and making sure that you are running the same code that you are working on. Then, if you are still experiencing syntax errors, study the lines that the interpreter is complaining about. And, of course, send the list an email. Best of luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
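One way to convince yourself the interpreter really is parsing the file you think it is: hand the suspect fragment to the builtin compile(). This is a generic sketch, not the poster's actual code:

```python
# A stand-in fragment reproducing the reported error: the except line
# below is missing its trailing colon.
bad_fragment = (
    "try:\n"
    "    pass\n"
    "except ValueError as e\n"
    "    pass\n"
)
try:
    compile(bad_fragment, "<fragment>", "exec")
    result = "compiled fine"
except SyntaxError as err:
    # err.lineno points at the offending line within the fragment
    result = "SyntaxError on line %s" % err.lineno
print(result)
```

Running `python -m py_compile yourscript.py` from a shell performs the same check on a whole file, independently of any IDE.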
Re: Exception handling for socket.error in Python 3.5/RStudio
Hi there, >Thanks for the detailed reply. I edited, saved and opened the file >again. Still I am getting exactly the same error. > >Putting bigger chunk of code and the error again: [snipped; thanks for the larger chunk] >Error: >except socket.error as e: > ^ >SyntaxError: invalid syntax I ran your code. I see this: $ python3 shaunak.bangale.py Connecting... Connection succeeded Traceback (most recent call last): File "shaunak.bangale.py", line 23, in <module> ssl_sock.write(bytes(initiation_command, 'UTF-8')) NameError: name 'initiation_command' is not defined Strictly speaking, I don't think you are having a Python problem. * Are you absolutely certain you are (or your IDE is) executing the same code you are writing? * How would you be able to tell? Close your IDE. Run the code on the command-line. * How much time have you taken to work out what the interpreter is telling you? Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Suggested datatype for getting latest information from log files
;: ") pprint.pprint(marblehistory) if __name__ == '__main__': import sys if len(sys.argv) > 1: count = int(sys.argv[1]) else: count = 30 marblegame(count) # -- end of file -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Make a unique filesystem path, without creating the file
Good evening/morning Ben, >> > I am unconcerned with whether there is a real filesystem entry of >> > that name; the goal entails having no filesystem activity for this. >> > I want a valid unique filesystem path, without touching the >> > filesystem. >> >> Your phrasing is ambiguous. > >The existing behaviour of ‘tempfile.mktemp’ – actually of its >internal class ‘tempfile._RandomNameSequence’ – is to generate >unpredictable, unique, valid filesystem paths that are different >each time. > >That's the behaviour I want, in a public API that exposes what >‘tempfile’ already has implemented, documented in a way that >doesn't create a scare about security. If your code is not actually touching the filesystem, then it will not be affected by the race condition identified in the tempfile.mktemp() warning anyway. So, I'm unsure of your worry. >> But if you explain in more detail why you want this filename, perhaps >> we can come up with some ideas that will help. > >The behaviour is already implemented in the standard library. What >I'm looking for is a way to use it (not re-implement it) that is >public API and isn't scolded by the library documentation. I might also suggest the (bound) method _create_tmp() on class mailbox.Maildir, which achieves roughly the same goals, but for a permanent file. Of course, that particular method also touches the filesystem. The Maildir naming approach is based on the assumptions* that time is monotonically increasing, that system nodes never share the same name and that you don't need more than 1 uniquely named file per directory per millisecond. If so, then you can use the 9 or 10 lines of that method. Good luck, -Martin * I was tempted to joke about these guarantees, but I think that undermines my basic message. To wit, you can probably rely on this naming technique about as much as you can rely on your system clock. I'll assume that you aren't naming all of your nodes 'franklin.p.gundersnip'. -- Martin A. 
Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
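For the curious, here is a rough sketch in the spirit of the Maildir naming scheme that never touches the filesystem. This is not the stdlib _create_tmp() itself, and the exact format string is invented for illustration:

```python
import os
import socket
import time
from itertools import count

# Per-process sequence number guarantees uniqueness within one process,
# even for calls landing in the same clock tick.
_sequence = count()

def unique_name():
    """Build a Maildir-flavoured unique name: seconds, microseconds,
    pid, sequence number and hostname -- no filesystem access."""
    now = time.time()
    return "%d.M%06dP%dQ%d.%s" % (
        now,                 # whole seconds since the epoch
        (now % 1) * 1e6,     # microsecond fraction
        os.getpid(),
        next(_sequence),
        socket.gethostname(),
    )

a, b = unique_name(), unique_name()
print(a != b)   # the sequence number makes consecutive names distinct
```

The usual Maildir caveats apply: a hostname containing dots weakens the "split on dot" parsing convention, and cross-host uniqueness rests on hostnames being distinct.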
Re: asyncio - run coroutine in the background
Hello there, I realize that this discussion of supporting asynchronous name lookup requests in DNS is merely a detour in this thread on asyncio, but I couldn't resist mentioning an existing tool. >> getaddrinfo is a notorious pain but I think it's just a library >> issue; an async version should be possible in principle. How >> does Twisted handle it? Does it have a version? > >In a (non-Python) program of mine, I got annoyed by synchronous >name lookups, so I hacked around it: instead of using the regular >library functions, I just do a DNS lookup directly (which can then >be event-based - send a UDP packet, get notified when a UDP packet >arrives). Downside: Ignores /etc/nsswitch.conf and /etc/hosts, and >goes straight to the name server. Upside: Is able to do its own >caching, since the DNS library gives me the TTLs, but >gethostbyname/getaddrinfo won't. Another (non-Python) DNS name lookup library that does practically the same thing (along with the shortcomings you mentioned, Chris: no NSS nor /etc/hosts) is the adns library. Well, it is DNS, after all. http://www.gnu.org/software/adns/ https://pypi.python.org/pypi/adns-python/1.2.1 And, there are Python bindings. I have been quite happy using the adns tools (and tools built on the Python bindings) for mass lookups (millions of DNS names). It works very nicely. Just sharing knowledge of an existing tool, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
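For completeness on the asyncio side: the event loop exposes an awaitable getaddrinfo() wrapper. It still calls the blocking libc resolver, but in the loop's default executor, so NSS and /etc/hosts are honoured and the loop itself is never blocked. A minimal sketch:

```python
import asyncio
import socket

async def lookup(host, port=80):
    loop = asyncio.get_running_loop()
    # Runs the blocking getaddrinfo() in a thread pool; the coroutine
    # suspends until the resolver thread delivers the result.
    return await loop.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)

infos = asyncio.run(lookup('localhost'))
print(infos[0][4])   # first sockaddr tuple, e.g. ('127.0.0.1', 80)
```

This is thread-pool offloading rather than true event-driven DNS, which is exactly the distinction the adns-style approach addresses.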
Re: Network Simulator
>Hi...I need help to design a network simulator consisting for 5 >routers in python...Any help would be appretiated... Have you looked at existing network simulators? On two different ends of the spectrum are: Switchyard, a small network simulator intended for pedagogy https://github.com/jsommers/switchyard NS-3, the researcher's toolkit https://www.nsnam.org/ https://www.nsnam.org/wiki/Python_bindings Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: tcp networking question (CLOSE_WAIT)
>I'm new to python networking. I am waiting TCP server/client app by >using python built-in SocketServer. My problem is if client get >killed, then the tcp port will never get released, in CLOSE_WAIT I did not thoroughly review your code (other than to see that you are not using SO_REUSEADDR). This is the most likely problem. Suggestion: man 7 socket Look for SO_REUSEADDR. Then, apply what you have learned to your code. -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
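For anyone following along, setting the option looks roughly like this; it must happen after socket creation and before bind():

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow rebinding the address while old connections linger in TIME_WAIT.
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('127.0.0.1', 0))   # port 0: let the kernel pick any free port

reuse = srv.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
port = srv.getsockname()[1]
srv.close()
print(reuse, port)
```

(As the follow-up message in this thread notes, SO_REUSEADDR turned out not to be the poster's actual problem; the snippet is included only because the option is worth knowing for any long-running server.)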
Re: tcp networking question (CLOSE_WAIT)
Hello again Ray, >> >I'm new to python networking. I am waiting TCP server/client app by >> >using python built-in SocketServer. My problem is if client get >> >killed, then the tcp port will never get released, in CLOSE_WAIT >> >> I did not thoroughly review your code (other than to see that you >> are not using SO_REUSEADDR). This is the most likely problem. >> >> Suggestion: >> >> man 7 socket >> >> Look for SO_REUSEADDR. Then, apply what you have learned to your >> code. > >it's not I can't bind the address, my problem is: server is long >run. if client die without "disconnect" then server will leak one >socket. Sorry for my trigger-happy, and incorrect reply. After so many years, I should know better than to reply without completely processing questions. Apologies. >by using the built-in thread socket server. the extra tcp port are >opened by built-in class itself. if the handler() is finish >correctly (the line with break) then this socket will get cleaned >up. but if client dies, then I am never get out from that True >loop. so the socket will keep in close_wait > >I fond the issue. it's my own stupid issue. >i did "continue" if no data received. >just break from it then it will be fine Well, I'm glad you found the issue. Best of luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: common mistakes in this simple program
tCatchTVError('str') 42 >>> altCatchTVError(dict()) -42 Interlude and recommendation As you can see, there are many possible Exceptions that can be raised when you are calling a simple builtin function, int(). Consider now what may happen when you call out to a different program; you indicated that your run() function calls out to subprocess.Popen(). There are many more possible errors that can occur, just a few that can come to my mind: * locating the program on disk * setting up the file descriptors for the child process * fork()ing and exec()ing the program * memory issues * filesystem disappears (network goes away or block device failure) Each one of these possible errors may translate to a different exception. You have been tempted to do: try: run() except: pass This means that, no matter what happens, you are going to try to keep continuing, even in the face of massive failure. To (#1) improve the safety of your program and the environments in which it operates, to (#2) improve your defensive programming posture and to (#3) avoid frustrating your own debugging at some point in the future, you would be well-advised to identify which specific exceptions you want to ignore. As you first try to improve the resilience of your program, you may not be certain which exceptions you want to catch and which represent a roadblock for your progam. This is something that usually comes with experience. To get that experience you can define your own exception (it'll never get raised unless you raise it, so do not worry). Then, create your try-except block to catch only that one. 
As you encounter other exceptions that you are certain you wish to handle, you can do something with them: class UnknownException(Exception): pass def prep_host(): """ Prepare clustering """ for cmd in ["ls -al", "touch /tmp/file1", "mkdir /tmp/dir1"]: try: if not run_cmd_and_verify(cmd, timeout=3600): logging.info("Preparing cluster failed ...") return False except (UnknownException,): pass logging.info("Preparing Cluster.Done !!!") return True Now, as you develop your program and encounter new exceptions, you can add new except clauses to the above block with appropriate handling, or (re-)raising the caught exception. Comments on shelling out to other programs and using exceptions --- Exceptions are great for catching logic errors, type errors, filesystem errors and all manner of other errors within Python programs and runtime environments. You introduce significant complexity the moment you fork a child (calling subprocess.Popen). It is good, though, that you are testing the return code of the cmd that you pass to the run() function. Final advice: - Do not use a bare try-except. You will frustrate your own debugging and your software may end up trying to execute code paths (or external programs, as you are doing right now) that your sanity checks were meant to guard against. -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
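A hedged sketch of catching only the failures one understands around a subprocess call; run_cmd and the commands here are illustrative stand-ins, not the poster's run_cmd_and_verify():

```python
import subprocess

def run_cmd(cmd, timeout=10):
    """Return True when cmd exits 0; swallow only the failures we expect."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return result.returncode == 0
    except FileNotFoundError:            # program not found on disk
        return False
    except subprocess.TimeoutExpired:    # child ran longer than allowed
        return False

# 'true' is the POSIX always-succeed command; the second name is bogus.
print(run_cmd(['true']), run_cmd(['surely-not-a-real-program-xyzzy']))
```

Any exception not listed (MemoryError, a failing filesystem, a genuine bug) still propagates, which is the point: the program stops instead of marching on blindly.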
Re: Caching function results
Greetings Pavel, > Suppose, I have some resource-intensive tasks implemented as > functions in Python. Those are called repeatedly in my program. > It's guranteed that a call with the same arguments always produces > the same return value. I want to cache the arguments and return > values and in case of repititive call immediately return the > result without doing expensive calculations. Great problem description. Thank you for being so clear. [I snipped sample code...] This is generically called memoization. > Do you like this design or maybe there's a better way with > Python's included batteries? In Python, there's an implementation available for you in the functools module. It's called lru_cache. LRU means 'Least Recently Used'. > I'd also like to limit the size of the cache (in MB) and get rid > of old cached data. Don't know how yet. You can also limit the size of the lru_cache provided by the functools module. For this function, the size is calculated by number of entries--so you will need to figure out memory size to cache entry count. Maybe others who have used functools.lru_cache can help you with how they solved the problem of mapping entry count to memory usage. Good luck, -Martin [0] https://docs.python.org/3/library/functools.html#functools.lru_cache -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
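A small demonstration of functools.lru_cache; the calls list below exists only to prove that the repeated call never re-enters the function body:

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=128)
def square(n):
    calls.append(n)          # records genuine invocations only
    return n * n

results = [square(3), square(3), square(4)]
print(results, calls)        # the second square(3) is served from cache
print(square.cache_info())   # hits/misses/currsize bookkeeping, for free
```

cache_info() is also a reasonable starting point for the memory question: multiply currsize by a measured per-entry size to estimate the cache's footprint.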
Re: Simple exercise
>>> for i in range(len(names)): >>> print (names[i],totals[i]) >> >> Always a code smell when range() and len() are combined. > > Any other way of traversing two lists in parallel? Yes. Builtin function called 'zip'. https://docs.python.org/3/library/functions.html#zip Toy example: import string alpha = string.ascii_lowercase nums = range(len(alpha)) for N, A in zip(nums, alpha): print(N, A) Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
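Applied to the quoted names/totals loop, the zip version might read like this (the sample data is invented):

```python
names = ['alice', 'bob', 'carol']
totals = [12, 7, 19]

# zip pairs up the i-th elements of each list; no index bookkeeping needed
for name, total in zip(names, totals):
    print(name, total)
```

Note that zip stops at the shorter of the two lists, which is usually what you want when the lists are meant to be parallel.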
Re: issue with csv module (subject module name spelling correction, too)
Good afternoon Fillmore, >>>> import csv >>>> s = '"Please preserve my doublequotes"\ttext1\ttext2' >>>> reader = csv.reader([s], delimiter='\t') > How do I instruct the reader to preserve my doublequotes? Change the quoting used by the dialect on the csv reader instance: reader = csv.reader([s], delimiter='\t', quoting=csv.QUOTE_NONE) You can use the same technique for the writer. If you cannot create your particular (required) variant of csv by tuning the available parameters in the csv module's dialect control, I'd be a touch surprised, but, it is possible that your other csv readers and writers are more finicky. Did you see the parameters that are available to you for tuning how the csv module turns your csv data into records? https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters Judging from your example, you definitely want to use quoting=csv.QUOTE_NONE, because you don't want the module to do much more than split('\t'). Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
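With the sample string from the question, the QUOTE_NONE reader behaves like a bare split on tab; a quick check:

```python
import csv

s = '"Please preserve my doublequotes"\ttext1\ttext2'

# QUOTE_NONE tells the reader to treat quote characters as ordinary data
reader = csv.reader([s], delimiter='\t', quoting=csv.QUOTE_NONE)
row = next(reader)
print(row)
```

The doublequotes survive in the first field, exactly as requested.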
Re: Perl to Python again
Good afternoon Fillmore, > So, now I need to split a string in a way that the first element > goes into a string and the others in a list: > > while($line = <STDIN>) { > >my ($s,@values) = split /\t/,$line; > > I am trying with: > > for line in sys.stdin: >s,values = line.strip().split("\t") >print(s) > > but no luck: > > ValueError: too many values to unpack (expected 2) That means that the number of items on the right hand side of the assignment (returned from the split() call) did not match the number of variables on the left hand side. > What's the elegant python way to achieve this? Are you using Python 3? s = 'a,b,c,d,e' p, *remainder = s.split(',') assert isinstance(remainder, list) Are you using Python 2? s = 'a,b,c,d,e' remainder = s.split(',') assert isinstance(remainder, list) p = remainder.pop(0) Aside from your csv question today, many of your questions could be answered by reading through the manual documenting the standard datatypes (note, I am assuming you are using Python 3). https://docs.python.org/3/library/stdtypes.html It also sounds as though you are applying your learning right away. If that's the case, you might also benefit from reading through all of the services that are provided in the standard library with Python: https://docs.python.org/3/library/ In terms of thinking Pythonically, you may benefit from: The Python Cookbook (O'Reilly) http://shop.oreilly.com/product/0636920027072.do Python Module of the Week https://pymotw.com/3/ I'm making those recommendations because I know and have used these and also because of your Perl background. Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
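The Python 3 starred assignment, demonstrated with a made-up tab-separated line:

```python
line = 'key\tv1\tv2\tv3\n'

# First field into s, the rest into a list -- the analogue of
# Perl's:  my ($s, @values) = split /\t/, $line;
s, *values = line.rstrip('\n').split('\t')
print(s, values)
```

The `*values` target always binds a list, even when it catches zero or one items, so downstream code can rely on its type.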
Re: retrieve key of only element in a dictionary (Python 3)
>> But, I still don't understand why this works and can't puzzle it >> out. I see a sequence on the left of the assignment operator and a >> dictionary (mapping) on the right. > >When you iterate over a dictionary, you get its keys: > >scores = {"Fred": 10, "Joe": 5, "Sam": 8} >for person in scores: >print(person) > >So unpacking will give you those keys - in an arbitrary order. Of >course, you don't care about the order when there's only one. Oh, right! Clearly, it was nonintuitive (to me), even though I've written 'for k in d:' many times. A sequence on the left hand side of an assignment, will tell the right hand side to iterate. This also explains something I never quite bothered to understand completely, because it was so obviously wrong: >>> a, b = 72 TypeError: 'int' object is not iterable The sequence on the left hand side signals that it expects the result of iter(right hand side). But, iter(72) makes no sense, so Python says TypeError. I'd imagine my Python interpreter is thinking "Dude, why are you telling me to iterate over something that is so utterly not iterable. Why do I put up with these humans?" I love being able to iterate like this: for k in d: do_something_with(k) But, somehow, this surprised me: [k] = d Now that I get it, I would probably use something like the below. I find the addition of a few characters makes this assignment much clearer to me. # -- if len(d) > 1, ValueError will be raised # (key,) = d.keys() And thank you for the reply Chris, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
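The whole discussion, condensed into a runnable snippet (toy dictionaries):

```python
d = {"squib": "007"}
(key,) = d            # iterating over a dict yields its keys; one target, one key
assert key == "squib"

d2 = {"a": 1, "b": 2}
try:
    (k,) = d2         # two keys, one target: the unpack over-delivers
    outcome = "unpacked"
except ValueError:
    outcome = "ValueError"
print(key, outcome)
```

The ValueError branch is the safety property discussed above: the one-element assumption is checked for free at assignment time.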
Re: retrieve key of only element in a dictionary (Python 3)
OK, so ... I'll bite! >>> d = {"squib": "007"} >>> key, = d Why exactly does this work? I understand why the following three are similar and why they all work alike in this situation: key, = d (key,) = d [key] = d I also, intuitively understand that, if the dictionary d contains more than 1 key, that the above assignments would cause: ValueError: too many values to unpack But, I still don't understand why this works and can't puzzle it out. I see a sequence on the left of the assignment operator and a dictionary (mapping) on the right. I looked through the dunder methods [0], but none of them explained this, apparently, left-hand-side context-sensitive, behaviour to me. Could somebody explain? -Martin [0] for dict(), I found: __cmp__, __contains__, __delitem__, __eq__, __ge__, __getattribute__, __getitem__, __gt__, __init__, __iter__, __le__, __len__, __lt__, __ne__, __repr__, __setitem__ and __sizeof__ -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Beginner Python Help
Greetings Alan and welcome to Python, >I just started out python and I was doing a activity where im >trying to find the max and min of a list of numbers i inputted. > >This is my code.. > >num=input("Enter list of numbers") >list1=(num.split()) > >maxim= (max(list1)) >minim= (min(list1)) > >print(minim, maxim) > >So the problem is that when I enter numbers with an uneven amount >of digits (e.g. I enter 400 20 36 85 100) I do not get 400 as the >maximum nor 20 as the minimum. What have I done wrong in the code? I will make a few points, as will probably a few others who read your posting. * [to answer your question] the builtin function called input [0] returns a string, but you are trying to get the min() and max() of numbers; therefore you must convert your strings to numbers You can determine if Python thinks the variable is a string or a number in two ways (the interactive prompt is a good place to toy with these things). Let's look at a string: >>> s = '200 elephants' >>> type(s) # what type is s? <class 'str'> # oh! it's a string >>> s # what's in s? '200 elephants' # value in quotation marks! The quotation marks are your clue that this is a string, not a number; in addition to seeing the type. OK, so what about a number, then? (Of course, there are different kinds of numbers, complex, real, float...but I'll stick with an integer here.) >>> n = 42 >>> type(n) # what type is n? <class 'int'> # ah, it's an int (integer) >>> n # what's in n? 42 # the value * Now, perhaps clearer? 
max(['400', '20', '36', '85', '100']) is sorting your list of strings lexicographically instead of numerically (as numbers); in the same way that the string 'rabbit' sorts later than 'elephant', so too does '85' sort later than '400' * it is not illegal syntax to use parentheses as you have, but you are using too many in your assignment lines; I'd recommend dropping that habit before you start; learn when parentheses are useful (creating tuples, calling functions, clarifying precedence); do not use them here: list1 = (num.split()) # -- extraneous and possibly confusing list1 = num.split() # -- just right * also, there is a Tutor mailing list [1] devoted to helping with Python language acquisition (discussions on this main list can sometimes be more involved than many beginners wish to read) I notice that you received several answers already, but I'll finish this reply and put your sample program back together for you: num = input("Enter list of numbers: ") list1 = list(map(int, num.split())) print(list1) maxim = max(list1) minim = min(list1) print(minim, maxim) You may notice the map [2] function in there. If you don't understand it, after reading the function description, I'd give you this example for loop that produces the same outcome. list1 = list() for n in num.split(): list1.append(int(n)) The map function is quite useful, so it's a good one to learn early. Good luck, -Martin [0] https://docs.python.org/3/library/functions.html#input [1] https://mail.python.org/mailman/listinfo/tutor/ [2] https://docs.python.org/3/library/functions.html#map -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Path when reading an external file
Greetings, > In a program "code.py" I read an external file "foo.txt" supposed > to be located in the same directory that "code.py" > > python/src/code.py > python/src/foo.txt > > In "code.py": f = open('foo.txt', 'r') > > But if I run "python code.py" in an other dir than src/ say in > python/, it will not work because file "foo.txt" will be searched > in dir python/ and not in dir python/src/ > > I think it is possible to build an absolute path for "foo.txt" > using __file__ so that the program works wherever you launch > "python code.py" > > Is it the correct way to handle this problem ? Ayup, I would say so. My suggested technique: here = os.path.dirname(os.path.abspath(__file__)) foo = os.path.join(here, 'foo.txt') with open(foo, 'r') as f: pass Good luck, -Martin -- Martin A. Brown http://linux-ip.net/ -- https://mail.python.org/mailman/listinfo/python-list
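On Python 3.4 and later, pathlib expresses the same recipe with less punctuation; a sketch with a hypothetical path standing in for __file__:

```python
from pathlib import Path

def beside(script_path, name):
    """Path to a file sitting next to the given script, regardless of cwd."""
    return Path(script_path).resolve().parent / name

# Inside code.py one would call:  beside(__file__, 'foo.txt')
p = beside('/tmp/python/src/code.py', 'foo.txt')
print(p)
```

resolve() pins the path down to an absolute one, so the result is stable no matter which directory `python code.py` is launched from.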
Re: Most space-efficient way to store log entries
Hello Marc, I think you have gotten quite a few answers already, but I'll add my voice. > I'm writting an application that saves historical state in a log > file. If I were in your shoes, I'd probably use the logging module rather than saving state in my own log file. That allows the application to send all historical state to the system log. Then, it could be captured, recorded, analyzed and purged (or neglected) along with all of the other logging. But, this may not be appropriate for your setup. See also my final two questions at the bottom. > I want to be really efficient in terms of used bytes. It is good to want to be efficient. Don't cost your (future) self or some other poor schlub future working or computational efficiency, though! Somebody may one day want to extract utility out of the application's log data. So, don't make that data too hard to read. > What I'm doing now is: > > 1) First use zlib.compress ... assuming you are going to write your own files, then, certainly. If you also want better compression (quantified in a table below) at a higher CPU cost, try bz2 or lzma (Python3). Note that there is not a symmetric CPU cost for compression and decompression. Usually, decompression is much cheaper. # compress = bz2.compress # compress = lzma.compress compress = zlib.compress To read the logging data, then the programmer, application analyst or sysadmin will need to spend CPU to uncompress. If it's rare, that's probably a good tradeoff. Here's my small comparison matrix of the time it takes to transform a sample log file that was roughly 33MB (in memory, no I/O costs included in timing data). The chart also shows the size of the compressed data, in bytes and percentage (to demonstrate compression efficiency). 
format            bytes        pct    walltime
raw               34311602    100%    0.0s
base64-encode     46350762    135%    0.43066s
zlib-compress      3585508     10%    0.54773s
bz2-compress       2704835      8%    4.15996s
lzma-compress      2243172      7%    15.89323s
base64-decode     34311602    100%    0.18933s
bz2-decompress    34311602    100%    0.62733s
lzma-decompress   34311602    100%    0.22761s
zlib-decompress   34311602    100%    0.07396s
The point of a sample matrix like this is to examine the tradeoff between time (for compression and decompression) and to think about how often you, your application or your users will decompress the historical data. Also consider exactly how sensitive you are to bytes on disk. (N.B. Data from a single run of the code.) Finally, simply make a choice for one of the compression algorithms. > 2) And then remove all new lines using binascii.b2a_base64, so I > have a log entry per line. I'd also suggest that you resist the base64 temptation. As others have pointed out, there's a benefit to keeping the logs compressed using one of the standard compression tools (zgrep, zcat, bzgrep, lzmagrep, xzgrep, etc.) Also, see the statistics above for proof--base64 encoding is not compression. Rather, it usually expands input data to the tune of one third (see above, the base64 encoded string is 135% of the raw input). That's not compression. So, don't do it. In this case, it's expansion and obfuscation. If you don't need it, don't choose it. In short, base64 is actively preventing you from shrinking your storage requirement. > but b2a_base64 is far from ideal: adds lots of bytes to the > compressed log entry. So, I wonder if perhaps there is a better > way to remove new lines from the zlib output? or maybe a different > approach? Suggestion: Don't worry about the single-byte newline terminator. Look at a whole logfile and choose your best option. Lastly, I have one other pair of questions for you to consider. Question one: Will your application later read or use the logging data? 
If no, and it is intended only as a record for posterity, then, I'd suggest sending that data to the system logs (see the 'logging' module and talk to your operational people). If yes, then question two is: What about resilience? Suppose your application crashes in the middle of writing a (compressed) logfile. What does it do? Does it open the same file? (My personal answer is always 'no.') Does it open a new file? When reading the older logfiles, how does it know where to resume? Perhaps you can see my line of thinking. Anyway, best of luck, -Martin P.S. The exact compression ratio is dependent on the input. I have rarely seen zlib at 10% or bz2 at 8%. I conclude that my sample log data must have been more homogeneous than the data on which I derived my mental bookmarks for textual compression efficiencies of around 15% for zlib and 12% for