[REAL SUBJECT: Beats me] Steven, I often find that I try to make a main point and people then focus on something else, like an example.
So, do we agree on the main point: that choosing a specific data structure or algorithm (or even computer language) too soon can lead to problems that could be avoided if we first map out the problem and understand it better? When someone asks a question here like "how do I get a data structure to do this," the answer might reasonably include asking whether that is the right data structure for the job. And yes, there may be no single RIGHT choice, only degrees of fit that include subjective components.

The rest of your post is really a rehash of the periodic efficiency discussions. I do not concede that efficiency can be ignored because computers are fast. I do concede that optimizing is often not worth the effort, that you can inadvertently make things worse, and that there are tradeoffs.

Let me be specific. The side topic was how to get a random key from an existing dictionary. If you do this ONCE, it may be no big deal to make a list of all keys, index it with a random number, and move on. I did supply a solution that might (or might not) run faster by using a generator to get one item at a time and stopping when the target is found: less space, but I am not sure about less time.

But what I often need to do is segment lots of data into two piles: one for training some machine learning algorithm, and the remainder for verification. The choice must be random or the entire project may become meaningless. So if your data structure is a dictionary whose key names are promptly abandoned, you cannot just call popitem() umpteen times and expect random results, as the items may come back in a very specific order. If you want 75% of the data in the training section and 25% reserved, and you have millions of records, what is a good way to go? Worse, some algorithms build forests and take numerous random samples for each tree as they go along. So you can wonder whether the dictionary is the best way to go if there is no fast way to get random records.
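To make the 75/25 split concrete, here is a minimal sketch (my illustration, not from the thread; the dictionary `records` and its contents are made up, and scaled down from "millions" for brevity): materialize the keys once, shuffle them, and slice.

```python
import random

# Hypothetical data: record-id -> record. In the real scenario this
# would be millions of entries; 100_000 keeps the sketch quick.
records = {f"id{i}": {"value": i} for i in range(100_000)}

# Materialize the keys ONCE, shuffle, and slice into 75% / 25%.
keys = list(records)
random.shuffle(keys)
cut = int(len(keys) * 0.75)
train_keys, test_keys = keys[:cut], keys[cut:]

# For a single random key, the simple approach is usually enough:
one_key = random.choice(list(records))
```

Because the shuffle touches the key list only once, repeated draws for trees in a forest can slice or sample from `keys` without rebuilding anything.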
If you are told that while doing this you also need to allow simultaneous access to the data structure, with new key/value pairs being inserted, modified, or removed, that may make you wonder whether other choices would be better overall.

One quite obvious improvement is not to create a full list each and every time you want one random number. There are many ways to keep the list of keys standing around all the time, or to create it on first need and keep it as long as needed; a simple generator that stays around can do that. Asking for 2 million keys one after another can slow things down if you rebuild the list from scratch each time, with garbage collection working overtime. Other solutions might simply copy the dictionary into a data frame or whatever else it takes. Each solution has merits as well as drawbacks, and in any case, spending too much time on this may not be worthwhile. As is often noted, using something already in existence, often highly optimized and debugged, is frequently the deciding factor.

I believe I mentioned that the implementation that creates a list from a generator may be faster than your own algorithm that appears, on paper, to be more efficient. My earlier mention is an example. I agree that the implementation of lists results in an indexed lookup for the nth item and is rapid. I was referring to algorithms that generate one item at a time and find the target on average halfway through, in the worst case reading the entire sequence. In theory they can be faster than creating the entire list at once; in practice, maybe not, and in any case this may not be the best place to optimize.

I tend to avoid extremes, so let me make one thing very clear.
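One way to sketch the "keep the key list around" idea above (my example, with a made-up wrapper class; nothing here is from the thread): cache the keys on first use and rebuild the cache only when the dictionary's size changes.

```python
import random

class RandomKeyCache:
    """Cache a dict's keys so repeated random draws avoid rebuilding the list.

    Assumption: a size check is enough to detect changes. A dict mutated
    without changing size (values replaced in place) keeps the same keys,
    so the cache remains valid; an insert or delete triggers a rebuild.
    """

    def __init__(self, d):
        self._d = d
        self._keys = None
        self._size = -1

    def random_key(self):
        # Rebuild the cached key list only when the dict's size changed.
        if self._keys is None or len(self._d) != self._size:
            self._keys = list(self._d)
            self._size = len(self._d)
        return random.choice(self._keys)

d = {i: i * i for i in range(1000)}
cache = RandomKeyCache(d)
k = cache.random_key()  # first call builds the cache; later calls reuse it
```

Two million draws against an unchanged dictionary then cost one list build plus two million constant-time index operations, instead of two million list builds.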
I neither like it when people make snap decisions about how to do something with software and refuse to reconsider when reality intervenes, nor do I like it when they need endless (often fairly meaningless) planning steps and so never actually start. I like a prototyping approach with flexibility: do some planning, try some things out, perhaps with a set of tools that lets you put a skeleton of the idea together to see if it fits. If not, adjust. When it looks reasonable, flesh it out with lots more rigor and bulletproofing.

For this forum, the questions that come in range from people who have no idea how to even start a project to those who are well along and get stuck at some point. The former may need help making a general plan of attack; the latter may just need someone to point out an error or design flaw. One size does not fit all.

And, as Steven points out, there are tradeoffs. For a student exercise intended to teach a feature of the language, it is not that important that the function checks whether it was called with sensible arguments, like a positive number when asked how many ice cream scoops are wanted. For production code, much more may be required. And extremely clever solutions that may not even be understood often defeat the purpose.

-----Original Message-----
From: Tutor <tutor-bounces+avigross=verizon....@python.org> On Behalf Of Steven D'Aprano
Sent: Wednesday, December 26, 2018 12:45 AM
To: tutor@python.org
Subject: Re: [Tutor] decomposing a problem

On Tue, Dec 25, 2018 at 11:56:21PM -0500, Avi Gross wrote:

> I find that many people are fairly uncomfortable with abstraction and
> tend to resist a pure top down approach by diving to any solutions
> they may envision.

https://blog.codinghorror.com/it-came-from-planet-architecture/

> As someone asked on another python list, is there a better way to get
> a random key for a dictionary.
> Well, not easily without expanding all
> keys into a list of perhaps huge length.

Define "better". What do you value? Time, space, simplicity, or something else?

One of the most harmful things to value is "cleverness" for its own sake. Some people tend to value a "clever" solution even when it wastes time and space and is so over-complex that it is hard to maintain or debug, even when the practical response to the "clever" solution is "YAGNI".

What counts as "huge"? To me, picking a random key from a list of 100 keys is "huge": copy out 100 keys to a list by hand and then pick one? What a PITA that would be. But to your computer, chances are that ten million keys is "small". One hundred million might be pushing "largish". A billion, or perhaps ten billion, could be "large". Fifty, a hundred, maybe even a thousand billion (a trillion) would be "huge".

Unless you expect to be handling at least a billion keys, there's probably no justification for anything more complex than:

    random.choice(list(dict.keys()))

Chances are that it will be faster *and* use less memory than any clever solution you come up with -- and even if it does use more memory, it uses it for a few milliseconds, only when needed, unlike a more complex solution that inflates the size of the data structure all the time, whether you need it or not.

Of course there may be use-cases where we really do need a more complex, clever solution, and are willing to trade off space for time (or sometimes time for space). But chances are YAGNI.

> Followed by a search of much of that list to get the nth index.

That's incorrect. Despite the name, Python lists aren't linked lists[1] where you have to traverse N items to get to the Nth item. They're arrays, where indexing requires constant time.

[...]

> If they keep demanding one function to master all, you can end up with
> fairly awful spaghetti code.
https://en.wikipedia.org/wiki/God_object

[1] Technically speaking, this is not a requirement of the language, only a "quality of implementation" question. A Python interpreter could offer built-in lists using linked lists under the hood, with O(N) indexing. But all the major implementations -- CPython, Stackless, PyPy, Jython, IronPython, Cython, Nuitka, even (I think) MicroPython -- use arrays as the list implementation. Given how simple arrays are, I think it is fair to assume that any reasonable Python interpreter will do the same.

--
Steve
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor