Steve, I am going to respond to just one part of your message and will snip the rest. I am not in disagreement with most of what you say and may simply stress different aspects. I will say that unless I have a reason to, I don't feel a need to test speeds for an academic discussion. Had this been a real project, sure. Even then, if it will need to run on multiple machines using multiple incarnations of Python, the results will vary, especially if the data varies too.

You suggest that discussions backed by real data are better. Sure. But when a discussion is abstract enough, I think it perfectly reasonable to say "may be faster" to mean that until you try it, there are few guarantees. Many times a method seems superior until you reach a pathological case. One sorting algorithm is fast in general except when the data is already almost fully sorted.
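Just as a throwaway illustration of that kind of pathological case (my own toy sketch, nothing to do with your code): a quicksort that naively pivots on the first element behaves nicely on shuffled data and badly on data that is already sorted.

import random

def naive_quicksort(items):
    # Toy quicksort that always pivots on the first element.
    if len(items) <= 1:
        return items
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return naive_quicksort(smaller) + [pivot] + naive_quicksort(larger)

data = list(range(900))          # kept small so the worst case stays under
random.shuffle(data)             # Python's default recursion limit
naive_quicksort(data)            # shuffled input: pivots split the data roughly in half
naive_quicksort(sorted(data))    # already-sorted input: every pivot is the minimum, so the
                                 # recursion depth grows with n and the total work is quadratic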
So why do I bother saying things like MAY? It seems to be impossible to please everybody. There are many things with nuance and exceptions. When I state things one way, some people (often legitimately) snipe. When I don't insist on certainty, others have a problem with that. When I make it short, I am clearly leaving many things out. When I go into as much detail as I am aware of, I get feedback that it is too long or boring or it wanders too much. None of this is a problem so much as a reality about tradeoffs.

So before I respond, here is a general statement. I am NOT particularly interested in much of what we discuss here from a specific point of view. Someone raises a question and I think about it. They want to know a better way to get a random key from a dictionary. My thought is that if I needed that random key, maybe I would not have stored it in a dictionary in the first place. But, given that the data is in a dictionary, I wonder what could be done. It is an ACADEMIC discussion with a certain amount of hand waving. Sometimes I do experiment and show what I did. Other times I say I am speculating, and if someone disagrees, fine. If they show solid arguments or point out errors on my part or produce evidence, they can change my mind.

You (Steve) are an easy person to discuss things with, but some are less so. People who have some idea of my style, understand the kind of discussion I am having at that point, and let me understand where they are coming from, can have a reasonable discussion. The ones who act like TV lawyers who hear that some piece of evidence has less than a one in a quadrillion chance of happening and then say BUT THERE IS A CHANCE, so reasonable doubt ... are hardly worth debating.

You replied to one of my points with this about a way to partition data:

---
The obvious solution:

keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys)*3//4
training_data = keys[:index]
reserved = keys[index:]
---

(In the above, "---" is not Python but a separator!)

That is indeed a very reasonable way to segment the data. But it sort of makes my point. If the data is stored in a dictionary, the way to access it ended up being to make a list and play with that. I would still need to get the values one at a time from the dictionary, such as in the ways you also show and I omit.

For me, it seems more natural in this case to simply have the data in a data frame, where I have lots of tools and methods available. Yes, underneath it all, providing an array of indices or True/False Booleans to index the data frame can be slow, but it feels more natural. Yes, Python has additional paradigms I may not have used in R, such as list comprehensions and dictionary comprehensions, that are conceptually simple. But I did use the R-onic (to coin a phrase nobody would ironically use) equivalents, which can also be powerful and which I need not discuss here on a Python list. Part of adjusting to Python includes unlearning some old habits and attitudes and living off this new land. [[Just for amusement, the original R language was called S, so you might call its way of doing things Sonic.]]

I see a balance between the various ways the data is used. Clearly it is possible to convert it between forms, and for reasonable amounts of data that can be fast enough. But as you note, at some point you can just toss one representation away, so maybe you need not bother using it in the first place. Keep it simple.
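To show what I meant by still having to pull the values out of the dictionary one at a time (this is my own hand-waving sketch, not the code of yours that I snipped), dictionary comprehensions do it cleanly enough:

import random

# Toy stand-in; in the discussion above, mydict is whatever dictionary the data lives in.
mydict = {f"id{i}": i * i for i in range(20)}

keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys) * 3 // 4
training_data = keys[:index]
reserved = keys[index:]

# The values still have to be fetched from the dictionary, one way or another.
training = {k: mydict[k] for k in training_data}
held_out = {k: mydict[k] for k in reserved}

# And the original "random key" question has an answer of the same flavor.
some_key = random.choice(list(mydict))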
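And for contrast, the data frame version I have in mind looks roughly like this (a sketch assuming pandas; df.sample and Boolean indexing do the work, and the column names are made up for the example):

import pandas as pd

# The same sort of toy data, but kept in a data frame from the start.
df = pd.DataFrame({"key": [f"id{i}" for i in range(20)],
                   "value": [i * i for i in range(20)]})

# Take a random 3/4 for training and keep the rest in reserve.
training = df.sample(frac=0.75)
reserved = df.drop(training.index)

# Or, equivalently, build a True/False mask and index the frame with it.
mask = df.index.isin(training.index)
also_reserved = df[~mask]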
In many real-life situations, you are storing many units of data and often have multiple ways of indexing the data. There are representations that do much of the work for you. Creating a dictionary where each item is a list or other data structure can emulate such functionality and even have advantages, but if your coding style is more comfortable with another way, why bother, unless you are trying to learn other ways and be flexible.

As I have mentioned too many times, my most recent work was in R, and I sometimes delight and other times groan at the very different ways some things are done when using specific modules or libraries. But even within one language and environment there are radically different approaches. The naked (albeit evolving) languages often offer a reasonable way to do things, but developers add new layers and paradigms that can become the more standard way of doing things. Base Python can do just about anything with just lists. All you have to remember is that you stored the zip code in the 23rd element. But programmers created things like named tuples, or specialized objects like data frames, that may be easier, or more powerful, or less prone to some kinds of errors, or whatever.
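To make that last point concrete, here is the sort of thing I mean (a throwaway sketch, not an endorsement of one style over the other):

from collections import namedtuple

# Bare-list style: position 2 "is" the zip code, and you simply have to remember that.
person_as_list = ["Ada", "Lovelace", "10117"]
zip_from_list = person_as_list[2]

# Named-tuple style: the field name carries the meaning, so misreads are less likely.
Person = namedtuple("Person", ["first", "last", "zip_code"])
person = Person("Ada", "Lovelace", "10117")
zip_from_tuple = person.zip_code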