On 04/11/2020 19:12, Avi Gross wrote:
> My comments at end:
>
> -----Original Message-----
> From: Python-list <python-list-bounces+avigross=verizon....@python.org> On Behalf Of duncan smith
> Sent: Wednesday, November 4, 2020 1:09 PM
> To: python-list@python.org
> Subject: Re: Find word by given characters
>
> On 04/11/2020 04:21, Avi Gross wrote:
>> Duncan, my comments below yours at end.
>>
>> ---YOURS---
>> The Counter approach only requires iterating over the letters once to construct the letters bag, then each word once to create the relevant word bag. After that it's (at worst) a couple of lookups and a comparison for each unique character in letters (for each word).
>>
>> Your approach requires iteration over the words to create the lists of characters. Then they are (potentially) iterated over multiple times looking for the characters in letters. There's also the cost of removing items from arbitrary positions in the lists. Also, it seems that the character frequencies must be equal, and your approach only ensures that the words contain at least the required numbers of characters.
>>
>> In computational terms, if the first approach is something like O(n+m) for n letters and words of length m, your algorithm is more like O(nm). Not to say that it will be slower for all possible letters and dictionaries, but probably for any realistic cases and a lot slower for large enough dictionaries.
>>
>> Duncan
>>
>> --MINE---
>>
>> I appreciate your analysis. I have not looked at the "counter" implementation and suspect it does some similar loops within, albeit it may be implemented in a compiled language like C++.
>
> Before the introduction of counters I would have used a dict to create a mapping of characters to counts. That would require iterating over the characters in a string only once to create / update the dict entries, and I doubt counters are implemented less efficiently than that.
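
For concreteness, the counter-based version I have in mind is roughly the sketch below. Here `words` is assumed to be a list of lowercase strings already read from the dictionary file, and the comparison could be tightened to equality if the character frequencies really do have to match exactly.

from collections import Counter

def find_words(letters, words):
    # One bag for the available letters, one bag per word; a word qualifies
    # if it needs no more of any character than the letters provide.
    letters_bag = Counter(letters)
    matches = []
    for word in words:
        word_bag = Counter(word)
        if all(letters_bag[ch] >= count for ch, count in word_bag.items()):
            matches.append(word)
    return matches
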
>> I did not write out my algorithm in Python but have done it for myself. It runs fast enough with most of the time spent in the slow I/O part.
>>
>> We can agree all algorithms have to read in all the words in a data file. There may be ways to store the data such as in a binary tree and even ways to thus prune the search as once a node is reached where all required letters have been found, all further words qualify below that point. If you match say "instant" then instants and instantiation would be deeper in the tree and also qualify assuming extra letters are allowed.
>
> I don't see how a binary tree would be useful. As I've pointed out in another post, there are other data structures that could be useful. What I had in mind was a trie (prefix tree). But it's only the distinct characters and frequencies that are relevant and so I'd exploit that (and one or two other things) to reduce space and improve search.
>
>> We may differ on the requirement as I think that the number of repeats for something like a,t,t needs to be at least as big as in "attend" but that "attention" with yet another "t" would also be OK. If I am wrong, fine, but I note the person requesting this has admitted a certain lack of credentials while also claiming he made up a scenario just for fun. So this is not actually a particularly worthy challenge let alone with a purpose.
>>
>> My impression is that the average word length would be something small like 5-7. The number of words in a dictionary might be 100,000 or more. So if you want efficiency, where do you get the bang for the buck?
>>
>> I would argue that a simple test run on all the words might often narrow the field to a much smaller number of answers like just a few thousand or even much less. Say you test for the presence of "aeiou" in words, in whatever order. That might be done from reading a file and filtering out a relatively few potential answers. You can save those for a second round to determine if they are fully qualified by any additional rules that may involve more expensive operations.
>
> Your proposed approach didn't involve any trees (or tries) or filtering of words. So I don't see how any of this justifies it.
>
>> How fast (or slow) are regular expressions for this purpose? Obviously it depends on complexity and something like "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" would be easy to construct once but likely horribly inefficient in searching and a bit of overkill here. I suspect there is already some simple C function that could be used from within python that looks like findall(choices, word) that might return how many of the letters in choices were found in word and you simply compare that to length(word) perhaps more efficiently.
>>
>> It looks easier to check if a character exists in one of the ways already discussed within python using a loop. Something as simple as this:
>>
>> needed = "aeiou"
>> trying = "education"
>> found = all([trying.find(each) >= 0 for each in needed])
>> print(found)
>>
>> trying = "educated"
>> found = all([trying.find(each) >= 0 for each in needed])
>> print(found)
>>
>> The above prints True and then False.
>>
>> My point is you can use the above to winnow down possible answers and only subject that smaller number to one of many tests including using a counter, making a dictionary you can query (albeit probably slower for fairly short words) and so on.
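
If I follow, that winnowing pass would be something like the sketch below (again assuming `words` is a list of lowercase strings); the survivors would still need the frequency check afterwards.

needed = "aeiou"

# First pass: keep only words that contain every needed character at least once.
candidates = [word for word in words if all(ch in word for ch in needed)]
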
> That still involves iterating over a word's characters repeatedly. And that's just to identify if a word is a possible match.
>
>> Back to your other point, you suggest iterating over characters (I happened to do it in a list) multiple times can result in duplication as the same characters are searched repeatedly. Sure. I simply note most words are short. How much overhead is there searching for say nine characters five times? Compare that to say creating a dictionary or other data structure once, and then making a hash out of each letter being searched for?
>
> Exactly. If the words are short enough, and removing items from arbitrary positions in lists is quick enough, and dict creation and lookups are sufficiently expensive - then it might be quicker. But we have a reasonable idea which of these things are expensive and which are quick. If in doubt we can time it.
>
>> I can envision many possible ways to speed this up but many involve more use of memory or all other kinds of overhead such as initializing an object and calling various member functions and then deleting it or re-initializing it. My suggestion of removing items from a short list is not ideal and depending on how a bag was otherwise implemented, I could see it being better but again, how much is saved if you have a nine letter word that may contain nine unique letters or even have repeated letters as in banana? You end up saving a copy of the letter (or a hashed version) plus some integer value (perhaps just a byte) for the count. I would need to see the implementation and do some speed tests, ...
>>
>> In real programming, there are many choices. This is a bit of a hypothetical but a part of me suggests the best answer for a quick prototype is something simple and not necessarily tuned for speed, just to show it can be done. Then, if the code is used often and is slow, sure, enhance it. Back in my UNIX days, I might have started with a rather inefficient pipeline for my example like:
>>
>> grep 'a' filename | grep 'e' | grep 'i' | grep 'o' | grep 'u' | ...
>>
>> That would produce a list of candidates, again not efficiently. I might have switched to using better regular expressions, perhaps using sed instead, or AWK or even PERL. But if it had to be fast production code, I might have used C then C++ or whatever latest tool I had.
>>
>> These days, some things run so fast that we get lazy. When doing a pipelined approach in R or Python that analyzes data, I often effectively have to apply a set of conditions to lots of data, one after another. Do you trim the number of columns in a data frame first or do you apply a condition that trims the number of rows first? Clearly if you need 5 columns out of 666, then chances are you narrow the columns so there is less useless copying in later parts of the pipeline. But what if you have multiple conditions that each remove some rows and it depends on what the data looks like? For example, remove all rows in which one variable is NA and also all rows where another two variables are not equal to each other and so on. Isn't part of the consideration how expensive each comparison is? Imagine if you want to remove rows where a word in one column is not in the scrabble dictionary and your stupid function has to read it in each and every time! I would want that to be done last after cheaper methods removed say all words longer than 10 characters or that had uppercase or numeric parts or spaces or were NA.
>>
>> So if the functionality in the topic above was really being built, and the functionality was being called repeatedly as some game progressed, you definitely might want to speed it up, if nothing else, by reading the dictionary file into memory once, perhaps into a data structure like a modified binary tree which could be searched with tail recursion or short-cuts. And in some cases, you might even want odd things like to cache the results of earlier queries in a dictionary and skip a full repeat search entirely.
>>
>> You might argue this situation always gets called with different arguments. Perhaps, but imagine writing a program where the computer plays against a human. The computer might look at every open region of a scrabble board and take the tiles already in their "hand" and add zero or more tiles indicating characters already on the game board that it wanted to extend into or beyond. Say it sees the letter "r" and wonders if it could place all letters immediately to the left of that "r", then whether the "r" could be in multiple places within the word formed, and finally whether it could make a word starting with "r". That could be multiple identical calls, perhaps only differing in the order of the characters it is searching for. And it might repeat similar calls in another part of the board that has a dangling "r", perhaps the same one it did not use before. Obviously, the goal here would be to find all potentially possible words it could actually fit into that slot so further analysis would happen and it would investigate which method provided more points. It may even look ahead to see if that move allowed the opponent a better counter-move and so on into deeper analyses. Much of the expense of deciding what to actually do would be outside the functionality described.
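
For what it's worth, if identical queries really did repeat, caching would be cheap once the letters are put into a canonical form so that "tra" and "art" share an entry. A rough sketch, where WORDS is assumed to be the word list already loaded into memory:

from collections import Counter
from functools import lru_cache

# WORDS is assumed to be the full word list, loaded once at start-up.
@lru_cache(maxsize=None)
def cached_find(letters_key):
    # letters_key is a sorted string, e.g. "art" for the rack "tra".
    letters_bag = Counter(letters_key)
    return tuple(word for word in WORDS
                 if all(letters_bag[ch] >= n for ch, n in Counter(word).items()))

matches = cached_find("".join(sorted("tra")))
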
>> Admittedly, I see no reason for it to actually ask for these words as described that could be longer, but only proper permutations of the letters. My point is I can envision places where caching earlier results would be potentially a great speedup. Heck, I suggest the problem could be turned around as the number of combinations might be computed and simply tested to see if they exist in a Python dictionary made from the external dictionary.
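
That turned-around version is easy enough to sketch with itertools, though the number of orderings grows factorially, so it only looks attractive for short racks. Here word_set is assumed to be a set built from the dictionary file:

from itertools import permutations

def words_from_rack(letters, word_set):
    # Try every ordering of every subset of the rack and keep the ones
    # that are real words (word_set is an assumed, pre-built set of words).
    found = set()
    for r in range(1, len(letters) + 1):
        for perm in permutations(letters, r):
            candidate = "".join(perm)
            if candidate in word_set:
                found.add(candidate)
    return found
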
>> Heck, why even write the main routine in an interpreted language when you can link in a function written to be even more efficient?
>>
>> But enhancements like that come with a cost. Would you pay a programmer an extra month to speed something like that up 5%? Maybe, but often not. And if you add in the typical costs in a large organization of various kinds of testing needed for a more complex or subtle approach, ...
>>
>> Avi
>
> So the upshot of this is that the use of counters is premature optimisation? I don't agree. I'm not even convinced that your proposed approach is simpler. We need to compare the frequencies of characters, and counters give us those frequencies directly. Then we compare them. Premature optimisation might be "write a trie class, then ...".
>
> I like to start with clean code that's not obviously inefficient and that I might not have to revisit later (to tidy up or optimise). For me, that would be the approach using counters.
>
> Duncan
>
> ----MY COMMENTS on 11/02/2020---
>
> Duncan, my main point is an academic discussion and I have no experience or complaints about using the excellent suggestions of solving a problem using a counter. Once a good implementation of something is available and reliable and efficient, anyone who knows how to use it can use it to write decent code.
>
> The person posting this "problem" seems to be learning by making up vague projects for themselves. When I have taught programming, the emphasis has been to learn the LANGUAGE first and how to do things in the core language and perhaps standard additions. Aspects like loops and lists are fairly standard in Python. Arguably you could suggest so are many aspects of object-oriented and functional programming these days. But generally (and perhaps wrongly) we often are expected to learn how to solve problems using the basics, often partially to learn how to compose a logical set of steps as an algorithm. Later we may learn that much of what we need has already been taken care of and you have modules or packages or libraries that often let you work at a higher level.
>
> So an approach that effectively uses an abstraction that counts everything in one logical step and allows you to ask if one thing is a subset of another is probably even better from a programming perspective. I am not saying it is a premature optimization. It may turn out not to be optimal, though, given the scenario where small enough words dominate. It is like with dictionaries in Python where the inner method has to allocate space for a hash table of some size that can grow. Given the nature of this problem, there is an obvious way to make a data structure that is more compact and probably faster.
>
> Again, just as an academic thought experiment, what we require is the ability to store info about no more than 26 lower-case letters of the alphabet and count how many. I mean, for now, assume we are limited to ASCII, not all Unicode characters. So something as simple as a vector of fixed length 26 would be enough and the hash function would simply subtract a fixed number offset from the value of the character as a byte of 8 bits. The count value is very limited as the longest scrabble word might be on the order of length 20 and no word is likely to have more than 5 or so repetitions of any one character. So having a "counter" bag/dictionary may be overkill in this scenario. As for the subset methodology, it can be a simple vectorized operation as something that acts like a vector (perhaps using numpy) can be used to do something like A >= B to get a vector of True/False and require them all to be True if it is a subset.
>
> So, to be clear, I see lots of ways to program most things. Some are arguably better for instructional purposes in a particular language and some are easier on the programmer and some are more efficient and some are easier/harder to debug and so on. When I, or others, present an alternative, it often does not mean existing proposals are wrong or even bad.
>
> Your points about how often characters or list items are scanned are valid. But O(N) type arguments must also consider the cost of such access, not just how many times. Also valid are your concerns about the costs of removing list items at random locations. And, no, my original example did not use exotic things like trees. But I have had many years of experience in many languages and in each I develop a toolbag to work with. For longer lists where I wanted to be able to insert or delete at random, I might use something like linked lists that allow efficient changes without copying the entire list repeatedly. Of course, that works better in languages that support something like pointers. For short lists like in this example, the overhead may not be worth it and a bit of copying may be as fast or faster and not worth worrying about.
>
> I think I have said more than enough and won't continue this thread. There are many fairly decent ways of doing things and this particular problem is not particularly interesting as I see little actual reason to build it. But having said that, I note lots of apps for my phone and elsewhere that are games for making words in many different ways, often by permuting a set of letters you can use. Underneath the hood, I suspect they have a dictionary they have to consult, and wonder how they implement their searches.
>
> --
> https://mail.python.org/mailman/listinfo/python-list
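
The fixed-length vector idea described a couple of paragraphs up might look something like this (a sketch using numpy, assuming lowercase ASCII only):

import numpy as np

def char_vector(s):
    # 26 counts, one slot per lowercase ASCII letter.
    counts = np.zeros(26, dtype=np.int8)
    for ch in s:
        counts[ord(ch) - ord("a")] += 1
    return counts

rack = char_vector("attend")
word = char_vector("tan")
fits = bool((rack >= word).all())  # True: "tan" needs no more of any letter than the rack has
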
You asked, "what would be wrong with this approach" and "am I missing something about other approaches being better or more efficient". Those are the questions I answered. I explained why I thought your approach would be likely to be comparatively slow for realistic use cases. The OP says he is coming back to Python after a lay off (having used Python2 before). So I didn't see this as a teaching exercise. Rather, I showed how I would do it in Python3. Duncan -- https://mail.python.org/mailman/listinfo/python-list