An interesting discussion ... thanks :). I tried "grep -f" on a couple of large text files I had.
I have a file of "words", one word per line, all lower-case ASCII, that I had accumulated from various spell checkers and such. This "words" file has 395,696 unique (more or less) English words in it. I also have a large "hosts" file that I use as part of my spammy-site filtering, with 345,552 lines in it, each line holding the IP address 127.0.0.1 and a host URL.

I tried "grep -f words hosts", using grep (GNU grep) 2.26, and monitored it from another screen. Most lines in the hosts file matched one or more of the shorter words in the words file, but the resident memory footprint kept climbing, non-stop, at a rate of over 1 GByte of RAM per second. In less than a minute, after line 157 of the 345,552 lines in the hosts file had been output as a match, all 32 GBytes of my RAM were consumed, and I spent the next minute regaining control of my system. So far as I know, no useful searching occurred once I ran out of RAM.

I would suggest six things:

1) Something seems broken if the "grep -f" algorithm requires continuously increasing resident RAM even after it has finished whatever initialization it needs and is busy scanning the search space for matches. At the rate it was growing, I would have needed (345,552 / 157) * 32 GBytes, which is about 70 terabytes of RAM, larger than any system that I know of, and it would have taken (wild estimate) 20 hours to complete. Ouch!

2) In those cases where grep -f seemed to take "forever" ... or however long until the user gave up ... I would guess that's because the computer's memory was thrashing to disk. As soon as a problem no longer fits in RAM, it slows down by multiple orders of magnitude. Take my test above: I was getting a few matching lines of output per second ... until I ran out of RAM and could no longer see anything, because my window would not refresh, show further output, or even show my mouse cursor.

3) Yes, the default behaviour should not require the user to choose the algorithm. However, if that behaviour cannot be, or is not yet, provided, then the default should do the best it can, but there should also be options to select the algorithm, if multiple algorithms are coded, for those cases where the user "requires the very best" (or at least cannot tolerate what might be dreadfully bad) for their particular problem.

4) When I have many things to search for, as with the large "words" file above, I look for ways to simplify the space being searched, so that, for example, I only have to look for matching words at the start of a line or after a specific field-separator character. Then I can load my "words" into some sort of tree or associative array (such as a Python dictionary, Patricia trie, Judy tree, or lmdb database) and look each line or field up in that data structure.

5) I've never actually had to solve a problem with both a long list of "words" to search for and a search space in which a matching word could start at almost any character, but if I did, I'd guess that I'd try my approach from (4) above, and just accept that I would have to loop over every character in the search space, starting a lookup into my tree or associative array at each such position. Since the "words" file in my example above is about 4 MBytes, a tree or associative array with those words loaded should fit in 5 to 10 MBytes of RAM, give or take, and the solution would proceed reliably without consuming much more RAM than that ... just with an inner loop executed 12,174,500 times, once for each character in my "hosts" file as a possible starting point. Fortunately, only 6,414,428 of those characters are lower-case letters, so the other 5,760,072 (not lower-case letters) would fall through quickly, not even matching on the first lookup.
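To make (4) and (5) concrete, here is a rough sketch, in plain Python, of what I have in mind, using a simple dict-of-dicts in place of anything as fancy as a Patricia trie or Judy tree. It is only an illustration, not tuned code; the file names "words" and "hosts" are just the ones from my example above.

    # Rough sketch: load the "words" file into a simple dict-of-dicts trie,
    # then try a trie walk starting at every character position of every
    # "hosts" line, printing each line that contains at least one word.

    END = None                          # marker key: "a word ends here"

    trie = {}
    with open('words') as f:
        for word in f:
            word = word.strip()
            if not word:
                continue                # skip blank lines
            node = trie
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True

    with open('hosts') as f:
        for line in f:
            line = line.rstrip('\n')
            for i in range(len(line)):  # every character is a possible start
                node = trie
                j = i
                while j < len(line) and line[j] in node:
                    node = node[line[j]]
                    j += 1
                    if END in node:     # some word begins at position i
                        print(line)
                        break
                else:
                    continue            # nothing starts here; try the next i
                break                   # matched; one line of output is enough

In CPython that dict-of-dicts will cost a good deal more memory than a compact Patricia trie or Judy tree would, but the property I care about holds either way: the footprint is fixed once the words are loaded, rather than growing as the search runs, and positions that do not start any known word fall out of the inner loop on the very first lookup.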
6) The problem considered in (4) above is one that I actually run into regularly in some other work that I am doing. I have a command that I use from the shell prompt (along with artful use of many other classic Unix shell commands), implemented using a Python dictionary, that solves such problems much more quickly than "grep -f" can. If you're an ancient Unix shell hacker such as myself, you might find it useful: http://pauljackson.us/dict.py

Using this dict command, I can find the 10,392 entries in my "hosts" file whose URL has its second (from the left) dot-separated component matching one of the 395,696 words in my "words" file, in about 1.4 seconds of CPU time for the dict command and about 4 MBytes of RAM for its dictionary, using the following shell commands:

    < words sed 's/.*/& & MATCH/' > /tmp/words.1
    < hosts awk -F. '/^127.0.0.1/ {print $5, $0}' | dict /tmp/words.1 | grep MATCH | wc

The awk -F. command copies the second (from the left) dot-separated component of the URL on each line to the front of that line; each active line in my hosts file has the IP address 127.0.0.1, hence the /^127.0.0.1/ pattern. I habitually have dict replace any first field that matches with that same field, and/or whatever other replacement value I want, plus the word "MATCH", so that I can then easily grep out just the matching lines.

Yes, that is a lot of detail for this item (6), but when you are routinely solving problems that can be cast in this form, the "1.6 seconds" for even a rather large example is compelling. In my most common use case, I am dealing with over 10 million entries in either the list of words I am looking for or the search space I am looking in. No variant of "grep -f" that I have ever seen can handle such problems.
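The actual dict.py is at the URL above; purely to give the flavour of the thing, here is a minimal sketch of the kind of dictionary join described in (6), with the field and replacement handling simplified. It is an illustration of the idea, not the real script.

    #!/usr/bin/env python
    # Rough sketch of a key-based join, NOT the real dict.py: the first
    # field of each line of the mapping file is the key, and the rest of
    # that line is the replacement.  Lines on stdin whose first field is
    # a known key get that field replaced; other lines pass through, so a
    # later "grep MATCH" keeps only the lines that matched.
    import sys

    mapping = {}
    with open(sys.argv[1]) as f:
        for line in f:
            fields = line.rstrip('\n').split(None, 1)
            if fields:
                mapping[fields[0]] = fields[1] if len(fields) > 1 else ''

    for line in sys.stdin:
        line = line.rstrip('\n')
        fields = line.split(None, 1)
        if fields and fields[0] in mapping:
            rest = fields[1] if len(fields) > 1 else ''
            print(mapping[fields[0]] + ' ' + rest)
        else:
            print(line)

Used in the pipeline above in place of dict, something along these lines does the same kind of key-based join: one dictionary lookup per line, rather than a scan of 395,696 patterns per line.

--
Paul Jackson
p...@usa.net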