On 3/6/2023 12:49 PM, avi.e.gr...@gmail.com wrote:
Thomas,

I may have missed any discussion where the OP explained more about proposed 
usage. If the program is designed to load the full data once, never get updates 
except by re-reading some file, and then handles multiple requests, then some 
things may be worth doing.

It looked to me, and I may well be wrong, like he wanted to search for a string anywhere 
in the text, so a grep-like solution is a reasonable start, with the actual data 
stored as something like a list of character strings you can search "one line" 
at a time. I suspect a numpy variant may be faster.
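
For illustration, a minimal sketch of that grep-like approach (the file name 
and one-record-per-line layout are my assumptions, not the OP's):

def load_lines(path):
    # Read the whole file once; one record per line is assumed.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def grep_like(lines, needle):
    # Case-insensitive substring scan, like grep -i.
    needle = needle.lower()
    return [line for line in lines if needle in line.lower()]

# lines = load_lines("cars.txt")      # hypothetical file
# matches = grep_like(lines, "V6")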

And of course any search function he builds can be made to remember some or all 
previous searches using a cache decorator. That generally uses a dictionary for 
the search keys internally.
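
The standard library's functools.lru_cache is one such decorator; a minimal 
sketch, with the data and the search logic invented as stand-ins:

import functools

LINES = ["Volvo v60 wagon", "Genesis GV60 EV", "Acura cl coupe"]  # assumed data

@functools.lru_cache(maxsize=1024)
def search(needle):
    # lru_cache memoizes results in an internal dictionary keyed by
    # the arguments, so a repeated query skips the scan entirely.
    return tuple(line for line in LINES if needle.lower() in line.lower())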

But using lots of dictionaries strikes me as helping only if you are searching for text 
anchored to the start of a line. If you ask for "Honda", you instead go to the 
dictionary called "h" and search perhaps just for "onda", then recombine the prefix 
with any results. But the example given wanted to match something like "V6" in the 
middle of the text, and I do not see how that would work, since you would then need 
to search all 26 dictionaries completely.
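
To make that limitation concrete, a sketch with invented data:

from collections import defaultdict

entries = ["Honda Accord V6", "Hyundai Sonata", "Volvo v60"]

# Bucket by first letter: fast for queries anchored at the start.
buckets = defaultdict(list)
for e in entries:
    buckets[e[0].lower()].append(e)

# Anchored query: one bucket to scan.
honda_hits = [e for e in buckets["h"] if e.lower().startswith("honda")]

# Mid-string query like "V6": every bucket must be scanned anyway.
v6_hits = [e for b in buckets.values() for e in b if "v6" in e.lower()]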

Well, that's the question, isn't it? Just how is this expected to be used? I didn't read the initial posting that carefully, and I may have missed something that makes a difference.

The OP gives as an example a user entering a string ("v60"). The example is for a model designation. If we know that this entry box will only receive model designations, then I would populate a dictionary using the model numbers as keys. The number of distinct keys will probably not be that large.

For example, highly simplified of course:

>>> models = {'v60': 'Volvo', 'GV60': 'Genesis', 'cl': 'Acura'}
>>> entry = '60'
>>> candidates = (m for m in models.keys() if entry in m)
>>> list(candidates)
['v60', 'GV60']

The keys would be lower-cased. A separate dictionary would give the complete string with the desired casing. The values could be object references to the complete information. If there might be several different models with the same key, then the values could be lists or dictionaries and one would need to do some disambiguation, but that should be simple and quick.
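
A sketch of that arrangement, with invented records (each lower-cased key maps 
to a list of candidates to allow for disambiguation):

records = {
    "v60": [{"display": "V60", "maker": "Volvo"},
            {"display": "V60 Cross Country", "maker": "Volvo"}],
    "gv60": [{"display": "GV60", "maker": "Genesis"}],
}

entry = "60"
candidates = [rec
              for key, recs in records.items() if entry.lower() in key
              for rec in recs]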

It all depends on the planned access patterns. If the OP really wants full-text search in the complete unstructured data file, then yes, a full text indexer of some kind will be useful. Whoosh certainly looks good though I have not used it. But for populating dropdown lists in web forms, most likely the design of the form will provide a structure for the various searches.
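
Not having used Whoosh, I can only offer a rough sketch of what a setup might 
look like; the schema, field names, and sample documents here are assumptions:

import os
from whoosh.fields import Schema, TEXT, STORED
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(model=TEXT(stored=True), details=STORED)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(model="v60", details="Volvo V60 wagon")
writer.add_document(model="gv60", details="Genesis GV60 EV")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("model", ix.schema).parse("v60")
    for hit in searcher.search(query):
        print(hit["model"], hit["details"])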

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On 
Behalf Of Thomas Passin
Sent: Monday, March 6, 2023 11:03 AM
To: python-list@python.org
Subject: Re: Fast full-text searching in Python (job for Whoosh?)

On 3/6/2023 10:32 AM, Weatherby,Gerard wrote:
Not sure if this is what Thomas meant, but I was also thinking dictionaries.

Dino could build a set of dictionaries with keys “a” through “z” that contain the data 
with those letters in them (I’m assuming case-insensitive search), and 
then just search “v” if that’s what the user starts with.

Increased performance may be achieved by building dictionaries “aa”, “ab” ... 
“zz”, and so on.

Of course, it’s trading CPU for memory usage, and there’s likely a point at 
which the cost of building dictionaries exceeds the savings in searching.
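
As a rough sketch of the two-letter variant (data invented), one could index 
every two-character substring so that a query of two or more characters is 
routed to a single small bucket:

from collections import defaultdict

entries = ["Honda Accord", "Hyundai Sonata", "Volvo v60", "Genesis GV60"]

# Map every two-character window to the entries that contain it.
pairs = defaultdict(set)
for e in entries:
    low = e.lower()
    for i in range(len(low) - 1):
        pairs[low[i:i + 2]].add(e)

query = "v6"
bucket = pairs.get(query[:2].lower(), set())
hits = [e for e in bucket if query.lower() in e.lower()]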

Chances are it would only be seconds at most to build the data cache,
and then subsequent queries would respond very quickly.


From: Python-list <python-list-bounces+gweatherby=uchc....@python.org> on behalf of 
Thomas Passin <li...@tompassin.net>
Date: Sunday, March 5, 2023 at 9:07 PM
To: python-list@python.org <python-list@python.org>
Subject: Re: Fast full-text searching in Python (job for Whoosh?)

I would probably ingest the data at startup into a dictionary - or
perhaps several depending on your access patterns - and then you will
only need to do a fast lookup in one or more dictionaries.

If your access pattern would be easier with SQL queries, load the data
into an SQLite database on startup.
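
A minimal sketch of that route with the standard-library sqlite3 module 
(table and column names invented):

import sqlite3

conn = sqlite3.connect(":memory:")   # load once at startup
conn.execute("CREATE TABLE cars (model TEXT, maker TEXT)")
conn.executemany("INSERT INTO cars VALUES (?, ?)",
                 [("v60", "Volvo"), ("GV60", "Genesis"), ("cl", "Acura")])

# Substring search; LIKE is case-insensitive for ASCII by default.
rows = conn.execute(
    "SELECT model, maker FROM cars WHERE model LIKE ?",
    ("%v60%",)).fetchall()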

IOW, do the bulk of the work once at startup.
--
https://mail.python.org/mailman/listinfo/python-list


--
https://mail.python.org/mailman/listinfo/python-list
