> My problem. I have lists of substrings associated to values:
> 
> ['a','b','c','g'] => 1
> ['a','b','c','h'] => 1
> ['a','b','c','i'] => 1
> ['a','b','c','j'] => 1
> ['a','b','c','k'] => 1
> ['a','b','c','l'] => 0  # <- Black sheep!!!
> ['a','b','c','m'] => 1
> ['a','b','c','n'] => 1
> ['a','b','c','o'] => 1
> ['a','b','c','p'] => 1
> 
> I can check a bit of data against elements in this list and determine whether
> the value to be associated to the data is 1 or 0.
> 
> I would like to make my matching algorithm smarter so I can reduce the total
> number of lists:
> 
> ['a','b','c','l'] => 0  # If "l" is in my data I have a zero
> ['a','b','c'] => 1      # or a more generic match will do the job
> 
> I am trying to think of a way to perform this "reduction", but I have a
> feeling I am reinventing the wheel.
> 

Well, you are right about it being vague <grin>.

Ultimately, this would depend on knowing more about what you are trying to 
achieve and what types of data you are actually using, as well as what the 
ranges of potential values are.

So, for your simple example, I would do exactly what you suggested (check for 
the "l" and call it zero, otherwise check for the prefix and call it one).
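That simple approach can be sketched in a few lines (the function name and the
fallback of None for unrecognized patterns are my own choices, not something
from your spec):

```python
def classify(items):
    """Return 0 if the black sheep 'l' is present, 1 for the known
    prefix, and None for anything we have no rule for."""
    if 'l' in items:          # the exception wins first
        return 0
    if items[:3] == ['a', 'b', 'c']:  # the generic prefix match
        return 1
    return None               # unknown pattern

classify(['a', 'b', 'c', 'l'])  # -> 0
classify(['a', 'b', 'c', 'g'])  # -> 1
```

The key point is simply that the exception is tested before the generic rule.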

But if there are lots of potentially different black sheep, and/or other 
potential answers, this would get complex pretty quickly.  Ian Kelly noted 
using a trie structure, which allows searching by prefixes; that could make 
sense if your lists always differ at the end, as in your example, but might 
not make sense if the data is more random.
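For reference, a trie in Python can be as little as nested dicts, one level
per list element, with the value stored under a sentinel key at the end.  A
rough sketch (the helper names and the '$value' sentinel are just my own
illustration):

```python
def trie_insert(trie, keys, value):
    """Walk/create one dict level per key, store value at the end."""
    node = trie
    for k in keys:
        node = node.setdefault(k, {})
    node['$value'] = value

def trie_lookup(trie, keys):
    """Follow the keys down; None if the path or value is missing."""
    node = trie
    for k in keys:
        if k not in node:
            return None
        node = node[k]
    return node.get('$value')

trie = {}
trie_insert(trie, ['a', 'b', 'c', 'l'], 0)
trie_insert(trie, ['a', 'b', 'c', 'g'], 1)
```

Because entries share the common ['a','b','c'] path, the "reduction" you
describe falls out naturally: shared prefixes are stored only once.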

There are lots of searching, sorting, and organizing modules and data 
structures out there, but without a bit more info, I am not sure that any 
response is going to be better than any other.

Some things to consider:
- What is predictable and what is not... and how do you determine the value 
from the predictable parts?
    (your example was very predictable; hence it was very easy)
- How is the determination handled?  (Math?  String matching?)
    (if the method of determining it can be handled with math, you may be able 
to craft a formula to figure it out faster)
- How many of each object type are you looking at?
    (if you are comparing < 100 objects, sometimes brute force is much easier 
anyway; if > 1,000,000, using a database or other method may be required)

And some options to look at:
- a trie structure (as Ian noted)
    (good if the data varies based on how "deep" in the tree you are)
- regex search/match
    (good for more text-based comparisons than you seem to be suggesting)
- build a custom class that has the intelligence to determine the value for 
each object itself
    (good for more complex issues, but probably increases the size of each 
object)
- create an index or caching structure of some sort as you find the objects
    (good for saving computation time if the determination is "hard" and you 
hit the same ones again and again)
- create a dictionary structure instead of a list
    (good for fast lookups with known data)
- use something like pandas or another data analysis library (SciPy, NumPy, 
IPython Notebook)
    (especially good if your data is more numbers-based than you seem to be 
indicating)
- write the data to a database and use that for matching
    (good if you have LOTS of data to compare)
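To make the dictionary option concrete: since lists aren't hashable, you can
key a dict on tuples, store only the exceptions, and fall back to the generic
rule.  A rough sketch (the names and the None fallback are illustrative):

```python
# Only the black sheep need explicit entries; everything else
# falls through to the generic prefix rule.
exceptions = {('a', 'b', 'c', 'l'): 0}

def lookup(items):
    key = tuple(items)            # lists can't be dict keys; tuples can
    if key in exceptions:
        return exceptions[key]    # O(1) hit on the stored exception
    if key[:3] == ('a', 'b', 'c'):
        return 1                  # the generic match
    return None                   # no rule applies

lookup(['a', 'b', 'c', 'l'])  # -> 0
lookup(['a', 'b', 'c', 'p'])  # -> 1
```

This is essentially the reduced two-rule table from the original question,
with the exceptions dict doubling as the cache/index idea above.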

Some of these are probably overkill, some probably won't work at all for what 
you are trying to achieve.

Dan


-- 
https://mail.python.org/mailman/listinfo/python-list
