On 03May2019 22:07, Sean Murphy <mhysnm1...@gmail.com> wrote:
I have a list of strings which has been downloaded from my bank. I am trying
to build a program to find the unique string patterns which I want to use
with a dictionary. So I can group the different transactions together. Below
are example unique strings which I have manually extracted from the data.
Everything after the example text is different. I cannot show the full data
due to privacy.

WITHDRAWAL AT HANDYBANK
PAYMENT BY AUTHORITY
WITHDRAWAL BY EFTPOS
WITHDRAWAL MOBILE
DEPOSIT          ACCESSPAY

Note: Some of the entries have a store name contained in the string
towards the end. For example:

WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09

Thus I want to extract the KMART as part of the unique key. As the shown
example transaction always has a number. I was going to use a test condition
for the above to test for the number. Then the next word would be added to
the string for the key.
[...]

I'm assuming you're handed the text as one string, for example this:

 WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09

which I'm assuming is a single column from a CSV of transactions.

I've got 2 observations:

1: For your unique key, if it is a string (it needn't be), you just need to put the relevant parts into your key. For the above, perhaps that might be:

 WITHDRAWAL 0304479 KMART

or:

 WITHDRAWAL KMART 1075

etc depending on what the relevant parts are.

2: To pull out the relevant words from the description I would be inclined to do a more structured parse. Consider something like the following (untested):

 # example data
 desc = 'WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09'
 # various things which may be recognised
 method = None
 terminal = None
 vendor = None
 vendor_site = None
 # inspect the description
 words = desc.split()
 flavour = words.pop(0)    # "WITHDRAWAL" etc
 word0 = words.pop(0)
 if word0 in ('BY', 'AT'):
   method = word0 + ' ' + words.pop(0)   # "BY EFTPOS"
 elif word0 in ('MOBILE', 'ACCESSPAY'):
   method = word0
 word0 = words.pop(0)
 if word0.isdigit():
   # probably really part of the "BY EFTPOS" description
   terminal = word0
   word0 = words.pop(0)
 vendor = word0
 word0 = words.pop(0)
 if word0.isdigit():
   vendor_site = word0
   word0 = words.pop(0)
 # ... more dissection ...
 # assemble the key - include only the items that matter
 # eg maybe leave out terminal and vendor_site, etc
 key = (flavour, method, terminal, vendor, vendor_site)

This is all rather open ended, and totally dependent on your bank's reporting habits. Also, it needs some length checks: words.pop(0) will raise an exception when "words" is empty, as it will be for the shorter descriptions at some point.
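
As a rough sketch of those length checks (also untested against real bank data): the helper names parse_description and next_word below are my own invention, and the rule for which words form the method is only a guess based on the sample descriptions above. Unlike the code above, it only advances past the second word when it was recognised as part of the method.

```python
def parse_description(desc):
    """Split a description into (flavour, method, terminal, vendor, vendor_site).

    Any field that cannot be found is returned as None.
    """
    words = desc.split()

    def next_word():
        # Return the next word, or None once the description is exhausted,
        # instead of letting words.pop(0) raise IndexError.
        return words.pop(0) if words else None

    method = terminal = vendor = vendor_site = None
    flavour = next_word()              # "WITHDRAWAL" etc
    word0 = next_word()
    if word0 in ('BY', 'AT'):
        following = next_word()
        method = word0 if following is None else word0 + ' ' + following
        word0 = next_word()
    elif word0 in ('MOBILE', 'ACCESSPAY'):
        method = word0
        word0 = next_word()
    if word0 is not None and word0.isdigit():
        # probably really part of the "BY EFTPOS" description
        terminal = word0
        word0 = next_word()
    vendor = word0
    word0 = next_word()
    if word0 is not None and word0.isdigit():
        vendor_site = word0
    return flavour, method, terminal, vendor, vendor_site


long_key = parse_description(
    'WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09')
short_key = parse_description('WITHDRAWAL MOBILE')
```

With this shape the shorter descriptions simply produce None for the fields they lack, so every description yields a tuple of the same length.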

The important point is to get a structured key containing just the relevant field values. Because the key is a tuple assembled from strings (immutable, hashable Python values), it is usable as a dictionary key.

For more ease of use you can make the key a namedtuple:

 from collections import defaultdict, namedtuple
 ........
 KeyType = namedtuple('KeyType', 'flavour method vendor')
 transactions = defaultdict(list)
 ........ loop over the CSV data ...
   key = KeyType(flavour, method, vendor)
   transactions[key].append(transaction info here...)

which gets you a dictionary "transactions" containing lists of transaction records (in whatever form you make them, which might be simply the row from the CSV data as a first cut).

The nice thing about a namedtuple is that the values are available as attributes: you can use "key.flavour" etc to inspect the tuple.
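
To make that concrete, here is a small self-contained sketch of the grouping. The rows are assumed to be already dissected into (flavour, method, vendor, description); the second KMART row is made up purely so that one group has two members, and the name "rows" is hypothetical.

```python
from collections import defaultdict, namedtuple

KeyType = namedtuple('KeyType', 'flavour method vendor')

# Hypothetical pre-parsed rows: (flavour, method, vendor, original description).
rows = [
    ('WITHDRAWAL', 'BY EFTPOS', 'KMART',
     'WITHDRAWAL BY EFTPOS 0304479 KMART 1075       CASTLE HILL 24/09'),
    # Made-up second row, only here to show two records sharing a key.
    ('WITHDRAWAL', 'BY EFTPOS', 'KMART',
     'WITHDRAWAL BY EFTPOS 0000000 KMART 1075       CASTLE HILL 25/09'),
    ('WITHDRAWAL', 'MOBILE', None, 'WITHDRAWAL MOBILE'),
]

transactions = defaultdict(list)
for flavour, method, vendor, desc in rows:
    key = KeyType(flavour, method, vendor)
    transactions[key].append(desc)

# The key fields are available by name when reporting on each group.
for key, group in transactions.items():
    print(key.flavour, key.method, key.vendor, len(group))
```

A namedtuple compares and hashes like the plain tuple with the same values, so it can be swapped in for the earlier tuple key without changing how the dictionary behaves.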

Cheers,
Cameron Simpson <c...@cskk.id.au>
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
