On 9/09/19 4:02 AM, A S wrote:
My problem is seemingly profound, but I hope to make it sound as simple as 
possible... Let me unpack the details:

...

These are the folders used for a better reference ( 
https://drive.google.com/open?id=1_LcceqcDhHnWW3Nrnwf5RkXPcnDfesq ). The files 
are found in the folder.


The link resulted in a 404 page (for me - but then I don't use Google). So, without any sample data...

> 1. I have one folder of Excel (.xlsx) files that serve as a data dictionary.
>
> -In Cell A1, the data source name is written in between brackets
>
> -In Cols C:D, it contains the data field names (it could be in either col C or D in my actual Excel sheet, so I had to search both columns)
>
> -*Important: I need to know which data source the field names come from
>
> 2. I have another folder of Text (.txt) files that I need to parse through to find these keywords.


Recommend you start with a set of test data/directories. For the first run, have one of each type of file, where the keywords correlate. Thus prove that the system works when you know it should.

Next, try the opposite, to ensure that it just as happily ignores files when it should.

Then expand to having multiple records, so that you can see what happens when some files correlate, and some don't.

i.e. take a large problem and break it down into smaller units. This is a "top-down" method.


An alternative design approach (which works very well in Python - see also "PyTest") is to embrace the principles of TDD (Test-Driven Development). This is a process that builds 'from the ground, up'. We design a small part of the process - let's call it a function/method. First we code some test data *and* the expected answer, e.g. if one input is 1 and another is 2, is their sum 3? (Running such a test at this stage will fail - badly!) Then we write some code - and keep perfecting it until it passes the test.
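A minimal sketch of that first TDD step (the function name `add` is illustrative, not from the original problem):

```python
# test_addition.py -- run with: pytest test_addition.py

def add(a, b):
    # The implementation: written *after* the test, and refined
    # until the test passes.
    return a + b

def test_add():
    # Test data *and* the expected answer, coded first.
    # Before add() exists, this test fails - which is the point!
    assert add(1, 2) == 3
```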

Repeat, stage-by-stage, to build the complete program. Meanwhile, every change you make to the code should be tested not just against 'its own' test, but against all of the tests relating to the other, smaller units of the whole. In this way, 'new code' can be shown to break (or, hopefully, not break) previously implemented, tested, and 'proven' code!

Notice how you have broken-down the larger problem in the description (points 1 to 5, above)! Design the tests similarly, to *only* test one small piece of the puzzle. Often you will have to 'fake' or "mock" data-inputs to the process, particularly if the code to produce that unit's input has yet to be written; regardless, 'mock data' is thoroughly controlled and thus produces (more) predictable results. Plus, it's much easier to spot errors and omissions when you don't have to wade through a mass of print-outs that (attempt to) cover *everything*! (IMHO)

Plus, when a problem is well-confined, there's less example code and data to insert into list questions, and the responses will be equally focused!


Referring back to the question: it seems that the issue is either that the keywords are not being (correctly) picked out of the sets of files (easy tests - for *only* those small sections of the code!), or that the logic linking the keywords is faulty (another *small* test, easily coded, and at first fed with 'fake' keywords which cover the various test cases and thus, when run, (attempt to) prove your logic and code!)
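The second of those small tests might be sketched like this (again, every name is illustrative - the point is that the linking logic is exercised with 'fake' keywords, entirely separate from any file-reading code):

```python
# Sketch: which data-source's field names appear in a given text blob?

def find_matches(fields_by_source, text):
    """Return {source: [field names found in text]}, omitting sources
    with no matches."""
    matches = {}
    for source, fields in fields_by_source.items():
        found = [f for f in fields if f in text]
        if found:
            matches[source] = found
    return matches

def test_find_matches():
    # Fake keywords covering the interesting cases: one partial match,
    # one full miss within a source, one single-field source.
    fields = {"SourceA": ["alpha", "beta"], "SourceB": ["gamma"]}
    sample_text = "this file mentions alpha and gamma only"
    assert find_matches(fields, sample_text) == {
        "SourceA": ["alpha"],
        "SourceB": ["gamma"],
    }
```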


--
Regards =dn