Hello. I am working on a project where one system (System A) contains seven text fields (unstructured data for comments). I have concatenated all of the fields into a single field.
There is a second system (System B) containing two unstructured fields that capture text comments. I have concatenated these fields into a single field, just as I did for the first system. System B contains highly sensitive, restricted data.

The issue I'm trying to solve is that no text data from System B (sensitive narratives, investigative IDs, etc.) should appear in System A. In essence, I am trying to find the following three items in the System A comments:

1) Direct references to investigations (e.g. "Investigation number ABC123")
2) Language that refers to an investigation (e.g. "Jane Doe is under investigation")
3) Actual cut-and-paste segments, where someone copied text verbatim from System B into the System A commentary fields

It seems as though I may need different text similarity techniques (comparing System A text against System B text) or search techniques for one or more of the three items. I was thinking that cosine similarity might be useful, but I thought I would solicit some advice, as I'm new to text analytics in Python.

Thank you in advance.

Kenneth R Adams
Compliance Technology and Analytics
TAS - Text Analytics as a Service
Wells Fargo & Co. | 401 South Tryon Street, Twenty-sixth Floor | Charlotte, NC 28202
MAC: D1050-262
Cell: 704-408.5157
kenneth.r.ad...@wellsfargo.com
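
P.S. To make the three items concrete, here is a rough sketch of the kind of thing I have in mind, using only the standard library. Everything in it is an assumption for illustration: the ID pattern (three letters followed by three digits), the phrase list, the five-word window, the sample texts, and all function names are hypothetical, not what the real systems use. It treats item 1 as a regular-expression search, item 2 as a phrase search, and item 3 as overlap between word n-grams ("shingles") of the two texts, with a plain bag-of-words cosine similarity alongside for comparison.

```python
import math
import re
from collections import Counter

# Item 1: hypothetical ID format (three letters + three digits, e.g. "ABC123").
INVESTIGATION_ID = re.compile(
    r"\binvestigation\s+(?:number|no\.?|#)\s*([A-Z]{3}\d{3})\b", re.IGNORECASE
)

# Item 2: hypothetical, deliberately short phrase list.
INVESTIGATION_PHRASES = re.compile(
    r"\b(?:under|subject\s+of|pending)\s+(?:an\s+)?investigation\b", re.IGNORECASE
)


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_direct_references(text_a):
    """Return the investigation IDs mentioned in a System A comment."""
    return [m.group(1) for m in INVESTIGATION_ID.finditer(text_a)]


def find_investigation_language(text_a):
    """Return phrases in a System A comment that talk about an investigation."""
    return [m.group(0) for m in INVESTIGATION_PHRASES.finditer(text_a)]


def shingles(text, n=5):
    """Return the set of all runs of n consecutive words in the text."""
    words = tokens(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_shingles(text_a, text_b, n=5):
    """Item 3: word n-grams that occur in both texts (likely copy-and-paste)."""
    return shingles(text_a, n) & shingles(text_b, n)


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts (0.0 if either is empty)."""
    ca, cb = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up sample texts, not real data.
system_a = (
    "Spoke with client. Jane Doe is under investigation, see Investigation "
    "number ABC123. The account was flagged after repeated wire transfers "
    "to an unrelated third party."
)
system_b = (
    "Case notes: the account was flagged after repeated wire transfers "
    "to an unrelated third party overseas."
)

print(find_direct_references(system_a))
print(find_investigation_language(system_a))
print(sorted(copied_shingles(system_a, system_b)))
print(round(cosine_similarity(system_a, system_b), 3))
```

My worry with relying on cosine similarity alone is that it yields one score for the whole pair of texts, so a short pasted passage inside a long comment might not move the score much, whereas the shingle overlap points at the shared words themselves. I would welcome corrections if this is the wrong way to think about it.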