On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshep...@appl-ecosys.com> wrote:
> On Tue, 21 May 2019, Francisco Olarte wrote: > > > From how you say it, I assume you have some data in your original > > dumps which can relate boths, lets assume it's org_name, but may be an > > org-code. If you do not have it it means you cannot match people to > > orgs in your data, all is lost. > > Francisco, > > Not yet with these new data. > > I'll manually insert the org_id numbers from the organizations table into > the people table. > To me, this is the key to your problem what will either make this a time saver or time waster. Somehow you are accounting for what uniquely identifies organizations, right? Say there are 5 fields that correspond to an organization. I assume then you are creating only one new org_id for each unique combination of these fields? Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier. Then you can use ctid (unique internal identifier for each row) to join back. You use SQL like this: SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier FROM table; Assume in example below that your unique "org" rows are the first 3 fields: 1. Load your data file into a loading table like so: CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text); \copy loader from 'my_data.csv' with csv header ALTER TABLE loader ADD COLUMN org_id INT; Example data: INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry'); INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob'); INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice'); INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris'); INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason'); INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice'); 2. Load org table: test=# CREATE TABLE organizations (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text); CREATE TABLE test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2) test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2 test-# FROM loader; INSERT 0 4 3. Build mapping directly and update: -- build hash of org fields in loader table, take ctid in order to map back later WITH map_source AS ( SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier FROM loader) -- build hash of org fields in organizations table to join back to loader and bring in org_id of course , map_org AS ( SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier FROM organizations) -- map by joining together on hash_identifier , final_map AS ( SELECT org_id, ctid FROM map_source l INNER JOIN map_org o USING (hash_identifier) ) -- Perform update UPDATE loader l SET org_id = fm.org_id FROM final_map fm WHERE fm.ctid = l.ctid; Final data ready for the person table to be populated: test=# table organizations; org_id | org_name | org_stuff_1 | org_stuff_2 --------+----------+-------------+------------- 1 | a | | b 2 | a | | 3 | a | | c 4 | b | | c (4 rows) test=# table loader; org_name | org_stuff_1 | org_stuff_2 | person_name | org_id ----------+-------------+-------------+-------------+-------- a | | | Jerry | 2 | a | | b | Bob | 1 | a | | b | Janice | 1 | a | | c | Chris | 3 | b | | c | Jason | 4 | a | | | Alice | 2 | (6 rows) Hope this helps! Thanks, Jeremy