Re: Bulk inserts into two (related) tables

Jeremy Finzel Tue, 21 May 2019 14:58:58 -0700

On Tue, May 21, 2019 at 12:24 PM Rich Shepard <[email protected]>
wrote:

> On Tue, 21 May 2019, Francisco Olarte wrote:
>
> > From how you say it, I assume you have some data in your original
> > dumps which can relate boths, lets assume it's org_name, but may be an
> > org-code. If you do not have it it means you cannot match people to
> > orgs in your data, all is lost.
>
> Francisco,
>
> Not yet with these new data.
>
> I'll manually insert the org_id numbers from the organizations table into
> the people table.
>

To me, this is the key to your problem what will either make this a time
saver or time waster.  Somehow you are accounting for what uniquely
identifies organizations, right?

Say there are 5 fields that correspond to an organization.  I assume then
you are creating only one new org_id for each unique combination of these
fields?

Then take Francisco's suggestion, only use an md5 of the organization
fields to create yourself a unique identifier.  Then you can use ctid
(unique internal identifier for each row) to join back.  You use SQL like
this:

SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM table;

Assume in example below that your unique "org" rows are the first 3 fields:

1. Load your data file into a loading table like so:
CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text,
person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;

Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');

2. Load org table:
test=# CREATE TABLE organizations (org_id serial primary key, org_name
text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4

3. Build mapping directly and update:
-- build hash of org fields in loader table, take ctid in order to map back
later
WITH map_source AS (
SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS
hash_identifier
FROM loader)

-- build hash of org fields in organizations table to join back to loader
and bring in org_id of course
, map_org AS (
SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS
hash_identifier
FROM organizations)

-- map by joining together on hash_identifier
, final_map AS (
SELECT org_id, ctid
FROM map_source l
INNER JOIN map_org o USING (hash_identifier)
)

-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;

Final data ready for the person table to be populated:
test=# table organizations;
 org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
      1 | a        |             | b
      2 | a        |             |
      3 | a        |             | c
      4 | b        |             | c
(4 rows)

test=# table loader;
 org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
 a        |             |             | Jerry       |      2 |
 a        |             | b           | Bob         |      1 |
 a        |             | b           | Janice      |      1 |
 a        |             | c           | Chris       |      3 |
 b        |             | c           | Jason       |      4 |
 a        |             |             | Alice       |      2 |
(6 rows)

Hope this helps!
Thanks,
Jeremy

Re: Bulk inserts into two (related) tables

Reply via email to