OOPS, I forgot to mention in the SELECT generating the output file that the e'\t' generates a "tab" character. You likely already know this, but I like to make my posts as self contained as possible.
On Sun, Feb 15, 2015 at 10:00 AM, John McKown <john.archie.mck...@gmail.com> wrote: > On Sat, Feb 14, 2015 at 3:54 PM, Eugene Dzhurinsky <jdeve...@gmail.com> > wrote: > >> Hello! >> >> I have a huge dictionary table with series data generated by a third-party >> service. The table consists of 2 columns >> >> - id : serial, primary key >> - series : varchar, not null, indexed >> >> From time to time I need to apply a "patch" to the dictionary, the patch >> file >> consists of "series" data, one per line. >> >> Now I need to import the patch into the database, and produce another >> file as >> - if the passed "series" field exists in the database, then return >> ID:series >> - otherwise insert a new row to the table and generate new ID and return >> ID:series >> for each row in the source file. >> >> So the new file will contain both ID and series data, separated by tab or >> something. >> >> While reading and writing the data is not a question (I simply described >> the >> whole task just in case), I wonder what is the most efficient way of >> importing >> such a data into a table, keeping in mind that >> >> - the dictionary table already consists of ~200K records >> - the patch could be ~1-50K of records long >> - records could not be removed from the dictionary, only added if not >> exist >> >> Thanks! >> >> -- >> Eugene Dzhurinsky >> > > I was hoping that you'd get a good reply from someone else. But it is the > weekend. So I will _try_ to explain what I would try. You can see if it is > of any use to you. Sometimes my ideas are really good. And other times they > are, well let's be nice and say "not as good as other times" [grin/]. I > can't test the below because I don't have any data. But hopefully it will > be close and even of some help to you. > > The first thing that I would do is put the input patch data into its own > table. Perhaps a temporary table, or even a permanent one. > > DROP TABLE IF EXISTS patch_data; > CREATE TABLE patch_data ( > input_record_number SERIAL, > already_exists BOOLEAN DEFAULT FALSE; > id INTEGER, > series TEXT NOT NULL ); > > I don't know if the order of the records in output file need to match > the order of the records in the input file. If not, they you don't need the > "input_record_number" field. Otherwise, that is used to maintain the order > of the input. At this point, you can load the input file with an SQL > command similar to: > > COPY patch_data (series) FROM input-file.txt; > > Now update the path_data from the existing dictionary table to see which > "series" data already exists. > > UPDATE patch_data SET already_exists=((SELECT TRUE FROM dictionary WHERE > dictionary.series = patch_data.series)); > > At this point, the table patch_data has been updated such that if the > series data in it already exists, the "already_exists" column is now TRUE > instead of the initial FALSE. This means that we need to insert all the > series data in "patch_data" which does not exist in "dictionary" ( i.e. > "already_exists" is FALSE in "patch_data") into "dictionary". > > INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE > already_exists = FALSE; > > The above should insert the "series" records which don't exist in > "dictionary" into it and generate an "id" column for it from your SERIAL > definition. Now we need to find out the "id" values for each of the > "series" values and return them. > > UPDATE patch_data SET id=((SELECT id FROM dictionary WHERE > dictionary.series = patch_data.series)); > > At this point, if I have not messed up, every "series" value in > "patch_data" should have the proper "id" value from "dictionary". So the > table "patch_data" should now have all that we need in it. So we just use > it to make our output file. I don't know you're output requirements, but > I'd just do something like: > > SELECT id, e'\t',series FROM patch_data ORDER BY input_record_number; > > Again, if the output order does not need to be the same as the input > order, then all the stuff with the "input_record_number" column can be > eliminated. Just to be a good citizen: > > DROP TABLE patch_data; > > at this point. Unless, you think you might need it for some reason. Who > knows, you might even want to archive it as some sort of audit information. > > > -- > He's about as useful as a wax frying pan. > > 10 to the 12th power microphones = 1 Megaphone > > Maranatha! <>< > John McKown > -- He's about as useful as a wax frying pan. 10 to the 12th power microphones = 1 Megaphone Maranatha! <>< John McKown