Dear David,
> The goal in our docs is to point out that using an arbitrary length 
> specification is not required in PostgreSQL.
Then it should say so, because an inexperienced reader takes this limited
focus and generalizes from it.

ETL processes are slowed down for exactly this reason: programmers do not
perceive the need to limit a string to its realistic size.

For example, for a currency stored as a 3-character code, the field should be
char(3) and not text.
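
For illustration, a minimal sketch (the table and column names are hypothetical):

    -- ISO 4217 currency codes are always exactly three characters,
    -- so the type itself can carry that constraint.
    CREATE TABLE payment (
        id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        amount   numeric(12,2) NOT NULL,
        currency char(3) NOT NULL   -- e.g. 'EUR', 'USD'
    );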

> The documentation assumes that the sizes involved here are reasonable for 
> such behavior

On the contrary. When you say "performance is the same...", you create the
wrong impression that this is universally true. In fact, it depends.

When you include such a table in a dataflow from server A to server B, that
process slows down; I explained why.
It is important to teach people to update the model to use logical sizes that
make sense in such cases.

If you instead teach "ah, it is not important, you may use text", then you
actually ignore the part of the IT team that puts this data to use.

With this you don't contribute. You create a problem.

> All of my ETL simply reads

Well, perhaps yours does. But there are professional tools (I am referring to
SSIS in this context) that provide a very fast dataflow when the record size is
known. In that case you go roughly 100x faster as a rule of thumb, and when
dealing with billions of records, that makes a big difference.
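
To make this concrete, a sketch (again using the hypothetical payment table
from above): a tool like SSIS can read the declared maximum length from the
catalog to size fixed-width buffers, and for text that length is simply unknown:

    -- character_maximum_length is known for char(n)/varchar(n) but NULL for
    -- text, so only the former lets a tool preallocate a fixed record size.
    SELECT column_name, data_type, character_maximum_length
    FROM   information_schema.columns
    WHERE  table_name = 'payment';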

Best regards
Grega

PS
I work at actual-it.si, and gmail.com mail is forwarded to me, so I took a
shortcut here.

From: David G. Johnston <david.g.johns...@gmail.com>
Sent: Wednesday, November 3, 2021 3:38 PM
To: grega.je...@gmail.com; Pg Docs <pgsql-docs@lists.postgresql.org>
Subject: Re: text fields and performance for ETL

On Wed, Nov 3, 2021 at 6:55 AM PG Doc comments form
<nore...@postgresql.org> wrote:
But performance in ETL processes related to such a data type decreases
dramatically, because any process that consumes this kind of data needs to
calculate its size at the row level and cannot take bigger chunks of data
based on a maximum size.

All of my ETL simply reads in the entire contents of a text field.  There is no 
chunking.  The documentation assumes that the sizes involved here are 
reasonable for such behavior.  If you have a situation where you've chosen to
use varchar(n) and can defend that choice, more power to you.  Those special
circumstances are not of particular interest here.  The vast majority of
users use varchar(n) because they (or, more likely, their teachers) come
from systems where it is required.  The goal in our docs is to point out that
using an arbitrary length specification is not required in PostgreSQL.

David J.

