text fields and performance for ETL

2021-11-03 Thread PG Doc comments form
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/9.6/datatype-character.html
Description:

A text field is a field intended for very large texts.
Performance within a Postgres database is described as being the same as for
normal varchar.

But performance in ETL processes involving this data type decreases
dramatically, because any process that consumes this kind of data needs to
calculate its size at the row level and cannot take bigger chunks of data
based on a maximum size.

So the conclusion would be to use an appropriate character length that
reflects the maximum string size.


Re: text fields and performance for ETL

2021-11-03 Thread Bruce Momjian
On Wed, Nov  3, 2021 at 01:29:19PM +, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/9.6/datatype-character.html
> Description:
> 
> Text field is a field that is intended for very big texts. 
> Performance within Postgres database is commented to be the same as for
> normal varchar. 
> 
> But performance in ETL processes related to such data type is decreased
> dramatically, because any process that takes this kind of data needs to
> calculate its size on a row level and cannot take bigger chunks of data
> based on max size. 
> 
> So the conclusion would be to use an appropriate character length that
> reflects max string size.

I have no idea what you are saying above.  There is no additional
overhead for TEXT vs. VARCHAR() in Postgres, so it seems like overhead
in the applications.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: text fields and performance for ETL

2021-11-03 Thread David G. Johnston
On Wed, Nov 3, 2021 at 6:55 AM PG Doc comments form wrote:

> But performance in ETL processes related to such data type is decreased
> dramatically, because any process that takes this kind of data needs to
> calculate its size on a row level and cannot take bigger chunks of data
> based on max size.
>

All of my ETL simply reads in the entire contents of a text field.  There
is no chunking.  The documentation assumes that the sizes involved here are
reasonable for such behavior.  If you have a situation where you've chosen
to use varchar(n) and can defend that choice, more power to you.  Those
special circumstances are not of particular interest here.  The vast
majority of users use varchar(n) because they (or more likely their
teachers) come from systems where it is required.  The goal in our docs is
to point out that using an arbitrary length specification is not required
in PostgreSQL.

David J.


Re: text fields and performance for ETL

2021-11-03 Thread David G. Johnston
On Wed, Nov 3, 2021 at 8:35 AM Grega Jesih  wrote:

> Dear David,
>
> > The goal in our docs is to point out that using an arbitrary length
> specification is not required in PostgreSQL.
>
> Then it should say so, because an inexperienced reader takes this limited
> focus and generalizes.
>

It does (but maybe it could be improved, see the FAQ entry linked below for
more detail).

>
>
> It is for the very same reason that programmers don't perceive the need
> to limit the string size to its realistic size that ETL processes are
> slowed down.
>

Given the number of people I find agreeing with "just use text", and the
general lack of people making good arguments for using varchar(n), I'm
inclined to believe the status quo best reflects the majority of usage in
the wild.

> For example a currency where 3-char encoding is used, the field should be
> char(3) and not text.
>

The char data type has its own problems (see the other FAQ entry linked
below).
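
For readers following along, the padding behavior behind that FAQ entry can be
sketched in a few lines (a sketch only, assuming a stock PostgreSQL session;
behavior per the manual's bpchar semantics):

```sql
-- char(n) pads with spaces, then ignores them in comparisons and length():
SELECT 'a'::char(3) = 'a  '::char(3);  -- true: trailing pad spaces are insignificant
SELECT length('a'::char(3));           -- 1: pad spaces are not counted
SELECT 'a'::char(3) || 'b';            -- 'ab': padding is dropped on cast to text
```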

> > The documentation assumes that the sizes involved here are reasonable
> for such behavior
>
> On the contrary. When you say "performance is the same..." you give the
> wrong impression that it is a universal case. But, in fact, it depends.
>

As far as PostgreSQL itself is concerned there is no "it depends".  That is
all we are claiming here.  And, frankly, the generalization is correct in
probably 90% of situations.  Maybe there is room for improvement; in
documentation that is usually the case.  Do you have a concrete suggestion
to offer?


> When you include such table into some dataflow from server A to server B,
> this process will get slowed down. I explained why.
> It is important to teach people that update the model to use logical
> values that make sense for such cases.
>

Teaching data modelling techniques isn't really a goal for our
documentation.  We aim to explain how PostgreSQL works.


> If you teach instead "ah, it is not important, you may use text", then you
> actually ignore the part of the IT team that supports the use of this
> data.
>

In most cases (how wide should a name field be) there is no good length to
choose.  For the currency abbreviation example I would add a "check
length(currency_name) = 3" constraint alongside a unique constraint on the
lookup table - but the data type would still just be text.  I can also
enforce all uppercase and letters only for the symbol in the formal
constraint while the char(3) will happily allow a value of "u6D".  In
either case the actual performance of processing that text field (input and
output) would be the same in PostgreSQL.  So if servers A and B are both
PostgreSQL you are simply incorrect.  If they are not then the example is
largely out-of-scope for our documentation.
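
A minimal sketch of the constraint-based approach described above (table and
constraint names here are illustrative, not from any real schema):

```sql
CREATE TABLE currency (
    code text PRIMARY KEY,
    -- enforce exactly three uppercase ASCII letters
    CONSTRAINT currency_code_format CHECK (code ~ '^[A-Z]{3}$')
);

INSERT INTO currency VALUES ('USD');  -- accepted
INSERT INTO currency VALUES ('u6D');  -- rejected by currency_code_format
```

A char(3) column, by contrast, would accept 'u6D' without complaint.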

> But there are pro tools (I refer to SSIS in this context) that provide a
> very fast dataflow in case there is a known record size.
>

varchar(n) says nothing about the minimum size allowed, which means it does
nothing to help with a "known record size".  For that you need, and have,
actual constraints.

If anything, the very existence of two FAQ entries covering this topic is
evidence that the documentation could make these points clearer:

https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_varchar.28n.29_by_default
https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_char.28n.29

David J.


RE: text fields and performance for ETL

2021-11-03 Thread Grega Jesih
Dear David,
> The goal in our docs is to point out that using an arbitrary length 
> specification is not required in PostgreSQL.
Then it should say so, because an inexperienced reader takes this limited
focus and generalizes.

It is for the very same reason that programmers don't perceive the need to
limit the string size to its realistic size that ETL processes are slowed
down.

For example a currency where 3-char encoding is used, the field should be 
char(3) and not text.

> The documentation assumes that the sizes involved here are reasonable for 
> such behavior

On the contrary. When you say "performance is the same..." you give the wrong
impression that it is a universal case. But, in fact, it depends.

When you include such a table in some dataflow from server A to server B,
this process will be slowed down. I explained why.
It is important to teach the people who update the model to use logical
values that make sense for such cases.

If you teach instead "ah, it is not important, you may use text", then you
actually ignore the part of the IT team that supports the use of this data.

With this you don't contribute. You create a problem.

> All of my ETL simply reads

Well, perhaps yours. But there are pro tools (I refer to SSIS in this
context) that provide a very fast dataflow in case there is a known record
size. In this case you go 100x faster as a rule of thumb. When dealing with
billions of records, it makes a big difference.

Best regards
Grega

PS
I work at actual-it.si, and gmail.com mail is forwarded to me, so I took a
shortcut here.








Re: text fields and performance for ETL

2021-11-03 Thread Tom Lane
Grega Jesih  writes:
>> The goal in our docs is to point out that using an arbitrary length 
>> specification is not required in PostgreSQL.

> Well, perhaps yours. But there are pro tools (I refer to SSIS in this
> context) that provide a very fast dataflow in case there is a known
> record size.

That's a matter for those tools to document, not us, because it's their
performance characteristics that you are talking about, not ours.

I'll just point out in passing, though, that "fixed record size" for text
strings is an illusion as soon as you have to deal with non-ASCII data.
So I'm skeptical that such optimizations are worth much anymore.
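
A quick way to see the illusion described above, assuming a database with
UTF-8 server encoding:

```sql
-- character count and byte count diverge as soon as non-ASCII data appears:
SELECT char_length('café'), octet_length('café');  -- 4, 5 ('é' takes two bytes in UTF-8)
```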

regards, tom lane




Re: text fields and performance for ETL

2021-11-03 Thread Grega Jesih

Dear Tom and David,

>I'll just point out in passing, though, that "fixed record size" for text
>strings is an illusion as soon as you have to deal with non-ASCII data.
>So I'm skeptical that such optimizations are worth much anymore.

It doesn't matter whether it is ASCII or not.  A string is a string. UTF-8 is
also one, just encoded differently.

It matters a lot. It means time savings. Plenty of time. So we're talking
performance. Not Postgres performance, but interface performance.


The new architectures involve more and more data exchange among databases.
When you deal with bigger data sizes that go from millions to billions of
rows, this fixed size versus text's undefined size becomes very relevant.

But what I am trying to achieve is that you describe both the view local to
the database itself and the broader view, integration.

There is also the inner aspect of a database model where, for a currency of
true size 3, you choose char(3) instead of text.

You prevent a dummy insert at the database level, so no text of length > 3
may enter this field.


BR
Grega







Re: text fields and performance for ETL

2021-11-03 Thread David G. Johnston
On Wed, Nov 3, 2021 at 11:09 AM Grega Jesih wrote:

> The new architectures involve more and more data exchange among databases.
> When you deal with bigger data sizes that go from millions to billions of
> rows, this fixed size versus text's undefined size becomes very relevant.
>
>
Can you demonstrate, with actual numbers, using today's implementation, a
situation where defining a column as char(3) or varchar(3) instead of text
yields a significant performance improvement?  Without a concrete example to
examine I'm unable to be convinced to move away from the status quo.

You also need to convince me as to why constraints are an insufficient
feature, i.e., why is char(3) better than CHECK (length(val) = 3)?
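
For anyone wanting to attempt such a demonstration, here is a sketch of the
shape such a measurement could take (table and column names are made up, and
no timing results are claimed here):

```sql
CREATE TABLE t_char (v char(3));
CREATE TABLE t_text (v text CHECK (length(v) = 3));

-- load a million identical rows into each table
INSERT INTO t_char SELECT 'USD' FROM generate_series(1, 1000000);
INSERT INTO t_text SELECT 'USD' FROM generate_series(1, 1000000);

-- compare timings in psql with \timing enabled
\timing on
SELECT count(*) FROM t_char WHERE v = 'USD';
SELECT count(*) FROM t_text WHERE v = 'USD';
```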

Even with all that I'd probably still not do anything beyond reviewing a
proposed patch (i.e., I wouldn't try to write one myself from scratch... I
don't have the authority to commit regardless).

David J.


Re: text fields and performance for ETL

2021-11-03 Thread Tom Lane
Grega Jesih  writes:
> It matters a lot. It means time savings. Plenty of time. So we're talking
> performance. Not Postgres performance, but interface performance.

One more time: our docs are here to explain Postgres performance.
It is very easy to show that char/varchar are strictly worse than
text so far as Postgres is concerned.  Other tools need to document
their own performance considerations in their own docs.

Now, if you have a *semantic* consideration, like "a state abbreviation
should be exactly two characters", those datatypes might help you
with enforcing that.  But that's not a performance consideration.

regards, tom lane