text fields and performance for ETL
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/9.6/datatype-character.html
Description:

Text field is a field that is intended for very big texts. Performance
within Postgres database is commented to be the same as for normal
varchar.

But performance in ETL processes related to such data type is decreased
dramatically, because any process that takes this kind of data needs to
calculate its size on a row level and cannot take bigger chunks of data
based on max size.

So the conclusion would be to use an appropriate character length that
reflects max string size.
Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 01:29:19PM +, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
>
> Page: https://www.postgresql.org/docs/9.6/datatype-character.html
> Description:
>
> Text field is a field that is intended for very big texts. Performance
> within Postgres database is commented to be the same as for normal
> varchar.
>
> But performance in ETL processes related to such data type is decreased
> dramatically, because any process that takes this kind of data needs to
> calculate its size on a row level and cannot take bigger chunks of data
> based on max size.
>
> So the conclusion would be to use an appropriate character length that
> reflects max string size.

I have no idea what you are saying above.  There is no additional
overhead for TEXT vs VARCHAR() in Postgres, so it seems like an overhead
in applications.

--
Bruce Momjian  https://momjian.us
EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
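[Editor's note: the "no additional overhead" claim can be checked on any
PostgreSQL instance with a sketch like the following; the table names and
row count are illustrative, not from the thread.]

```sql
-- Hypothetical quick check: identical data stored as text vs varchar(n).
CREATE TEMP TABLE t_text    (v text);
CREATE TEMP TABLE t_varchar (v varchar(100));

INSERT INTO t_text    SELECT md5(g::text) FROM generate_series(1, 1000000) g;
INSERT INTO t_varchar SELECT md5(g::text) FROM generate_series(1, 1000000) g;

-- On-disk size should be essentially identical: varchar(n) adds no
-- storage savings, only an insert-time length check.
SELECT pg_size_pretty(pg_total_relation_size('t_text')),
       pg_size_pretty(pg_total_relation_size('t_varchar'));

-- Scan timings (via psql \timing or EXPLAIN ANALYZE) should likewise
-- show no meaningful difference between the two column types.
EXPLAIN ANALYZE SELECT count(*) FROM t_text;
EXPLAIN ANALYZE SELECT count(*) FROM t_varchar;
```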
Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 6:55 AM PG Doc comments form wrote:

> But performance in ETL processes related to such data type is decreased
> dramatically, because any process that takes this kind of data needs to
> calculate its size on a row level and cannot take bigger chunks of data
> based on max size.

All of my ETL simply reads in the entire contents of a text field. There
is no chunking. The documentation assumes that the sizes involved here are
reasonable for such behavior. If you have a situation where you've chosen
to use varchar(n) and can defend that choice, more power to you. Those
special circumstances are not of particular interest here.

For the vast majority of users, they use varchar(n) because they (or more
likely their teachers) come from systems where it is required. The goal in
our docs is to point out that using an arbitrary length specification is
not required in PostgreSQL.

David J.
Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 8:35 AM Grega Jesih wrote:

> Dear David,
>
> > The goal in our docs is to point out that using an arbitrary length
> > specification is not required in PostgreSQL.
>
> Then it should say so. Because an inexperienced reader then uses this
> limited focus and generalizes.

It does (but maybe it could be improved, see the FAQ entry linked below
for more detail).

> It is for the very same reason that programmers don't perceive the need
> to limit the string size to its realistic size that ETL processes are
> slowed down.

Given the number of people I find agreeing with "just use text" and the
general lack of people making good arguments for using varchar(n), I'm
inclined to believe the status quo best reflects the majority of usage in
the wild.

> For example a currency where 3-char encoding is used: the field should
> be char(3) and not text.

The char data type has its own problems (see the other FAQ entry linked
below).

> > The documentation assumes that the sizes involved here are reasonable
> > for such behavior
>
> On the contrary. When you say "performance is the same..." you make the
> wrong impression that it is a universal case. But, in fact, it depends.

As far as PostgreSQL itself is concerned there is no "it depends". That is
all we are claiming here. And, frankly, the generalization is correct in
probably 90% of situations. Maybe there is room for improvement; in
documentation that is usually the case. Do you have a concrete suggestion
to offer?

> When you include such a table in some dataflow from server A to server
> B, this process will get slowed down. I explained why. It is important
> to teach people to update the model to use logical values that make
> sense for such cases.

Teaching data modelling techniques isn't really a goal for our
documentation. We aim to inform how PostgreSQL works.

> IF you teach instead "ah, it is not important, you may use text", then
> you actually ignore the part of the IT team that provides the usage of
> this data.

In most cases (how wide should a name field be) there is no good length to
choose. For the currency abbreviation example I would add a
"check length(currency_name) = 3" constraint alongside a unique constraint
on the lookup table - but the data type would still just be text. I can
also enforce all uppercase and letters only for the symbol in the formal
constraint, while char(3) will happily allow a value of "u6D".

In either case the actual performance of processing that text field (input
and output) would be the same in PostgreSQL. So if servers A and B are
both PostgreSQL you are simply incorrect. If they are not, then the
example is largely out-of-scope for our documentation.

> But there are pro tools (I refer to SSIS in this context) that provide a
> very fast dataflow in case there is a known record size.

varchar(n) says nothing about the minimum size allowed, which means it
does nothing to help establish a "known record size". For that you need,
and have, actual constraints.

If anything, the documentation lacks in making these points clear, by the
very evidence of two FAQ entries covering the topic.

https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_varchar.28n.29_by_default
https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_char.28n.29

David J.
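[Editor's note: the lookup table David describes can be sketched as below;
the table and column names are illustrative, not something the thread
prescribes.]

```sql
-- text plus explicit constraints enforces more than char(3) can.
CREATE TABLE currency (
    code text PRIMARY KEY,            -- unique constraint via the PK
    CHECK (length(code) = 3),         -- exactly three characters
    CHECK (code ~ '^[A-Z]{3}$')       -- uppercase letters only
);

INSERT INTO currency VALUES ('USD');  -- accepted
INSERT INTO currency VALUES ('u6D');  -- rejected by the regex check,
                                      -- while char(3) would accept it
```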
RE: text fields and performance for ETL
Dear David,

> The goal in our docs is to point out that using an arbitrary length
> specification is not required in PostgreSQL.

Then it should say so. Because an inexperienced reader then uses this
limited focus and generalizes.

It is for the very same reason that programmers don't perceive the need to
limit the string size to its realistic size that ETL processes are slowed
down. For example a currency where 3-char encoding is used: the field
should be char(3) and not text.

> The documentation assumes that the sizes involved here are reasonable
> for such behavior

On the contrary. When you say "performance is the same..." you make the
wrong impression that it is a universal case. But, in fact, it depends.
When you include such a table in some dataflow from server A to server B,
this process will get slowed down. I explained why.

It is important to teach people to update the model to use logical values
that make sense for such cases. IF you teach instead "ah, it is not
important, you may use text", then you actually ignore the part of the IT
team that provides the usage of this data. With this you don't contribute.
You create a problem.

> All of my ETL simply reads

Well, perhaps yours. But there are pro tools (I refer to SSIS in this
context) that provide a very fast dataflow in case there is a known record
size. In this case you go 100x faster as a rule of thumb. When dealing
with billions of records, it makes a biiig difference.

Best regards
Grega

PS I work at actual-it.si and gmail.com mail is forwarded to me. So I took
a shortcut here.

NOTICE - NOT TO BE REMOVED. This e-mail and any attachments are
confidential and may contain legally privileged information and/or
copyright material of Actual I.T. or third parties. If you are not an
authorised recipient of this e-mail, please contact Actual I.T.
immediately by return email or by telephone or facsimile on the above
numbers. You should not read, print, re-transmit, store or act in reliance
on this email or any attachments and you should destroy all copies of
them.
Re: text fields and performance for ETL
Grega Jesih writes:
> > The goal in our docs is to point out that using an arbitrary length
> > specification is not required in PostgreSQL.
>
> Well, perhaps yours. But there are pro tools (I refer to SSIS in this
> context) that provide a very fast dataflow in case there is a known
> record size.

That's a matter for those tools to document, not us, because it's their
performance characteristics that you are talking about, not ours.

I'll just point out in passing, though, that "fixed record size" for text
strings is an illusion as soon as you have to deal with non-ASCII data.
So I'm skeptical that such optimizations are worth much anymore.

			regards, tom lane
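[Editor's note: Tom's point about non-ASCII data can be illustrated
directly in SQL. In a UTF-8 database, a fixed character count is not a
fixed byte count; the sample values here are illustrative.]

```sql
-- Both values below fit a char(3) column, yet their encoded sizes
-- differ, so a tool cannot derive a fixed record size from the
-- declared length alone.
SELECT length('EUR') AS chars, octet_length('EUR') AS bytes;  -- 3 chars, 3 bytes
SELECT length('€3x') AS chars, octet_length('€3x') AS bytes;  -- 3 chars, 5 bytes
                                                              -- ('€' is 3 bytes in UTF-8)
```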
Re: text fields and performance for ETL
Dear Tom and David,

> I'll just point out in passing, though, that "fixed record size" for
> text strings is an illusion as soon as you have to deal with non-ASCII
> data. So I'm skeptical that such optimizations are worth much anymore.

It doesn't matter whether it is ASCII or not. A string is a string. UTF-8
is also one, just differently coded.

It matters a lot. It means time saving. Plenty of time. So we're talking
performance. Not Postgres performance, interface performance. The new
architectures include more and more data exchange among databases. Now
when you deal with bigger data sizes that go from millions to billions,
this fixed size vs. the undefined size of text becomes very relevant.

But what I am trying to achieve is that you describe both a view local to
the database itself and a broader view, integration.

There is also the inner aspect of a database model where for a currency of
true size 3 you choose char(3) instead of text. You prevent a dummy insert
on the database level, so no text of length > 3 may enter this field.

BR
Grega
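[Editor's note: the char(3) behavior assumed above is weaker than it
sounds; a short sketch of what char(3) actually accepts, with illustrative
values, follows.]

```sql
CREATE TEMP TABLE c (v char(3));

INSERT INTO c VALUES ('u6D');   -- accepted: char(3) checks only length,
                                -- not that the value looks like a currency
INSERT INTO c VALUES ('EU');    -- also accepted, silently blank-padded
                                -- to width 3 rather than rejected

-- Only an explicit CHECK constraint (on text or on char) can restrict
-- the value to, say, exactly three uppercase letters.
```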
Re: text fields and performance for ETL
On Wed, Nov 3, 2021 at 11:09 AM Grega Jesih wrote:

> The new architectures include more and more data exchange among
> databases. Now when you deal with bigger data sizes that go from
> millions to billions, this fixed size vs. the undefined size of text
> becomes very relevant.

Can you demonstrate, with actual numbers, using today's implementation, a
situation where defining a column as char(3) or varchar(3) instead of text
has a significant performance improvement? Without a concrete example to
examine I'm unable to be convinced to move away from the status quo. You
also need to convince me as to why constraints are an insufficient
feature, i.e., why is char(3) better than (check length(val) = 3)?

Even with all that I'd probably still not do anything beyond reviewing a
proposed patch (i.e., I wouldn't try to write one myself from scratch... I
don't have authority to commit regardless).

David J.
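[Editor's note: a minimal harness for the demonstration David is asking
for might look like the following; table names and row counts are
illustrative. Run in psql with \timing enabled and compare the three
scans.]

```sql
CREATE TEMP TABLE d_text    (code text CHECK (length(code) = 3));
CREATE TEMP TABLE d_varchar (code varchar(3));
CREATE TEMP TABLE d_char    (code char(3));

INSERT INTO d_text    SELECT 'USD' FROM generate_series(1, 1000000);
INSERT INTO d_varchar SELECT 'USD' FROM generate_series(1, 1000000);
INSERT INTO d_char    SELECT 'USD' FROM generate_series(1, 1000000);

-- Reading every value back approximates the extract step of an ETL run;
-- if the documentation's claim holds, the timings should be
-- indistinguishable across the three declarations.
SELECT count(code) FROM d_text;
SELECT count(code) FROM d_varchar;
SELECT count(code) FROM d_char;
```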
Re: text fields and performance for ETL
Grega Jesih writes:
> It matters a lot. It means time saving. Plenty of time. So we're talking
> performance. Not Postgres performance, interface performance.

One more time: our docs are here to explain Postgres performance. It is
very easy to show that char/varchar are strictly worse than text so far as
Postgres is concerned. Other tools need to document their own performance
considerations in their own docs.

Now, if you have a *semantic* consideration, like "a state abbreviation
should be exactly two characters", those datatypes might help you with
enforcing that. But that's not a performance consideration.

			regards, tom lane
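[Editor's note: Tom's semantic point can also be expressed without
char(n), using a domain; this is an editorial sketch, not something the
thread prescribes, and the names are illustrative.]

```sql
-- Encode "exactly two uppercase letters" as a named, reusable rule.
CREATE DOMAIN state_abbr AS text CHECK (VALUE ~ '^[A-Z]{2}$');

CREATE TABLE address (
    state state_abbr   -- 'AZ' is valid; 'A', 'AZX', and 'a2' are rejected
);
```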