Hi, I took a look, here are several suggestions for improvement: - Masking is not a main functionality of pg_dump and it is better to write most of the connected things in a separate file like parallel.c or dumputils.c. This will help slow down the growth of an already huge pg_dump file.
- Also it can be hard to use a lot of different functions for different fields, maybe it would be better to set up functions in a file. - How will it work for the same field and tables in the different schemas? Can we set up the exact schema for the field? - misspelling in a word >/* >* Add all columns and funcions to list of MaskColumnInfo structures, >*/ - Why did you use 256 here? > char* table = (char*) pg_malloc(256 * sizeof(char)); Also for malloc you need malloc on 1 symbol more because you have to store '\0' symbol. - Instead of addFuncToDatabase you can run your query using something already defined from fe_utils/query_utils.c. And It will be better to set up a connection only once and create all functions. Establishing a connection is a resource-intensive procedure. There are a lot of magic numbers, better to leave some comments explaining why there are 64 or 512. - It seems that you are not using temp_string > char *temp_string = (char*)malloc(256 * sizeof(char)); - Grammar issues >/* >* mask_column_info_list contains info about every to-be-masked column: >* its name, a name its table (if nothing is specified - mask all columns with this name), >* name of masking function and name of schema containing this function (public if not specified) >*/ the name of its table пн, 3 окт. 2022 г. в 20:45, Julien Rouhaud <rjuju...@gmail.com>: > Hi, > > On Mon, Oct 03, 2022 at 06:30:17PM +0300, Олег Целебровский wrote: > > > > Hello, here's my take on masking data when using pg_dump > > > > The main idea is using PostgreSQL functions to replace data during a > SELECT. > > When table data is dumped SELECT a,b,c,d ... from ... query is > generated, the columns that are marked for masking are replaced with result > of functions on those columns > > Example: columns name, count are to be masked, so the query will look as > such: SELECT id, mask_text(name), mask_int(count), date from ... > > > > So about the interface: I added 2 more command-line options: > > > > --mask-columns, which specifies what columns from what tables will be > masked > > usage example: > > --mask-columns "t1.name, t2.description" - both columns > will be masked with the same corresponding function > > or --mask-columns name - ALL columns with name "name" from > all dumped tables will be masked with correspoding function > > > > --mask-function, which specifies what functions will mask data > > usage example: > > --mask-function mask_int - corresponding columns will be > masked with function named "mask_int" from default schema (public) > > or --mask-function my_schema.mask_varchar - same as above > but with specified schema where the function is stored > > or --mask-function somedir/filename - the function is > "defined" here - more on the structure below > > FTR I wrote an extension POC [1] last weekend that does that but on the > backend > side. The main advantage is that it's working with any existing versions > of > pg_dump (or any client relying on COPY or even plain interactive SQL > statements), and that the DBA can force a dedicated role to only get a > masked > dump, even if they forgot to ask for it. > > I only had a quick look at your patch but it seems that you left some todo > in > russian, which isn't helpful at least to me. > > [1] https://github.com/rjuju/pg_anonymize > > >