[BUGS] BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

2012-07-18 Thread exclusion
The following bug has been logged on the website:

Bug reference:  6742
Logged by:  Alexander LAW
Email address:  exclus...@gmail.com
PostgreSQL version: 9.1.4
Operating system:   Windows
Description:

When I try to dump a database with UTF-8 encoding on Windows, I get unreadable
object names.
Please look at the screenshot (http://oi50.tinypic.com/2lw6ipf.jpg). In the
left window all the pg_dump messages are displayed correctly (except for the
password prompt (bug #6510)), but the non-ASCII object name is gibberish. In
the right window (where the dump is done in the Windows-1251 encoding, the OS
encoding for the Russian locale) everything is right.

It seems that pg_dump doesn't do the necessary encoding conversion for the
object names.
For example, there is this code in pg_dump.c:

write_msg(NULL, "finding the columns and types of table \"%s\"\n",
          tbinfo->dobj.name);

or in pg_backup_archiver.c:

ahlog(AH, 1, "setting owner and privileges for %s %s\n",
      te->desc, te->tag);

And then it comes to the following function in dumputils.c:

void
vwrite_msg(const char *modulename, const char *fmt, va_list ap)
{
    ...
    vfprintf(stderr, _(fmt), ap);
}

So the format string goes through translation and encoding conversion (to
the current OS locale), but tbinfo->dobj.name (or te->tag) does not.
I think it would be appropriate to convert all the object names with some
function like dump_output_encoding_to_OS_encoding.
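
A minimal sketch of what such a conversion could look like on Windows,
assuming the dump encoding is UTF-8 and the console expects the current
ANSI code page; the helper name is hypothetical (not existing pg_dump
code) and error handling is omitted:

#include <windows.h>
#include <stdlib.h>

/* Hypothetical: re-encode a UTF-8 object name into the console's
 * ANSI code page, going through UTF-16. */
static char *
utf8_to_console(const char *utf8)
{
    int      wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    wchar_t *wbuf = malloc(wlen * sizeof(wchar_t));

    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, wlen);

    int   len = WideCharToMultiByte(CP_ACP, 0, wbuf, -1, NULL, 0, NULL, NULL);
    char *buf = malloc(len);

    WideCharToMultiByte(CP_ACP, 0, wbuf, -1, buf, len, NULL, NULL);
    free(wbuf);
    return buf;                 /* caller frees */
}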

Best regards,
Alexander




[BUGS] Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

2012-07-18 Thread Thomas Kellerer

exclus...@gmail.com, 18.07.2012 09:17:

> When I try to dump a database with UTF-8 encoding on Windows, I get
> unreadable object names.
> Please look at the screenshot (http://oi50.tinypic.com/2lw6ipf.jpg). [...]


Did you check the dump file using an editor that can handle UTF-8?
The Windows console is not known for properly handling that encoding.

Thomas






[BUGS] Re: BUG #6742: pg_dump doesn't convert encoding of DB object names to OS encoding

2012-07-18 Thread Alexander Law

Hello,

The dump file itself is correct. The issue is only with the non-ASCII
object names in pg_dump messages.
The message text (which is non-ASCII too) is displayed consistently in the
right encoding (i.e. the OS encoding, thanks to libintl/gettext), but the
encoding of DB object names depends on the dump encoding, so they become
unreadable when a different encoding is used.
The same can be reproduced on Linux (where the console encoding is UTF-8)
when dumping in Windows-1251 or Latin1 (for Western European languages).
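
For reference, a minimal way to trigger it on a UTF-8 Linux console (the
database name mydb is hypothetical; non-ASCII table names are assumed):

pg_dump --encoding=LATIN1 --verbose mydb > /dev/null

The dump itself goes to /dev/null; the --verbose messages on stderr carry
the object names in Latin1, which a UTF-8 terminal renders as garbage.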


Thanks,
Alexander


> Did you check the dump file using an editor that can handle UTF-8?
> The Windows console is not known for properly handling that encoding.






Re: [BUGS] main log encoding problem

2012-07-18 Thread Alexander Law

Hello!

May I propose a solution and step up?

I've read the discussion of bug #5800 and here are my two cents.
To make things clear, let me give an example.
Suppose I am a PostgreSQL hosting provider and I let my customers create
any databases they wish.
I have clients all over the world, so they can create databases with
different encodings.


The question is: what do I (as admin) want to see in my postgresql log,
which contains errors from all the databases?

IMHO we should consider two requirements for the log.
First, the file should be readable with a generic text viewer. Second, it
should be as useful and complete as possible.


Now I see the following solutions.
A. We have a different logfile for each database, in different encodings.
Then all our logs are readable, but we have to look at them one by one,
which is inconvenient to say the least.
Moreover, our log reader has to know which encoding to use for each file.


B. We have one logfile in the operating system encoding.
The first downside is that the logs would differ between OSes.
The second is that Windows has a non-Unicode system encoding, and such an
encoding can't represent all national characters. So at best I would get
??? in the log.


C. We have one logfile in UTF-8.
Pros: log messages from all our clients can fit in it, and we can use any
generic editor/viewer to open it.
Nothing changes for Linux (and other OSes with UTF-8 encoding).
Cons: all strings written to the log file have to go through some
conversion function.


I think the last solution is the right one. What is your opinion?
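
A rough sketch of what option C could mean inside the backend, assuming
the existing pg_do_encoding_conversion() from src/backend/utils/mb/ can be
used at logging time (this helper is hypothetical; error handling and
memory-context details are ignored):

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Hypothetical: re-encode a log message from the database encoding
 * to UTF-8 before it reaches the log file. */
static char *
log_text_to_utf8(const char *msg)
{
    int src = GetDatabaseEncoding();

    if (src == PG_UTF8)
        return (char *) msg;    /* already UTF-8: nothing to do */
    return (char *) pg_do_encoding_conversion((unsigned char *) msg,
                                              strlen(msg),
                                              src, PG_UTF8);
}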

In fact the problem exists even with a simple installation on Windows when
you use a non-English locale, so the solution would be useful for many of
us.

Best regards,
Alexander

P.S. sorry for the wrong subject in my previous message sent to 
pgsql-general



On 05/23/2012 09:15 AM, yi huang wrote:

I'm using postgresql 9.1.3 from debian squeeze-backports with the
zh_CN.UTF-8 locale, and I find that my main log (which is
"/var/log/postgresql/postgresql-9.1-main.log") contains "???", which
indicates some sort of charset encoding problem.


It's a known issue, I'm afraid. The PostgreSQL postmaster logs in the
system locale, and the PostgreSQL backends log in whatever encoding
their database is in. They all write to the same log file, producing a
log file full of mixed encoding data that'll choke many text editors.

If you force your editor to re-interpret the file according to the
encoding your database(s) are in, this may help.
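
If the file (or a per-database slice of it) is in a single known encoding,
a small iconv(3) program can re-encode it for viewing; a sketch, assuming
the text is entirely WINDOWS-1251:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    /* "привет" in WINDOWS-1251; a real program would read the log file */
    char    src[] = "\xef\xf0\xe8\xe2\xe5\xf2";
    char    dst[64];
    char   *inp = src, *outp = dst;
    size_t  inleft = strlen(src), outleft = sizeof(dst) - 1;

    iconv_t cd = iconv_open("UTF-8", "WINDOWS-1251");
    if (cd == (iconv_t) -1)
        return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        return 1;
    *outp = '\0';
    printf("%s\n", dst);        /* prints the UTF-8 re-encoding */
    iconv_close(cd);
    return 0;
}

Note this only helps when the log really is in one encoding; a
mixed-encoding file (the situation described above) cannot be fixed by a
single pass.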

In the future it's possible that this may be fixed by logging output to
different files on a per-database basis or by converting the text
encoding of log messages, but no agreement has been reached on the
correct approach and nobody has stepped up to implement it.

--
Craig Ringer




Re: [BUGS] BUG #6733: All Tables Empty After pg_upgrade (PG 9.2.0 beta 2)

2012-07-18 Thread Mike Wilson
Tom, after patching, pg_upgrade now runs successfully. I noticed that this
patch had already been applied to REL9_2_STABLE since yesterday, so I also
tested a fresh git pull without applying the patch manually, and that
appears to work too. I think the issue has been resolved for me, thanks so
much! You guys rock!

Mike Wilson
mfwil...@gmail.com



On Jul 17, 2012, at 9:31 PM, Tom Lane wrote:

> Bruce Momjian  writes:
>> I am using git head for testing.  Tom sees a few things odd in
>> load_directory() that might be causing some problems on Solaris, and
>> this is new code for 9.2 for Solaris, so that might explain it.  I think
>> we need Tom to finish and then if you can grab our git source and test
>> that, it would be great!
> 
> The only thing I see that looks likely to represent a platform-specific
> issue is the entrysize calculation.  Mike, just out of curiosity, could
> you see if the attached patch makes things better for you?
> 
>   regards, tom lane
> 
> diff --git a/contrib/pg_upgrade/file.c b/contrib/pg_upgrade/file.c
> index 1dd3722142c9e83c1ec228099c3a3fd302a2179b..c886a67df43792a1692eec6b3b90238413e9f844 100644
> *** a/contrib/pg_upgrade/file.c
> --- b/contrib/pg_upgrade/file.c
> *** load_directory(const char *dirname, stru
> *** 259,265 
>   return -1;
>   }
> 
> ! entrysize = sizeof(struct dirent) - sizeof(direntry->d_name) +
>   strlen(direntry->d_name) + 1;
> 
>   (*namelist)[name_num] = (struct dirent *) malloc(entrysize);
> --- 259,265 
>   return -1;
>   }
> 
> ! entrysize = offsetof(struct dirent, d_name) +
>   strlen(direntry->d_name) + 1;
> 
>   (*namelist)[name_num] = (struct dirent *) malloc(entrysize);
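
As a standalone illustration of why offsetof() is the safer way to size a
struct with a trailing name buffer (the struct here is a made-up stand-in
for the platform's dirent):

#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct fake_dirent
{
    long d_ino;
    char d_name[1];     /* some platforms declare only one byte here */
};

int main(void)
{
    const char *name = "pg_class_oid_index";

    /* offsetof() counts only the fixed header, so this is exactly
       header + name bytes + terminating NUL: */
    size_t with_offsetof = offsetof(struct fake_dirent, d_name)
        + strlen(name) + 1;

    /* sizeof(struct) also includes d_name's declared size and any
       trailing padding, so this arithmetic varies across platforms: */
    size_t with_sizeof = sizeof(struct fake_dirent)
        - sizeof(((struct fake_dirent *) 0)->d_name)
        + strlen(name) + 1;

    printf("offsetof-based: %zu, sizeof-based: %zu\n",
           with_offsetof, with_sizeof);
    return 0;
}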




Re: [BUGS] main log encoding problem

2012-07-18 Thread Tatsuo Ishii
> C. We have one logfile in UTF-8.
> Pros: log messages from all our clients can fit in it, and we can use any
> generic editor/viewer to open it.
> Nothing changes for Linux (and other OSes with UTF-8 encoding).
> Cons: all strings written to the log file have to go through some
> conversion function.
> 
> I think the last solution is the right one. What is your opinion?

I am thinking about a variant of C.

The problem with C is that converting from another encoding to UTF-8 is
not cheap, because it requires huge conversion tables. This may be a
serious problem for a busy server. Also, some information can be lost in
this conversion, because there is no guarantee of a one-to-one mapping
between UTF-8 and other encodings. Another problem with UTF-8 is that you
have to choose *one* locale when using your editor. This may or may not
affect the handling of strings in your editor.

My idea is to use the mule-internal encoding for the log file instead of
UTF-8. There are several advantages:

1) Conversion to mule-internal encoding is cheap because no conversion
   table is required. Also, no information loss happens in this
   conversion.

2) Mule-internal encoding can be handled by emacs, one of the most
   popular editors in the world.

3) No need to worry about locales. Mule-internal encoding carries enough
   information about the language.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [BUGS] main log encoding problem

2012-07-18 Thread Tom Lane
Tatsuo Ishii  writes:
> My idea is to use the mule-internal encoding for the log file instead of
> UTF-8. There are several advantages:

> 1) Conversion to mule-internal encoding is cheap because no conversion
>    table is required. Also, no information loss happens in this
>    conversion.

> 2) Mule-internal encoding can be handled by emacs, one of the most
>    popular editors in the world.

> 3) No need to worry about locales. Mule-internal encoding carries enough
>    information about the language.

Um ... but ...

(1) nothing whatsoever can read MULE, except emacs and xemacs.

(2) there is more than one version of MULE (emacs versus xemacs,
not to mention any possible cross-version discrepancies).

(3) from a log volume standpoint, this could be pretty disastrous.

I'm not for a write-only solution, which is pretty much what this
would be.

regards, tom lane



Re: [BUGS] main log encoding problem

2012-07-18 Thread Tatsuo Ishii
> Tatsuo Ishii  writes:
>> My idea is to use the mule-internal encoding for the log file instead
>> of UTF-8. There are several advantages: [...]
> 
> Um ... but ...
> 
> (1) nothing whatsoever can read MULE, except emacs and xemacs.
> 
> (2) there is more than one version of MULE (emacs versus xemacs,
> not to mention any possible cross-version discrepancies).
> 
> (3) from a log volume standpoint, this could be pretty disastrous.
> 
> I'm not for a write-only solution, which is pretty much what this
> would be.

I'm not sure how long xemacs will survive (its last stable release was in
2009). Anyway, I'm not too worried about your points, since it's easy to
convert a mule-internal-encoded log file back to the original
mixed-encoding log file. No information will be lost. Even converting to
UTF-8 should be possible. My point is that once the log file is converted
to UTF-8, there is no way to convert it back to the original-encoding log
file.

Probably we would treat mule-internal-encoded log files as an internal
format, and provide a utility which converts from mule-internal to UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [BUGS] main log encoding problem

2012-07-18 Thread Craig Ringer

On 07/18/2012 11:16 PM, Alexander Law wrote:

> Now I see the following solutions.
> A. We have a different logfile for each database, in different encodings.
> [...]
> B. We have one logfile in the operating system encoding.
> [...]
> C. We have one logfile in UTF-8.
> [...]
> I think the last solution is the right one. What is your opinion?


Implementing any of these isn't trivial - especially making sure 
messages emitted to stderr from things like segfaults and dynamic linker 
messages are always correct. Ensuring that the logging collector knows 
when setlocale() has been called to change the encoding and translation 
of system messages, handling the different logging output methods, etc - 
it's going to be fiddly.


I have some performance concerns about the transcoding required for (b) 
or (c), but realistically it's already the norm to convert all the data 
sent to and from clients. Conversion for logging should not be a 
significant additional burden. Conversion can be short-circuited out 
when source and destination encodings are the same for the common case 
of logging in utf-8 or to a dedicated file.
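
That short-circuit is essentially a one-line check; a sketch with
hypothetical names (perform_conversion stands in for whatever conversion
routine the logging path would actually use):

/* Stand-in for the real conversion routine (hypothetical). */
extern char *perform_conversion(char *msg, int src_enc, int dest_enc);

/* Hypothetical: convert a message for the log only when needed. */
static char *
convert_for_log(char *msg, int msg_encoding, int log_encoding)
{
    if (msg_encoding == log_encoding)
        return msg;             /* common case: zero conversion cost */
    return perform_conversion(msg, msg_encoding, log_encoding);
}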


I suspect the eventual choice will be "all of the above":

- Default to (b) or (c), both have pros and cons. I favour (c) with a 
UTF-8 BOM to warn editors, but (b) is nice for people whose DBs are all 
in the system locale.


- Allow (a) for people who have many different DBs in many different 
encodings, do high volume logging, and want to avoid conversion 
overhead. Let them deal with the mess, just provide an additional % code 
for the encoding so they can name their per-DB log files to indicate the 
encoding.


The main issue is just that code needs to be prototyped, cleaned up, and 
submitted. So far nobody's cared enough to design it, build it, and get 
it through patch review. I've just foolishly volunteered myself to work 
on an automated crash-test system for virtual plug-pull testing, so I'm 
not stepping up.


--
Craig Ringer





[BUGS] BUG #6743: BETWEEN operator does not work for char(1)

2012-07-18 Thread spatarel1
The following bug has been logged on the website:

Bug reference:  6743
Logged by:  Spătărel Dan
Email address:  spatar...@yahoo.com
PostgreSQL version: 9.1.4
Operating system:   Windows Vista SP2
Description:

I use the "UTF8" charset and the "Romania, Romanian" locale.

I came across this as I wanted to test whether a symbol is a letter:

SELECT 'a' BETWEEN 'a' AND 'z'; -- true
SELECT 'z' BETWEEN 'a' AND 'z'; -- true
SELECT 'A' BETWEEN 'a' AND 'z'; -- true
SELECT 'Z' BETWEEN 'a' AND 'z'; -- false (!)
SELECT 'a' BETWEEN 'A' AND 'Z'; -- false (!)
SELECT 'z' BETWEEN 'A' AND 'Z'; -- true
SELECT 'A' BETWEEN 'A' AND 'Z'; -- true
SELECT 'Z' BETWEEN 'A' AND 'Z'; -- true

It seems that the intent is for the comparison to be case-insensitive, but
in some edge cases it fails.
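
(For what it's worth: PostgreSQL compares text with the C library's
locale-aware collation, and in many linguistic locales letters collate as
a < A < b < B < ... < z < Z, which would produce exactly these results.
A standalone check, runnable outside the database:)

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_COLLATE, "");  /* pick up the environment's locale */

    /* A negative result means the first string sorts earlier.  In a
       linguistic collation 'Z' typically sorts after 'z', so
       'Z' BETWEEN 'a' AND 'z' comes out false; in the C locale
       'Z' (0x5A) sorts before 'a' (0x61) instead. */
    printf("strcoll(\"Z\", \"z\") = %d\n", strcoll("Z", "z"));
    printf("strcoll(\"a\", \"A\") = %d\n", strcoll("a", "A"));
    return 0;
}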

Please let me know whether this turns out to be a real bug or not.




Re: [BUGS] main log encoding problem

2012-07-18 Thread Alexander Law

Hello,


> I am thinking about a variant of C. [...]
> My idea is to use the mule-internal encoding for the log file instead of
> UTF-8. There are several advantages: [...]

I believe that postgres has such conversion functions anyway, and they are
used for data conversion when we have clients (and databases) with
different encodings. So if they can be used for data, why not use them for
the relatively small amount of log messages?
And regarding the mule-internal encoding: reading about Mule at
http://www.emacswiki.org/emacs/UnicodeEncoding I found:
"In future (probably Emacs 22), Mule will use an internal encoding which
is a UTF-8 encoding of a superset of Unicode."

So I still see UTF-8 as a common denominator for all the encodings.
I am not aware of any characters absent from Unicode. Can you please
provide some examples of those that could result in lossy conversion?
Choosing UTF-8 in a viewer/editor is no big deal either. Most of them
detect UTF-8 automatically, and for the others a BOM can be added.
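
(The UTF-8 BOM is just the three bytes EF BB BF at the start of the file;
a minimal sketch of stamping it onto a fresh log file, with a hypothetical
file name:)

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("postgresql.log", "w");

    if (f == NULL)
        return 1;
    fwrite("\xEF\xBB\xBF", 1, 3, f);    /* UTF-8 byte order mark */
    fclose(f);
    return 0;
}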


Best regards,
Alexander