Re: [HACKERS] Radix tree for character conversion

Kyotaro HORIGUCHI Mon, 05 Dec 2016 02:31:49 -0800

Hello, thank you for reviewing this.

I compared mine and yours. The new patch works fine and gives
smaller radix map files. It seems also to me more readable.

At Fri, 2 Dec 2016 22:07:07 +0200, Heikki Linnakangas <[email protected]> wrote 
in <[email protected]>
> On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote:
> > Thanks. The attached patch contains the patch by perlcritic.
> >
> > 0001,2,3 are Heikki's patch that are not modified since it is
> > first proposed. It's a bit too big so I don't attach them to this
> > mail (again).
> >
> > https://www.postgresql.org/message-id/[email protected]
> 
> I've now pushed these preliminary patches, with the applicable fixes
> from you and Daniel. The attached patch is now against git master.

Thanks for committing them.

> > 0004 is radix-tree stuff, applies on top of the three patches
> > above.
> 
> I've spent the last couple of days reviewing this. While trying to
> understand how it works, I ended up dismantling, rewriting, and
> putting back together most of the added perl code. 

I might have been putting too much in one structure and a bit too
eager to conceal lower level from the upper level.

> Attached is a new
> version, with more straightforward logic, making it more
> understandable. I find it more understandable, anyway, I hope it's not
> only because I wrote it myself :-). Let me know what you think.

First, thank you for the refactoring(?).

I didn't intend to replace all of .map files with radix files at
first. Finally my patch removes all old-style map file but I
haven't noticed that. So removing the old bsearch code seems
reasonable. Avoiding redundant decomposition of multibyte
characters into bytes seems reasonable from the view of
efficiency.

The new patch decomposes the structured pg_mb_radix_tree into a
series of (basically) plain member variables in a struct. I'm not
so in favor of the style but a radix tree of at least 4-levels is
easily read in the style and maybe the code to handle it is
rather easily readable. (So +1 for it)

> In particular, I found the explanations of flat and segmented tables
> really hard to understand. So in this version, the radix trees for a
> conversion are stored completely in one large array. Leaf and
> intermediate levels are all in the same array. When reading this
> version, please note that I'm not sure if I mean the same thing with
> "segment" that you did in your version.
> I moved the "lower" and "upper" values in the structs. Also, there are
> now also separate "lower" and "upper" values for the leaf levels of
> the trees, for 1- 2-, 3- and 4-byte inputs. This made a huge

The "segment" there seems to mean definitely the same to
mine. Flattening the on-memory structure is fine from the same
reason to the above.

> difference to the size of gb18030_to_utf8_radix.map, in particular:
> the source file shrank from about 2 MB to 1/2 MB. In that conversion,
> the valid range for the last byte of 2-byte inputs is 0x40-0xfe, and
> the valid range for the last byte of 4-byte inputs is 0x30-0x39. With
> the old patch version, the "chars" range was therefore 0x30-0xfe, to
> cover both of those, and most of the array was filled with zeros. With
> this new patch version, we store separate ranges for those, and can
> leave out most of the zeros.

Great. I agree that the (logically) devided chartable is
significantly space-efficient.

> There's a segment full of zeros at the beginning of each conversion
> array now. The purpose of that is that when traversing the radix tree,
> you don't need to check each intermediate value for 0. If you follow a
> 0 offset, it simply points to the dummy all-zeros segments in the
> beginning. Seemed like a good idea to shave some cycles, although I'm
> not sure if it made much difference in reality.

And I like the zero page.

> I optimized pg_mb_radix_conv() a bit, too. We could do more. For
> example, I think it would save some cycles to have specialized
> versions of UtfToLocal and LocalToUtf, moving the tests for whether a
> combined character map and/or conversion callback is used, out of the
> loop. They feel a bit ugly too, in their current form...

Hmm. Maybe decomposing iiso in pg_mb_radix_conv is faster than
pushing extra 3 (or 4) parameters into the stack (or it's wrong
if they are passed using registers?) but I'm not sure.

> I need a break now, but I'll try to pick this up again some time next
> week. Meanwhile, please have a look and tell me what you think.

Thank you very much for the big effort on this patch.

Apart from the aboves, I have some trivial comments on the new
version.

1. If we decide not to use old-style maps, UtfToLocal no longer
  need to take void * as map data. (Patch 0001)

2. "use Data::Dumper" doesn't seem necessary. (Patch 0002)

3. A comment contains a superfluous comma. (Patch 0002) (The last
   byte of the first line below)
 > ### The segments are written out physically to one big array in the final,
 > ### step, but logically, they form a radix tree. Or rather, four radix

4. The following code doesn't seem so perl'ish.

   >  for (my $i=0; $i <= 0xff; $i++)
   >  {
   >    my $val = $seg->{values}->{$i};
   >    if ($val)
   >    {
   >      $this_min = $i if (!defined $this_min || $i < $this_min);

   Refraining from proposing extreme perl, the following would be
   reasonable as an equivalent. (Patch 0002)

   foreach $i (keys $seg->{values})
   {

   The 0002 patch contains the following change but this might be
   kinda extreme..

   -    for (my $i=0; $i <= 0xff; $i++)
   +    while ((my $i, my $val) = each $map)

4. download_srctxts.sh is no longer needed. (No patch)

I'll put more consideration on the new version and put another
version later.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 3f6153d2cbabb3c964aa0216c5dd866a159d2fb8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[email protected]>
Date: Mon, 5 Dec 2016 17:54:27 +0900
Subject: [PATCH 1/2] Change type for map files as the parameter of UtfToLocal

Each UtfToLocal and LocalToUtf takes map files in definitely one
struct type. They no longer need to be void*.
---
 src/backend/utils/mb/conv.c | 14 ++++++--------
 src/include/mb/pg_wchar.h   |  8 ++++----
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/backend/utils/mb/conv.c b/src/backend/utils/mb/conv.c
index 8c0563c..f850487 100644
--- a/src/backend/utils/mb/conv.c
+++ b/src/backend/utils/mb/conv.c
@@ -466,8 +466,8 @@ pg_mb_radix_conv(const pg_mb_radix_tree *rt,
 void
 UtfToLocal(const unsigned char *utf, int len,
 		   unsigned char *iso,
-		   const void *map,
-		   const void *cmap, int cmapsize,
+		   const pg_mb_radix_tree *map,
+		   const pg_utf_to_local *cmap, int cmapsize,
 		   utf_local_conversion_func conv_func,
 		   int encoding)
 {
@@ -600,8 +600,7 @@ UtfToLocal(const unsigned char *utf, int len,
 		/* Now check ordinary map */
 		if (map)
 		{
-			uint32 converted = pg_mb_radix_conv((pg_mb_radix_tree *)map,
-												l, b1, b2, b3, b4);
+			uint32 converted = pg_mb_radix_conv(map, l, b1, b2, b3, b4);
 			if (converted)
 			{
 				iso = store_coded_char(iso, converted);
@@ -658,8 +657,8 @@ UtfToLocal(const unsigned char *utf, int len,
 void
 LocalToUtf(const unsigned char *iso, int len,
 		   unsigned char *utf,
-		   const void *map,
-		   const void *cmap, int cmapsize,
+		   const pg_mb_radix_tree *map,
+		   const pg_local_to_utf *cmap, int cmapsize,
 		   utf_local_conversion_func conv_func,
 		   int encoding)
 {
@@ -725,8 +724,7 @@ LocalToUtf(const unsigned char *iso, int len,
 
 		if (map)
 		{
-			uint32 converted = pg_mb_radix_conv((pg_mb_radix_tree*)map,
-												l, b1, b2, b3, b4);
+			uint32 converted = pg_mb_radix_conv(map, l, b1, b2, b3, b4);
 
 			if (converted)
 			{
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index f51418f..7efa600 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -562,14 +562,14 @@ extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
 
 extern void UtfToLocal(const unsigned char *utf, int len,
 		   unsigned char *iso,
-		   const void *map,
-		   const void *combined_map, int cmapsize,
+		   const pg_mb_radix_tree *map,
+		   const pg_utf_to_local *cmap, int cmapsize,
 		   utf_local_conversion_func conv_func,
 		   int encoding);
 extern void LocalToUtf(const unsigned char *iso, int len,
 		   unsigned char *utf,
-		   const void *map,
-		   const void *combined_cmap, int cmapsize,
+		   const pg_mb_radix_tree *map,
+		   const pg_local_to_utf *cmap, int cmapsize,
 		   utf_local_conversion_func conv_func,
 		   int encoding);
 
-- 
2.9.2

>From 04248d7d4b23fdd5622265402599f1f4c5aa4744 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[email protected]>
Date: Mon, 5 Dec 2016 18:57:05 +0900
Subject: [PATCH 2/2] Several trivial fixes on convutils.pm

---
 src/backend/utils/mb/Unicode/convutils.pm | 28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/convutils.pm b/src/backend/utils/mb/Unicode/convutils.pm
index 30227fe..a5644a0 100644
--- a/src/backend/utils/mb/Unicode/convutils.pm
+++ b/src/backend/utils/mb/Unicode/convutils.pm
@@ -5,8 +5,6 @@
 
 use strict;
 
-use Data::Dumper;
-
 #######################################################################
 # convert UCS-4 to UTF-8
 #
@@ -376,7 +374,7 @@ sub print_radix_table
 	### Build a linear list of "segments", from the nested hashes.
 	###
 	### Each segment is a lookup table, keyed by the next byte in the input.
-	### The segments are written out physically to one big array in the final,
+	### The segments are written out physically to one big array in the final
 	### step, but logically, they form a radix tree. Or rather, four radix
 	### trees: one for 1-byte inputs, another for 2-byte inputs, 3-byte
 	### inputs, and 4-byte inputs.
@@ -415,14 +413,10 @@ sub print_radix_table
 		my $this_min = $min_idx{$seg->{depth}}->{$seg->{level}};
 		my $this_max = $max_idx{$seg->{depth}}->{$seg->{level}};
 
-		for (my $i=0; $i <= 0xff; $i++)
+		foreach my $i (keys $seg->{values})
 		{
-			my $val = $seg->{values}->{$i};
-			if ($val)
-			{
-				$this_min = $i if (!defined $this_min || $i < $this_min);
-				$this_max = $i if (!defined $this_max || $i > $this_max);
-			}
+			$this_min = $i if (!defined $this_min || $i < $this_min);
+			$this_max = $i if (!defined $this_max || $i > $this_max);
 		}
 
 		$min_idx{$seg->{depth}}{$seg->{level}} = $this_min;
@@ -517,10 +511,9 @@ sub print_radix_table
 	# Second pass: look up the offset of each label reference in the hash.
 	foreach my $seg (@segments)
 	{
-		for (my $i=0; $i <= 0xff; $i++)
+		while (my ($i, $val) = each %{$seg->{values}})
 		{
-			my $val = $seg->{values}->{$i};
-			if ($val && !($val =~ /^[0-9,.E]+$/ ))
+			if (!($val =~ /^[0-9,.E]+$/ ))
 			{
 				my $segoff = $segmap{$val};
 				if ($segoff)
@@ -573,9 +566,8 @@ sub print_radix_table
 	my $max_val = 0;
 	foreach my $seg (@segments)
 	{
-		for (my $i=0; $i <= 0xFF ; $i++)
+		foreach my $val (values $seg->{values})
 		{
-			my $val = $seg->{values}->{$i};
 			$max_val = $val if ($val > $max_val);
 		}
 	}
@@ -729,15 +721,13 @@ sub build_segments_recurse
 	{
 		my %children;
 
-		for (my $i=0; $i <= 0xff; $i++)
+		while (my ($i, $val) = each $map)
 		{
-			next if (!$map->{$i});
-
 			my $childpath = $path . sprintf("%02x", $i);
 			my $childlabel = "$depth-level-$level-$childpath";
 
 			push @segments, build_segments_recurse($header, $childlabel, $childpath,
-												   $level + 1, $depth, $map->{$i});
+												   $level + 1, $depth, $val);
 			$children{$i} = $childlabel;
 		}
 
-- 
2.9.2

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Radix tree for character conversion

Reply via email to