Re: [HACKERS] Radix tree for character conversion

Kyotaro HORIGUCHI Tue, 28 Feb 2017 00:35:45 -0800

At Tue, 28 Feb 2017 15:20:06 +0900, Michael Paquier <[email protected]> 
wrote in <cab7npqr49krgp6qaakal2v3hcnn+dnzv8dq_ysgbdsr6b_y...@mail.gmail.com>
> On Mon, Feb 27, 2017 at 5:37 PM, Kyotaro HORIGUCHI
> <[email protected]> wrote:
> > At Wed, 22 Feb 2017 16:06:14 +0900, Michael Paquier 
> > <[email protected]> wrote in 
> > <cab7npqrtq+7zjxuptbsr18mxvw7mtd29mn+91n7ag8fe5ac...@mail.gmail.com>
> >> In order to conduct sanity checks on the shape of the radix tree maps
> >> compared to the existing maps, having map_checker surely makes sense.
> >> Now in the final result I don't think we need it. The existing map
> >> files ought to be replaced by their radix versions at the end, and
> >> map_checker should be removed. This leads to a couple of
> >> simplifications, like Makefile, and reduces the maintenance to one
> >> mechanism.
> >
> > Hmm.. Though I don't remember clearly what the radix map of the
> > first version looked like, the current radix map seems
> > human-readable for me. It might be by practice or by additional
> > comments in map files. Anyway I removed all of the stuff so as
> > not to generate the plain maps. But I didn't change the names of
> > _radix.map and just commented out the line to output the plain
> > maps in UCS_to_*.pl.  Combined maps are still in the plain format
> > so print_tables was changed to take character tables separately
> > for regular (non-combined) characters and combined characters.
> 
> Do others have thoughts to offer on the matter? I would think that the
> new radix maps should just replace by the old plain ones, and that the
> only way to build the maps going forward is to use the new methods.
> The radix trees is the only thing used in the backend code as well
> (conv.c). We could keep the way to build the old maps, with the
> map_checker in module out of the core code. FWIW, I am fine to add the
> old APIs in my plugin repository on github and have the sanity checks
> in that as well. And of course also publish on this thread a module to
> do that.


I couldn't make out my mind to move to radix tree completely, but
UtfToLocal/LocalToUtf no longer handle the "plain map"s for
non-combined character so they have lost their planground. Okay,
I think I removed all the trace of the plain map era.

Every characters in a mapping has a comment that describes what
the character is or where it is defined. This information is no
longer useful (radix map doesn't have a plance to show it) but
I left it for debug use. (This might just be justification..)

> > - Split the property {direction} into two boolean properties
> >   {to_unicode} and {from_unicode}.
> >
> > - Make the {direction} property an integer and compared with
> >   defined constants $BOTH, $TO_UNICODE and $FROM_UNICODE using
> >   the '=' operator.
> >
> > I choosed the former in this patch.
> 
> Fine for me.

Thanks.

> >> +           $charmap{ ucs2utf($src) } = $dst;
> >> +       }
> >> +
> >> +   }
> >> Unnecessary newline here.
> >
> > removed in convutils.pm.
> >
> > Since Makefile ignores old .map files, the steps to generate a
> > patch for map files was a bit chaged.
> >
> > $ rm *.map
> > $ make distclean maintainer-clean all
> > $ make distclean
> > $ git add .
> > $ git commit
> 
> +# ignore generated files
> +/map_checker
> +/map_checker.h
> [...]
> +map_checker.h: make_mapchecker.pl $(MAPS) $(RADIXMAPS)
> +   $(PERL) $<
> +
> +map_checker.o: map_checker.c map_checker.h ../char_converter.c
> +
> +map_checker: map_checker.o
> With map_checker out of the game, those things are not needed.

Ouch! Thanks for pointing out it. Removed.

> +++ b/src/backend/utils/mb/char_converter.c
> @@ -0,0 +1,116 @@
> +/*-------------------------------------------------------------------------
> + *
> + *   Character converter function using radix tree
> In the simplified version of the patch, pg_mb_radix_conv() being only
> needed in conv.c I think that this could just be a static local
> routine.
> 
> -#include "../../Unicode/utf8_to_koi8r.map"
> -#include "../../Unicode/koi8r_to_utf8.map"
> -#include "../../Unicode/utf8_to_koi8u.map"
> -#include "../../Unicode/koi8u_to_utf8.map"
> +#include "../../Unicode/utf8_to_koi8r_radix.map"
> +#include "../../Unicode/koi8r_to_utf8_radix.map"
> +#include "../../Unicode/utf8_to_koi8u_radix.map"
> +#include "../../Unicode/koi8u_to_utf8_radix.map"
> FWIW, I am fine to use those new names as include points.
> 
> -distclean: clean
> +distclean:
>     rm -f $(TEXTS)
> -maintainer-clean: distclean
> +# maintainer-clean intentionally leaves $(TEXTS)
> +maintainer-clean:
> Why is that? There is also a useless diff down that code block.

It *was* for convenience but now it is automatically downloaded
so such distinction donsn't offer anything good. Changed it to
remove $(TEXTS).

> +conv.o: conv.c char_converter.c
> This also can go away.

Touching char_converter.c will be ignored if it is removed. Did
you mistake it for map_checker?

> -print_tables("EUC_JIS_2004", \@all, 1);
> +# print_tables("EUC_JIS_2004", \@regular, undef, 1);
> +print_radix_trees("EUC_JIS_2004", \@regular);
> +print_tables("EUC_JIS_2004", undef, \@combined, 1);
> [...]
>  sub print_tables
>  {
> -   my ($charset, $table, $verbose) = @_;
> +   my ($charset, $regular, $combined, $verbose) = @_;
> print_tables is only used for combined maps, you could remove $regular
> from it and just keep $combined around, perhaps renaming print_tables
> to print_combined_maps?

Renamed to print_combied_maps.

And the code-comment pointed in the comment by the previous mail
is rewritten as Robert's suggestion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

0001-Use-radix-tree-for-character-conversion.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Radix tree for character conversion

Reply via email to