RE: [PATCH] COBOL 3/8 gen: GENERIC interface

Robert Dubner Mon, 23 Dec 2024 15:48:26 -0800

Richard, a bunch of things you address are in my bailwick.

When Jim and I set out to create a COBOL front end, I knew *NOTHING*
about, well, anything vis-à-vis GCC.  I barely knew how it worked.  Some
things I had to figure out even before I knew how to figure anything outl
notably, creating functions.  Creating a function was practically the
first thing I had to do, and I spent a bunch of time trapping through GCC
compiling simple functions to devine how it might be done.  By definition,
I simply did not know what I was doing, and learned as I went.


So, if there are peculiarities, that's probably why.  And I am going to
eagerly go through this message, and others, to apply the knowledge you
are conveying; I am in a far better position to understand what you are
saying than I would have been three years ago.

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Sunday, December 22, 2024 07:18
> To: jklow...@symas.com
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] COBOL 3/8 gen: GENERIC interface
>
>
> (sorry for breaking threading, but quoting the whole mail made my MUA
> unbearably slow)
>
> From 64bcb34e12371f61a8958645e1668e0ac2704391gen.patch 4 Oct 2024
12:01:22
> -0400
> From: "James K. Lowden" <jklow...@symas.com>
> Date: Thu 12 Dec 2024 06:28:07 PM EST
> Subject: [PATCH]  Add 'cobol' to 4 files
>
> gcc/cobol/ChangeLog
>         * genapi.cc: New file.
>         * gengen.cc: New file.
>         * genmath.cc: New file.
>         * genutil.cc: New file.
>
>
> I'm skimming over this part of the frontend, taking notes on general
> things I notice, partly quoting parts of the patch
>
> +#include <err.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <ctype.h>
> +#include <vector>
> +#include <string>
> +#include <unordered_map>
> +#include <unordered_set>
> +#include <chrono>
> +#include <time.h>
> +#include <math.h>
> +
> +#include "config.h"
> +#include "system.h"
>
> all system headers need to be included by system.h, C++ includes should
be
> included by defining INCLUDE_UNORDERED_MAP and INCLUDE_UNORDERED_SET,
etc.
> - includes definitely have to come after including config.h

I now see that many of the front ends seem to have their own lang-system.h
that wraps the gcc/system.h.  Under the assumption that the recently-added
GO and RUST are doing that in an acceptable fashion, I will model that
behavior.

>
> +//#define XXX do{gg_printf("LINE %d\n", build_int_cst_type(INT,
> __LINE__), NULL_TREE);}while(0);
> +#define XXX
> ...
> +bool bSHOW_PARSE = getenv("SHOW_PARSE");
>
> there seems to be debugging wired in - at least the getenv call above
> should probably be be behind a more global COBOL_DEBUG conditional?
> There's a langhook for frontend initialization - that's a convenient
place
> to conditionally (re-)initialize bSHOW_PARSE (when you default it to
> false).
>
> +  // One strongly suspects, one does, that creating the constructor one
> + // character at a time isn't the best possible way of accomplishing
> + the  // desired goal. But I lost interest in trying to find a better
> way.
> +  TYPE_NAME(array_of_characters) = get_identifier("cblc_string");  tree
> + constr = make_node(CONSTRUCTOR);
> +  TREE_TYPE(constr) = array_of_characters;
>
> indeed, you should be able to use build_string (strlen(var_contents)+1,
> var_contents) and use that as DECL_INTIIAL (it builds a STRING_CST
node).

That worked, thank you.

>
> +tree
> +gg_fprintf(tree fd, int nargs, const char *format_string, ...)
> +  {
> ...
> +  tree stmt = build_call_array_loc (location_from_lineno(),
> +                                    INT,
> +                                    function,
> +                                    argc,
> +                                    args);
>
> instead of constructing args[] you should be able to directly call
> build_call_valist () with the va_list.
>

In this case, I don't see how to use build_call_valist ().

gg_fprintf() takes a format_string and 'nargs' following arguments. It
then builds a call to

sprintf (buffer, format_string, <those nargs variables);

I don't see how to pass 'format string' to build_call_valist ().

> You seem to, in function_decl_from_name or gg_get_function_address,
build
> function decls for externs, mainly from the C library.  Elsewhere we are
> using what GCC calls "built-ins" for this, see gcc/builtins.def and
> tree.h:builtin_decl_explicit.  Unfortunately most builtins and the used
> types are to be initialized by frontends, luckily some are from the
> middle-end, see build_common_builtin_nodes - that includes
BUILT_IN_MEMCPY
> and friends.  For optimization purposes (that GCC knows that "memcpy" is
> memcpy) I suggest you see to properly declare
> BUILT_IN_* declarations - the Fortran frontend might be a good recipie
to
> copy from (see fortran/f95-lang.cc around where it includes
> mathbuiltins.def).

I will learn about builtins.


>
> +    if( !field->var_decl_node )
> +      {
> +      fprintf(  stderr,
> +                "%s (type: %s) improperly has a NULL var_decl_node\n",
> +                field->name,
> +                cbl_field_type_str(field->type));
> +      fprintf(  stderr,
> +                "Probable cause: it was referenced without being
> defined.\n");
> +      exit(1);
>
> With GCC diagnostics you'd use internal_error (...) which then
terminates
> GCC.
>
> +static void
> +assembler_label(const char *label, bool global = false)
> +  {
> +  // label has to be a valid label for the assembler
> +
> +  // use the global switch for COBOL ENTRY
> +
> +  const char global_text[] = ": .global ";
>
> there's target macros for how assemblers work, like ASM_OUTPUT_LABEL,
but
> it seems this function is about sections?  I'm not sure how
> gg_insert_into_assembler "works", but it seems to emit a ASM_EXPR into
the
> IL - in the section_label case in the toplevel?  For portability I'd
> strongly advise against the use of global assembler, you can always tell
> the middle-end what you desire here, for sections you can use
> set_decl_section_name for example.
>
> I'm not sure for what else you use ASM_EXPRs.

Hoo, boy.  We have a namespace collision.  The COBOL language -- which, I
remind you, got there first -- uses "SECTION" and "PARAGRAPH" to mean very
specific things that are part of the language specification. So, my use of
the word "section" has nothing to do with the sections in an ELF
executable.

COBOL sections and paragraphs (a section is a group of paragraphs) are
conceptually similar to C functions. Given a paragraph named FOO, you can
PERFORM FOO and the group of sentences (yes, a paragraph is made up of
sentences; I remind you that COBOL was originally designed to be readable
by non-programmers) are executed and control then returns to the statement
after the PERFORM.

For various reasons, execution into, through, and back from sections and
paragraphs must be implemented with GOTO statements, and cannot be
implemented with calls.

I nonetheless attempted, at one point, to implement PERFORM via calls, and
the ".global" you noticed is a vestige of that effort.  The routine it was
used in has a boolean variable 'global' that defaults to false, and the
routine was never called with that parameter set to true.  It is
unnecessary and has been deleted.

gg_insert_into_assembler() does indeed use ASM_EXPR. I sometimes use it to
generate #-delimited comments into the generated assembly language so that
I can see what's going on.

But I also use it to generate labeled locations in the executables.

I am also developing a GDB-COBOL version of GDB, one that understands the
executables GCOBOL is generating.  We need the GDB NEXT instruction to
execute through a PROC that is the subject of a PERFORM PROC statement.
And I need to be able to set a breakpoint with "(gdb)break PROC".

I have not yet figured out how to use GCC to put information into the
.debug_info section of a DWARF executable, and I have not yet figured out
how, in GDB, to extract such information.  So, for now, I am creating
labels with meta-information in them.

So, for example, for a program 'prog' with section 'mainsec' and paragraph
'graff', I use ASM_EXPR to create the label
"_para.graff.mainsec.prog.0.79:"  Thus, when a user enters "break graff"
or "break graff of mainsec", my code in GDB knows what to search for in
order to know where to set the breakpoint.

I apologize for the lengthy description here.  Sometimes, however, it is
unavoidable.  Sometimes, it is true, I do weird things out of ignorance.
But sometimes I am doing weird things because COBOL is, well, (and
certainly in the context of compilers and debuggers designed for C/C++),
weird.

>
> I noticed that you use build1 and friends sometimes without a location,
> elsewhere there's location_from_lineno ().  Parts of GCC use the global
> 'input_location' which you could set when sv_current_line_number
changes?
> GCC locations also track file (for C #include - not sure if relevant for
> Cobol) and column.

I will look into 'input_location'.

>
> +// These are globally useful constants
> +tree char_nodes[256];
> +
> +tree integer_minusone_node;
> +tree integer_two_node;
> +tree integer_eight_node;
> +tree size_t_zero_node;
> +tree int128_zero_node;
> +tree int128_five_node;
> +tree int128_ten_node;
>
> you might sooner or later run into garbage collector issues, either
> because you don't collect (so garbage piles up) or because you get
things
> collected that are still used (because it's not marked).  The above look
> like they want to be marked GTY(()) (maybe they are in a header file).

They are not marked GTY(()).  I don't know what in tarnation GTY means,
and I have never been able to figure it out.  And I don't know how, or
when, to do garbage collection.

I am grateful for any hint.  (I have been aware, as I have been working,
that I have been ignoring things that are probably important.  But I
figured I could track down problems as they occurred, and if no problems
showed up, well, as the old story goes, when it ain't rainin' the roof
don't leak nohow.  I couldn't wait to understand GCC completely before
writing code.)

>
> +    tree distance = build2( MULT_EXPR,
> +                            SIZE_T,
> +                            gg_cast(sizetype, offset),
> +                            TYPE_SIZE_UNIT(element_type));
>
> if offset is a constant then the above builts the multiplication
> expression, if you use fold_build2 (...) it would be simplified down to
a
> constant early.  With optimization it's going to be optimized anyway,
but
> as the multiplication isn't written by the user it might be tempting to
> simplify it in the frontend.

Brand new to me; I will look into it.

>
> +static tree
> +gg_get_larger_type(tree A, tree B)
> +  {
> +  tree larger = TREE_TYPE(B);
> +  if( TYPE_SIZE(TREE_TYPE(A)) > TYPE_SIZE(TREE_TYPE(B)) )
> +    {
> +    larger = TREE_TYPE(A);
>
> that doesn't work - TYPE_SIZE is a pointer to a tree.  You can use
> tree_int_cst_compare (TYPE_SIZE(TREE_TYPE(A)),
> TYPE_SIZE(TREE_TYPE(B))) == 1 instead.
>
> Maybe you are looking at arithmetic precision instead, in which case
> TYPE_PRECISION (TREE_TYPE (A)) > TYPE_PRECISION (TREE_TYPE (B)) would be
> appropriate (for example GCC supports int:3 and int:5 but they'd have
both
> TYPE_SIZE of 8 but precision of 3 and 5).
>
> Maybe Cobol doesn't have signed and unsigned types, iff it does it looks
> like gg_get_larger_type might care for TYPE_UNSIGNED as well somehow.

Oi.  Here come many words.  I don't like answering with too many words,
but sometimes answers are complicated and many words are required.

As we have implemented it, COBOL variables do not have "types" as GCC uses
the term.  A COBOL variable is a 112-byte cblc_field_t structure.  It
contains an enum that specifies the COBOL type (alphanumeric,
floating-point, packed-decimal, numeric display (where numbers are stored
as character strings of digits, e.g. "123"), binary, among other things).
The structure also specifies the number of bytes, the number of digits,
the number of digits to the right of a fixed-point decimal point.  There
is a 64-bit integer value full of attribute flags: whether or not the
value is signable, big-endian, little-endian.

A true loony -- I am loony only when it's really useful -- would embark
upon a quixotic religious quest to expand tree.def to encompass all that
COBOL crap, so that we would have DEFTREECODE(COBOL_PACKED_DECIMAL_TYPE,
"cobol_packed_decimal_type", tcc_type, 0) and go through the effort of
having that TYPE be defined with the appropriate attributes, and then
everything I've done would get absorbed into the middle-end which would
spit out the assembly language to implement the language.

Anybody who would want to do it that way should be thrown into a pit.
Anybody who would want to let them do it that way should be thrown in,
too.

In general, most of the work in the executable is being done in the
libgcobol.so code, written in C++.  When somebody wants to add a
six-character value represented as a string of EBCDIC numerals with two
digits to the right of a virtual decimal point, to a big-endian binary
value that has four significant digits with and implied multiplier of
1000, and assign the result to a 12-digit packed decimal number with three
digits to the right of an implied decimal point, there is an __gg__add()
routine in the libgcobol.so that does all that.

Welcome to COBOL.  Everything I just described there is four lines of
code:
77 A PIC  999V99  USAGE DISPLAY.
77 B PIC 9999PPP  USAGE BINARY
77 C PIC 9(9)V999 USAGE PACKED-DECIMAL.
ADD A TO B GIVING C

So, a lot of what I do in the generated GENERIC is pass pointers to those
cblc_field_t structures around.

But!  There is a lot of work that I don't want to do in the library for
performance reasons.  For example, if somebody defines a four-byte
little-endian binary value that gets used as a loop counter, I don't want
to be calling a library routine to decrement it, and another library
routine to check if it's zero.

So, I am generating GENERIC in a way that I regard as writing assembly
language to handle some of the internals, like subtracting a numeric
literal one from a four-byte little-endian binary.  I also hand-code stuff
like calculating offsets into tables from the table subscripts.  I didn't
want to put that into library code because critical inner loops often have
table subscripts and I don't want the overhead.

The upshot is that some benchmark routines, that are heavily inner-loop
dependent, run two to three times faster when compiled with GCOBOL than
when compiled with some others.

With that all said: Much of the code you are referencing here isn't used
directly in implementing user COBOL source.  It is rather used by me in
generating the "assembly language" to do things like array offset
calculations.

In any case, and given that gg_get_larger_type() is used by me in my code
on variables I define, rather than on user code on variables the user
defines, I believe it is doing what I need it to do.  For example, when I
need to multiply tree A by tree B, I need them to both be of the same type
in order to keep the TRUNC_DIV_EXPR from complaining that they aren't the
same type.  So, I use gg_get_larger_type(A, B) to return the type of the
one with the larger TYPE_SIZE, and then I cast both A and B to that type.

At least, that's what I think I am doing.  In my pile of index cards is
the one that says, "Eliminate gg_get_larger_type(), and then fix all the
resulting errors by defining the variables more sensibly in the first
place."

>
> +tree
> +gg_build_logical_expression(tree operand_a,
> +                            enum logop_t op,
> +                            tree operand_b)
> +  {
> +  tree logical_expression = NULL_TREE;
> +  tree_code logical_op;
> +  switch(op)
> +    {
> +    case and_op:
> +      logical_op = TRUTH_ANDIF_EXPR;
> ...
> +    case xor_op:
> +      logical_op = TRUTH_XOR_EXPR;
>
> note TRUTH_ANDIF_EXPR and TRUTH_XOR_EXPR behave differently with respect
> to evaluation of side-effects of the second operand.  There's no
> TRUTH_XORIF_EXPR, iff XORIF semantics are desired you'd have to lower
that
> explicitly (or ask for the middle-end to adopt a TRUTH_XORIF_EXPR).
>
> +__int128
> +get_power_of_ten(int n)
> +  {
>
> I think it was already mentioned - you eventually want to use wide_int
(or
> a fixed_wide_int_storage specialization), or for truly arbitrary
precision
> integers use gmp (the Fortran frontend does this).  The use of __int128
> restricts you to a subset of hosts the Cobol compiler can run on.

I will look into wide_int.  I have been reluctant to use gmp simply
because of the increased complexity, a belief that, over time, the number
of phone calls from users is directly proportional to the number of
prerequisite libraries, and because it wasn't that many months ago that
adding 1 to 2 and getting 3 was justification for breaking out the
champagne.  If gmp gets us more architectures without too much of a hit on
performance, we'll switch over.

>
>
> That's it for a skim through this large patch.  I'm curious whether you
> tried configuring with --enable-checking=yes (the default on trunk) and
if
> you tried to use LTO with Cobol yet?
> I also wonder about the DWARF generated with -g

In my debug builds, I have "--enable-checking", which I see is the same as
"enable-checking=yes". In our release builds we have,
"--enable-checking=release".

I have not tried to use LTO.

The DWARF is working nicely with the companion GDB-COBOL on code compiled
with '-ggdb -O0".  The GDB folks are, at best, dimly aware that I am
rummaging through their code.  I will use the occasion of a patch into GCC
as the starting gun for begging their forgiveness and enlisting their aid
incorporating my stuff.

You can use ordinary GDB to set breakpoints on lines and then step through
code created by GCOBOL. "print" won't work, however.

In response to some other comments from Joseph Meyers, I will report here
that I did some major surgery on the structure of the code.  I have rigged
things up so that libgcobol compiles on its own; it has no prerequisites
in gcc/cobol.  The gcc/cobol code does, of course, rely on libgcobol/ .h
files.  There are also two code modules that are compiled into both
libgcobol and gcc/cobol. (There are some tables for generating messages,
and a bunch of code for dealing with EBCDIC versus ASCII that's used both
at compile time and at run time.) Those source code modules now live in
libgcobol.  The Make-lang.in file now makes copies of those two files and
puts them into gcc/cobol.  So, libgcobol/charmaps.cc gets copied as
gcc/cobol/charmaps-copy.cc; line 1, a //comment line, gets replaced with
"// DO NOT EDIT THIS FILE.  It was copied from the libgcobol directory."

What this does to Jim's patching process, I don't know.  But I don't know
how to respond to input from you experts without changing the code. The
patching process will just have to catch up, I guess.

Thank you ever so much for your comments and advice.  It feels a little
like coming out from the cold.

Bob Dubner
rdub...@symas.com

>
> Thanks,
> Richard.

RE: [PATCH] COBOL 3/8 gen: GENERIC interface

Reply via email to