On 27/09/2022 14:16, Tobias Burnus wrote:
> @@ -422,6 +428,12 @@ struct agent_info
> if it has been. */
> bool initialized;
> + /* Flag whether the HSA program that consists of all the modules has been
> + finalized. */
> + bool prog_finalized;
> + /* Flag whether the HSA OpenMP's requires_reverse_offload has been used. */
> + bool has_reverse_offload;
> +
> /* The instruction set architecture of the device. */
> gcn_isa device_isa;
> /* Name of the agent. */
> @@ -456,9 +468,6 @@ struct agent_info
> thread should have locked agent->module_rwlock for reading before
> acquiring it. */
> pthread_mutex_t prog_mutex;
> - /* Flag whether the HSA program that consists of all the modules has been
> - finalized. */
> - bool prog_finalized;
> /* HSA executable - the finalized program that is used to locate kernels. */
> hsa_executable_t executable;
> };
Why has prog_finalized been moved?
> Andrew did suggest a while back to piggyback on the console_output handling,
> avoiding another atomic access. - If this is still wanted, I would like to have
> some guidance regarding how to actually implement it.
The console output ring buffer has the following type:
struct output {
  int return_value;
  unsigned int next_output;
  struct printf_data {
    int written;
    char msg[128];
    int type;
    union {
      int64_t ivalue;
      double dvalue;
      char text[128];
    };
  } queue[1024];
  unsigned int consumed;
} output_data;
That is, for each entry in the buffer there is a 128-byte message
string, an integer argument-type identifier, and a 128-byte argument
field. Before we had printf we had functions that could print
string+int (gomp_print_integer, type==0), string+double
(gomp_print_double, type==1) and string+string (gomp_print_string,
type==2). The string conversion could then be done on the host to keep
the target code simple. These would still be useful functions if you
want to dump debug output quickly without affecting performance so much,
but I don't think they were ever upstreamed: somebody (who should have
known better!) created an unrelated function upstream with the same name
(gomp_print_string), and we already had working printf by then, so the
effort to fix it wasn't worth it.
The current printf implementation (actually the write syscall) uses
type==3 to print 256 bytes of output per packet, with no implied newline.
The point is that you can use the "msg" and "text" fields for whatever
data you want, as long as you invent a new value for "type".
The current loop has:
switch (data->type)
  {
  case 0: printf ("%.128s%ld\n", data->msg, data->ivalue); break;
  case 1: printf ("%.128s%f\n", data->msg, data->dvalue); break;
  case 2: printf ("%.128s%.128s\n", data->msg, data->text); break;
  case 3: printf ("%.128s%.128s", data->msg, data->text); break;
  default: printf ("GCN print buffer error!\n"); break;
  }
You can make "case 4" do whatever you want. There are enough bytes for 4
pointers, and you could use multiple packets (although it's not safe to
assume they're contiguous or already arrived; maybe "case 4" for part 1,
"case 5" for part 2). It's possible to change this structure, of course,
but the target implementation is in newlib so versioning becomes a problem.
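
As a rough illustration of the "case 4" idea, here is a minimal sketch of
carrying four pointer-sized values in the "text" field. The names
TYPE_FOUR_POINTERS, pack_pointers and unpack_pointers are made up for this
example and do not exist in libgomp or newlib; the entry layout is copied
from the struct quoted above, and what the four values would mean is left
open.

```c
#include <stdint.h>
#include <string.h>

/* One ring-buffer entry, with the same fields as the struct quoted above.  */
struct printf_data {
  int written;
  char msg[128];
  int type;
  union {
    int64_t ivalue;
    double dvalue;
    char text[128];
  };
};

/* Hypothetical: a new packet type whose "text" field carries four
   pointer-sized values instead of characters.  */
#define TYPE_FOUR_POINTERS 4

/* Producer side: store four 64-bit values in the entry and tag it.  */
void
pack_pointers (struct printf_data *data, const uint64_t ptrs[4])
{
  memcpy (data->text, ptrs, 4 * sizeof (uint64_t));
  data->type = TYPE_FOUR_POINTERS;
}

/* Consumer side: return 1 and fill OUT if this is a type-4 entry,
   otherwise return 0 and leave OUT untouched.  */
int
unpack_pointers (const struct printf_data *data, uint64_t out[4])
{
  if (data->type != TYPE_FOUR_POINTERS)
    return 0;
  memcpy (out, data->text, 4 * sizeof (uint64_t));
  return 1;
}
```

On the host, a "case 4" arm in the switch could call something like
unpack_pointers and hand the values on; using memcpy rather than a pointer
cast avoids relying on the alignment of the "text" field.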
Reusing this would remove the need for has_reverse_offload, since the
console output is scanned anyway, and would also eliminate rev_ptr and
rev_data. It also means that, hypothetically, the device can queue up
reverse offload requests asynchronously in the ring buffer (you'd need to
ensure multi-part packets don't get interleaved, though).
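
One possible way to keep multi-part packets together, sketched under the
assumption that entries are claimed by atomically advancing next_output
(I have not checked that against the newlib side here): claim all the
parts with a single atomic add, so nothing else can land in between. The
helper names reserve_entries and entry_index are hypothetical.

```c
#include <stdatomic.h>

#define QUEUE_SIZE 1024  /* Matches queue[1024] in struct output.  */

/* Hypothetical helper: claim COUNT consecutive entries with one atomic
   add, so that no other thread can claim an entry in between.  Returns
   the number of the first claimed entry, before wrap-around.  */
unsigned int
reserve_entries (atomic_uint *next_output, unsigned int count)
{
  return atomic_fetch_add (next_output, count);
}

/* Map the Nth entry of a reservation onto the ring buffer.  */
unsigned int
entry_index (unsigned int first, unsigned int n)
{
  return (first + n) % QUEUE_SIZE;
}
```

This only makes the parts adjacent; the host would still have to check
that each part has actually been written before acting on the request.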
Andrew