Thanks Jo! So now I'm working on the OpenCL port. I have this function in the IOP:

#ifdef HAVE_OPENCL
int process_cl(struct dt_iop_module_t *self, dt_dev_pixelpipe_iop_t *piece, cl_mem dev_in, cl_mem dev_out,
               const dt_iop_roi_t *const roi_in, const dt_iop_roi_t *const roi_out)
{
  dt_iop_profilegamma_data_t *d = (dt_iop_profilegamma_data_t *)piece->data;
  dt_iop_profilegamma_global_data_t *gd = (dt_iop_profilegamma_global_data_t *)self->data;

  cl_int err = -999;
  const int devid = piece->pipe->devid;
  const int width = roi_in->width;
  const int height = roi_in->height;

  const float noise = powf(2., d->noise_level);
  const float grey = d->grey_point / 100.;

  size_t sizes[3] = { ROUNDUPWD(width), ROUNDUPHT(height), 1 };

  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 0, sizeof(cl_mem), (void *)&dev_in);
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 1, sizeof(cl_mem), (void *)&dev_out);
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 2, sizeof(int), (void *)&width);
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 3, sizeof(int), (void *)&height);
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 4, sizeof(float), (void *)&(d->camera_factor));
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 5, sizeof(float), (void *)&(d->dynamic_range));
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 6, sizeof(float), (void *)&noise);
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 7, sizeof(float), (void *)&(d->shadows_range));
  dt_opencl_set_kernel_arg(devid, gd->kernel_profilegamma_log, 8, sizeof(float), (void *)&grey);

  err = dt_opencl_enqueue_kernel_2d(devid, gd->kernel_profilegamma_log, sizes);
  if(err != CL_SUCCESS) goto error;
  return TRUE;

error:
  dt_print(DT_DEBUG_OPENCL, "[opencl_profilegamma_log] couldn't enqueue kernel! %d\n", err);
  return FALSE;
}
#endif

and this one in basic.cl:

kernel void profilegamma_log (read_only image2d_t in, write_only image2d_t out,
                              unsigned int width, unsigned int height,
                              float factor, float range, float noise_level,
                              float shadows_range, float grey)
{
  const unsigned int x = get_global_id(0);
  const unsigned int y = get_global_id(1);

  if(x >= height || y >= width) return;

  float4 i = read_imagef(in, sampleri, (int2)(x, y));
  const float4 min_val = noise_level / (noise_level + grey);
  const float4 Logmin_val = log2(min_val);
  i.xyz = (log2(factor * (i.xyz + noise_level) / (grey + noise_level)) - shadows_range) / range;

  write_imagef(out, (int2)(x, y), i);
}

While trying to run it, I get:

1,376046 [opencl_create_kernel] successfully loaded kernel `profilegamma_log' (165) for device 0
…
7,995011 [opencl_events_flush] could not get event info for '[Read Image (from device to host)]': -9999
7,995018 [opencl_events_flush] could not get event info for 'profilegamma_log': -9999
7,995021 [opencl_pixelpipe] could not run module 'profile_gamma' on gpu. falling back to cpu path
7,995208 [opencl_pixelpipe (b)] late opencl error detected while copying back to cpu buffer: -5

Any clue about what could be happening?

Thanks,

Aurélien.

On 18/09/2018 at 03:26, johannes hanika wrote:
> heya,
>
> [..]
>> Actually, I even wonder if OpenCL is relevant for this as it's a linear
>> operation performed on only one pixel at the time over the flatten array. I
>> wouldn't be surprised if the OpenCL version were slower on some systems than
>> a good SSE2 version.
> maybe on some systems. the thing with opencl is that you need to copy
> the buffer to the gpu and back at the end. if you have one module that
> interrupts the pipeline, you'll need to copy more (get your input
> buffer back to the cpu, process, copy back to gpu). this slows down
> the whole process significantly, even if the module would run at same
> speed on both devices.
>
>
>> Considering the code itself, my only remarks are for this line:
>>     for(size_t k = 1; k < (size_t)ch * roi_out->width * roi_out->height; k++)
>> First, is there a reason why you are using a size_t type? int or unsigned
>> would be fine I think, and you wouldn't need a cast.
> you definitely want 64 bits for the counter if you go width*height
> (times channel count here, too). size_t happens to be unsigned 64-bit
> int on many systems. using stdint.h you could use uint64_t to be even
> clearer and maybe more portable. note that you could have used a
> nested loop for y and for x together with an openmp annotation
> "collapse(2)" to get similar results.
>
> and yes, please start at 0 :)
>
> cheers,
> jo
> ___________________________________________________________________________
> darktable developer mailing list
> to unsubscribe send a mail to darktable-dev+unsubscr...@lists.darktable.org
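
For reference, the counter-width point in the quoted reply can be sketched in plain C as below. This is only an illustration under stated assumptions: the helper names (element_count, scale_flat, scale_nested) are hypothetical and are not darktable functions, and the "gain" multiplication merely stands in for whatever per-value operation the module performs. It shows the element count computed in 64 bits, a flat loop starting at 0, and the nested y/x variant with the "collapse(2)" OpenMP annotation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of values in a ch * width * height buffer.
   Every operand is widened first, so the product is computed in 64 bits
   and cannot wrap the way an int product could on a large image. */
static uint64_t element_count(int ch, int width, int height)
{
  assert(ch > 0 && width >= 0 && height >= 0);
  return (uint64_t)ch * (uint64_t)width * (uint64_t)height;
}

/* Flat loop over the whole buffer, starting at index 0. */
static void scale_flat(float *buf, int ch, int width, int height, float gain)
{
  const uint64_t n = element_count(ch, width, height);
  for(uint64_t k = 0; k < n; k++) buf[k] *= gain;
}

/* Nested y/x loops. With OpenMP enabled, collapse(2) merges both loops
   into a single parallel iteration space; without it the pragma is skipped.
   The flat index is still computed in size_t. */
static void scale_nested(float *buf, int ch, int width, int height, float gain)
{
#ifdef _OPENMP
#pragma omp parallel for collapse(2) schedule(static)
#endif
  for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
    {
      const size_t base = ((size_t)y * (size_t)width + (size_t)x) * (size_t)ch;
      for(int c = 0; c < ch; c++) buf[base + c] *= gain;
    }
}
```

Both variants touch every value exactly once, so they should give identical results; the nested form only changes how the work is split across threads. A 4-channel buffer of 30000 x 20000 pixels already holds 2.4 billion values, which is more than a signed 32-bit int can count.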