Francisco Jerez <curroje...@riseup.net> writes:

> Jan Vesely <jan.ves...@rutgers.edu> writes:
>
>> Hi,
>>
>> thanks for the detailed explanation. I indeed missed the writeBuffer
>> part in the spec.
>>
>> On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>>> These changes are somewhat redundant and potentially
>>> performance-impacting: the reason is that in the OpenCL API,
>>> clEnqueueWrite* commands are specified to block until the memory
>>> provided by the application as origin can be reused safely (i.e. until
>>> soft_copy_op()() runs), not necessarily until the transfer to graphics
>>> memory has completed (which is what hard_event::wait() will wait for).
>>> OTOH reads and maps as implemented by soft_copy_op and friends are
>>> essentially blocking, so the wait() call is redundant in most cases.
>>
>> That explains a noticeable slowdown running piglit with these changes.
>> I'm not sure about the read part though. I expected it to be basically
>> a noop, but it changes behaviour.
>
> I think this change would have slowed you down the most whenever the
> mapping operation performed by soft_copy_op() is able to proceed
> immediately, either because the buffer is idle (so the driver doesn't
> stall on transfer_map()) *or* because the driver is trying to be smart
> and creates a bounce buffer where data can be copied into immediately
> without stalling, then inserts a pipelined GPU copy from the bounce
> buffer into the real buffer.  With this patch you will stall on the GPU
> copy regardless (and on whatever other work was already on the command
> stream, which might be substantial), even though it wouldn't have been
> necessary in any of these cases.
>
>> Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> blocking read in one of the piglit tests surprisingly returns
>> CL_QUEUED.
>>
>
> Hmm, yeah, that seems like debatable behaviour, although it's
> definitely legit for writes; not quite sure about reads...
> I believe the reason why that happens is that the CPU copy proceeds
> very quickly (due to the reasons mentioned in the last paragraph), but
> the hard_event is still associated with a pipe_fence synchronous with
> the GPU's command stream, so it won't get signalled until the GPU
> catches up.
>
>> The spec doesn't mention the use of events with blocking reads, but it
>> does say that "When the read command has completed, the contents of
>> the buffer that ptr points to can be used by the application." in the
>> non-blocking section.  I'd say that the expectation is for the event
>> to be CL_COMPLETE after a blocking read (at least
>> beignet/pocl/intel-cpu-sdk follow this).
>>
>>>
>>> The only reason why it might be useful to behave differently on
>>> blocking transfers is that the application may have specified a user
>>> event in the event dependency list, which may cause the
>>> soft_copy_op() call to be delayed until the application signals the
>>> user event.  In order to fix that it should really only be necessary
>>> to wait for the event action to be executed, not necessarily its
>>> associated GPU work.
>>
>> I think that another use is that non-blocking writes do not create an
>> extra copy of the buffer.  Thus
>>   clEnqueueWriteBuffer(..., cl_false, ev, ...);
>>   clWaitForEvents(ev);
>> is more memory efficient.
>>
>>>
>>> Last time this issue came up (yeah, it's not the first time) I
>>> proposed the patches below to add a mechanism to wait for the event
>>> action only; feel free to include them as PATCH 0.1 and 0.2 of this
>>> series (it's been a while, so they may no longer apply cleanly).
>>
>> I think we can just add comments explaining why the blocking argument
>> is ignored, until someone chooses to fix this problem
>
> I think the problem is definitely worth fixing, and it shouldn't really
> take more effort than adding comments explaining the current behaviour
> ;), basically just add a bunch of 'if (blocking)
> hev().wait_signalled();' where the spec requires it, roughly as you had
> been doing in this patch, but wait_signalled() should only stall on the
> CPU work associated with the event, which should give you the same
> performance as the current approach.
>
>> and/or to implement proper non-blocking variants (would std::async
>> work for trivial cases like ReadBuffer?)
>>
Hm, and to answer this question: yeah, std::async would probably work,
but I'm not certain whether it would actually perform better than the
current approach, because on the one hand the actual DMA-ing of the
buffer is likely to happen quasi-asynchronously already, assuming the
driver is competent, and OTOH spawning a new thread for the copy would
introduce additional overhead that might defeat your purpose unless the
copy is very large.  Only experimentation will tell whether it pays off.

>> thanks,
>> Jan
>>
>>>
>>> Thank you.
>>>
>>> Jan Vesely <jan.ves...@rutgers.edu> writes:
>>>
>>> > v2: wait in map_buffer and map_image as well
>>> >
>>> > Signed-off-by: Jan Vesely <jan.ves...@rutgers.edu>
>>> > ---
>>> > Hi Aaron,
>>> >
>>> > yes, I think you're right, we should wait in Map* as well.
>>> > If nothing else it's consistent, even if passing the flag to add_map
>>> > might make it unnecessary (haven't actually checked).
>>> >
>>> > thanks,
>>> > Jan
>>> >
>>> >  src/gallium/state_trackers/clover/api/transfer.cpp | 30 ++++++++++++++++++++--
>>> >  1 file changed, 28 insertions(+), 2 deletions(-)
>>> >
>>> > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp b/src/gallium/state_trackers/clover/api/transfer.cpp
>>> > index f7046253be..729a34590e 100644
>>> > --- a/src/gallium/state_trackers/clover/api/transfer.cpp
>>> > +++ b/src/gallium/state_trackers/clover/api/transfer.cpp
>>> > @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                       &mem, obj_origin, obj_pitch,
>>> >                       region));
>>> >
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >
>>> > @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                       ptr, {}, obj_pitch,
>>> >                       region));
>>> >
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >
>>> > @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                       &mem, obj_origin, obj_pitch,
>>> >                       region));
>>> >
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >
>>> > @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                       ptr, host_origin, host_pitch,
>>> >                       region));
>>> >
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >
>>> > @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                       &img, src_origin, src_pitch,
>>> >                       region));
>>> >
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >
>>> > @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                       ptr, {}, src_pitch,
>>> >                       region));
>>> >
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >
>>> > @@ -667,7 +685,11 @@ clEnqueueMapBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >
>>> >     void *map = mem.resource(q).add_map(q, flags, blocking, obj_origin, region);
>>> >
>>> > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps));
>>> > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps);
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> > +   ret_object(rd_ev, hev);
>>> >     ret_error(r_errcode, CL_SUCCESS);
>>> >     return map;
>>> >
>>> > @@ -695,7 +717,11 @@ clEnqueueMapImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >
>>> >     void *map = img.resource(q).add_map(q, flags, blocking, origin, region);
>>> >
>>> > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps));
>>> > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps);
>>> > +   if (blocking)
>>> > +      hev().wait();
>>> > +
>>> > +   ret_object(rd_ev, hev);
>>> >     ret_error(r_errcode, CL_SUCCESS);
>>> >     return map;
>>> >
>>> > --
>>> > 2.13.3
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>> --
>> Jan Vesely <jan.ves...@rutgers.edu>