Jan Vesely <jan.ves...@rutgers.edu> writes:

> On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
>> Francisco Jerez <curroje...@riseup.net> writes:
>>
>> > Jan Vesely <jan.ves...@rutgers.edu> writes:
>> >
>> > > Hi,
>> > >
>> > > thanks for the detailed explanation. I indeed missed the
>> > > writeBuffer part in the specs.
>> > >
>> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> > > > These changes are somewhat redundant and potentially
>> > > > performance-impacting. The reason is that in the OpenCL API,
>> > > > clEnqueueWrite* commands are specified to block until the memory
>> > > > provided by the application as origin can be reused safely (i.e.
>> > > > until soft_copy_op()() runs), not necessarily until the transfer
>> > > > to graphics memory has completed (which is what
>> > > > hard_event::wait() will wait for). OTOH reads and maps as
>> > > > implemented by soft_copy_op and friends are essentially
>> > > > blocking, so the wait() call is redundant in most cases.
>> > >
>> > > That explains a noticeable slowdown running piglit with these
>> > > changes. I'm not sure about the read part though. I expected it to
>> > > be basically a no-op, but it changes behaviour.
>> >
>> > I think this change would have slowed you down the most whenever the
>> > mapping operation performed by soft_copy_op() is able to proceed
>> > immediately, either because the buffer is idle (so the driver
>> > doesn't stall on transfer_map()), *or* because the driver is trying
>> > to be smart and creates a bounce buffer into which the data can be
>> > copied immediately without stalling, then inserts a pipelined GPU
>> > copy from the bounce buffer into the real buffer. With this patch
>> > you will stall on the GPU copy regardless (and on whatever other
>> > work was already on the command stream, which might be substantial),
>> > even though it wouldn't have been necessary in either of these
>> > cases.
>> >
>> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> > > blocking read in one of the piglit tests surprisingly returns
>> > > CL_QUEUED.
>> > >
>> >
>> > Hmm, yeah, that seems like debatable behaviour, although it's
>> > definitely legit for writes, not quite sure about reads... I believe
>> > the reason it happens is that the CPU copy proceeds very quickly
>> > (due to the reasons mentioned in the last paragraph), but the
>> > hard_event is still associated with a pipe_fence synchronous with
>> > the GPU's command stream, so it won't get signalled until the GPU
>> > catches up.
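
(For reference, the check in question amounts to something like the
following -- hypothetical snippet, the queue/buf/size/ptr variables are
made up and error checking is omitted:)

   cl_event ev;
   cl_int status;

   /* Blocking read: ptr must hold the data by the time this returns. */
   clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, ptr, 0, NULL, &ev);

   /* On clover this may still report CL_QUEUED even though the data is
    * already available, for the reasons discussed above. */
   clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                  sizeof(status), &status, NULL);
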
>
> Hi,
> sorry for the delay, last week was submission week...
>

No worries.

> The part that I'm still missing is what kind of GPU work needs to be
> done after clEnqueueRead*(). I assume all the necessary work is
> completed before I can access the data.
> Also, the CL_QUEUED status was surprising. Since we performed at least
> some of the work (we got the data), I'd expect CL_RUNNING or
> CL_SUBMITTED at least.
>

The lag is not due to GPU work that needs to be performed after the
clEnqueueRead call, but due to GPU work that may precede it in the
command stream: because clover doesn't know when the buffer was last
referenced by GPU work, the event is associated with a fence synchronous
with the current batch (which obviously won't signal before any of the
GPU work that actually referenced the buffer completes). However, the
pipe driver has a more accurate idea of when the buffer was last used,
so the transfer_map() operation is likely to complete more quickly than
the CL event status changes to CL_COMPLETE.

The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is that the
read operation didn't even need to flush the current batch, so there is
no fence associated with the CL event object yet.

>> >
>> > > The specs don't mention the use of events with blocking reads, but
>> > > they do say that "When the read command has completed, the
>> > > contents of the buffer that ptr points to can be used by the
>> > > application." in the non-blocking section. I'd say the expectation
>> > > is for the event to be CL_COMPLETE after a blocking read (at least
>> > > beignet/pocl/intel-cpu-sdk follow this).
>> > >
>> > > >
>> > > > The only reason why it might be useful to behave differently on
>> > > > blocking transfers is that the application may have specified a
>> > > > user event in the event dependency list, which may cause the
>> > > > soft_copy_op() call to be delayed until the application signals
>> > > > the user event. In order to fix that it should really only be
>> > > > necessary to wait for the event action to be executed, not
>> > > > necessarily its associated GPU work.
>> > >
>> > > I think another use is that non-blocking writes do not create an
>> > > extra copy of the buffer. Thus
>> > >    clEnqueueWriteBuffer(..., cl_false, ev, ...)
>> > >    clWaitForEvents(ev)
>> > > is more memory efficient.
>> > >
>> > > >
>> > > > Last time this issue came up (yeah, it's not the first time) I
>> > > > proposed the patches below to add a mechanism to wait for the
>> > > > event action only; feel free to include them as PATCH 0.1 and
>> > > > 0.2 of this series (it's been a while, so they may no longer
>> > > > apply cleanly).
>> > >
>> > > I think we can just add comments explaining why the blocking
>> > > argument is ignored, until someone chooses to fix this problem.
>> >
>> > I think the problem is definitely worth fixing, and it shouldn't
>> > really take more effort than adding comments explaining the current
>> > behaviour ;). Basically, just add a bunch of 'if (blocking)
>> > hev().wait_signalled();' where the spec requires it, roughly as you
>> > had been doing in this patch, but wait_signalled() should only stall
>> > on the CPU work associated with the event, which should give you the
>> > same performance as the current approach.
>
> I can send a patch that replaces wait() -> wait_signalled()
>

Thanks :)
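
FWIW, applied to your clEnqueueReadBuffer hunk, what I have in mind
would look roughly like this (untested sketch; I'm reciting the
surrounding clover code from memory, so treat it as illustrative):

   auto hev = create<hard_event>(
      q, CL_COMMAND_READ_BUFFER, deps,
      soft_copy_op(q, ptr, {}, obj_pitch,
                   &mem, obj_origin, obj_pitch,
                   region));

   // Stall until the event action (the CPU-side copy) has executed,
   // which is the point at which the spec requires the call to return,
   // but not on whatever GPU work happens to be on the command stream.
   if (blocking)
      hev().wait_signalled();

   ret_object(rd_ev, hev);
   return CL_SUCCESS;
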
>> >
>> > > and/or to implement proper non-blocking variants (would
>> > > std::async work for trivial cases like ReadBuffer?)
>> > >
>>
>> Hm, and to answer this question -- yeah, std::async would probably
>> work, but I'm not certain whether it would actually perform better
>> than the current approach, because on the one hand the actual DMA-ing
>> of the buffer is likely to happen quasi-asynchronously already,
>> assuming the driver is competent, and OTOH spawning a new thread for
>> the copy would introduce additional overhead that might defeat your
>> purpose unless the copy is very large -- only experimentation will
>> tell whether it pays off.
>
> It was just speculation. It looks like Vedran is interested in
> implementing non-blocking reads/writes [0], so I'll leave it to him.
> r600 has bigger problems elsewhere atm.
>

Yeah, I'm aware of his work; I suspect the above are the reasons why he
got rather mixed performance results from his changes.
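
Just to illustrate the trade-off: the obvious std::async implementation
would amount to something like the following (hypothetical sketch, not
actual clover code; enqueue_read_async and its parameters are made up
for illustration):

   #include <cstring>
   #include <future>

   // Run the copy on a worker thread; the returned future becomes
   // ready once the application memory holds the data, at which point
   // the CL event could be signalled.
   std::future<void>
   enqueue_read_async(void *dst, const void *src, std::size_t size)
   {
      // std::launch::async forces a new thread -- for small copies the
      // thread-spawn overhead can easily exceed the copy itself, which
      // is why this may not pay off in practice.
      return std::async(std::launch::async, [=] {
         std::memcpy(dst, src, size);  // stands in for transfer_map() + copy
      });
   }
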
> thanks,
> Jan
>
> [0] https://bugs.freedesktop.org/show_bug.cgi?id=100199
>
>>
>> > > thanks,
>> > > Jan
>> > >
>> > > >
>> > > > Thank you.
>> > > >
>> > > > Jan Vesely <jan.ves...@rutgers.edu> writes:
>> > > >
>> > > > > v2: wait in map_buffer and map_image as well
>> > > > >
>> > > > > Signed-off-by: Jan Vesely <jan.ves...@rutgers.edu>
>> > > > > ---
>> > > > > Hi Aaron,
>> > > > >
>> > > > > yes, I think you're right, we should wait in Map* as well.
>> > > > > If nothing else it's consistent, even if passing the flag to
>> > > > > add_map might make it unnecessary (haven't actually checked).
>> > > > >
>> > > > > thanks,
>> > > > > Jan
>> > > > >
>> > > > >  src/gallium/state_trackers/clover/api/transfer.cpp | 30 ++++++++++++++++++++--
>> > > > >  1 file changed, 28 insertions(+), 2 deletions(-)
>> > > > >
>> > > > > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp b/src/gallium/state_trackers/clover/api/transfer.cpp
>> > > > > index f7046253be..729a34590e 100644
>> > > > > --- a/src/gallium/state_trackers/clover/api/transfer.cpp
>> > > > > +++ b/src/gallium/state_trackers/clover/api/transfer.cpp
>> > > > > @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                                &mem, obj_origin, obj_pitch,
>> > > > >                                region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                                ptr, {}, obj_pitch,
>> > > > >                                region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                                &mem, obj_origin, obj_pitch,
>> > > > >                                region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                                ptr, host_origin, host_pitch,
>> > > > >                                region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                                &img, src_origin, src_pitch,
>> > > > >                                region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                                ptr, {}, src_pitch,
>> > > > >                                region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -667,7 +685,11 @@ clEnqueueMapBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >  
>> > > > >     void *map = mem.resource(q).add_map(q, flags, blocking,
>> > > > >                                         obj_origin, region);
>> > > > >  
>> > > > > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps));
>> > > > > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps);
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > > +   ret_object(rd_ev, hev);
>> > > > >     ret_error(r_errcode, CL_SUCCESS);
>> > > > >     return map;
>> > > > >  
>> > > > > @@ -695,7 +717,11 @@ clEnqueueMapImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >  
>> > > > >     void *map = img.resource(q).add_map(q, flags, blocking,
>> > > > >                                         origin, region);
>> > > > >  
>> > > > > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps));
>> > > > > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps);
>> > > > > +   if (blocking)
>> > > > > +      hev().wait();
>> > > > > +
>> > > > > +   ret_object(rd_ev, hev);
>> > > > >     ret_error(r_errcode, CL_SUCCESS);
>> > > > >     return map;
>> > > > >
>> > > > > --
>> > > > > 2.13.3
>> > >
>> > > --
>> > > Jan Vesely <jan.ves...@rutgers.edu>
>
> --
> Jan Vesely <jan.ves...@rutgers.edu>