On Mon, Aug 14, 2017 at 4:29 PM, Aaron Watry <awa...@gmail.com> wrote: > On Mon, Aug 14, 2017 at 1:53 PM, Francisco Jerez <curroje...@riseup.net> > wrote: >> Aaron Watry <awa...@gmail.com> writes: >> >>> On Sat, Aug 12, 2017 at 10:14 PM, Francisco Jerez <curroje...@riseup.net> >>> wrote: >>>> Jan Vesely <jan.ves...@rutgers.edu> writes: >>>> >>>>> On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote: >>>>>> Francisco Jerez <curroje...@riseup.net> writes: >>>>>> >>>>>> > Jan Vesely <jan.ves...@rutgers.edu> writes: >>>>>> > >>>>>> > > Hi, >>>>>> > > >>>>>> > > thanks for detailed explanation. I indeed missed the writeBuffer part >>>>>> > > in specs. >>>>>> > > >>>>>> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote: >>>>>> > > > These changes are somewhat redundant and potentially >>>>>> > > > performance-impacting, the reason is that in the OpenCL API, >>>>>> > > > clEnqueueWrite* commands are specified to block until the memory >>>>>> > > > provided by the application as origin can be reused safely (i.e. >>>>>> > > > until >>>>>> > > > soft_copy_op()() runs), not necessarily until the transfer to >>>>>> > > > graphics >>>>>> > > > memory has completed (which is what hard_event::wait() will wait >>>>>> > > > for). >>>>>> > > > OTOH reads and maps as implemented by soft_copy_op and friends are >>>>>> > > > essentially blocking so the wait() call is redundant in most cases. >>>>>> > > >>>>>> > > That explains a noticeable slowdown running piglit with these >>>>>> > > changes. >>>>>> > > I'm not sure about the read part though. I expected it to be >>>>>> > > basically >>>>>> > > a noop, but it changes behaviour. >>>>>> > >>>>>> > I think this change would have slowed you down the most whenever the >>>>>> > mapping operation performed by soft_copy_op() is able to proceed >>>>>> > immediately, either because the buffer is idle (so the driver doesn't >>>>>> > stall on transfer_map()) *or* because the driver is trying to be smart >>>>>> > and creates a bounce buffer where data can be copied into immediately >>>>>> > without stalling, then inserts a pipelined GPU copy from the bounce >>>>>> > buffer into the real buffer. With this patch you will stall on the GPU >>>>>> > copy regardless (and whatever other work was already on the command >>>>>> > stream which might be substantial), even though it wouldn't have been >>>>>> > necessary in any of these cases. >>>>>> > >>>>>> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a >>>>>> > > blocking read in one of the piglit tests surprisingly returns >>>>>> > > CL_QUEUED. >>>>>> > > >>>>>> > >>>>>> > Hmm, yeah, that seems kind of debatable behaviour, although it's >>>>>> > definitely legit for writes, not quite sure for reads... I believe the >>>>>> > reason why that happens is because the CPU copy proceeds very quickly >>>>>> > (due to the reasons mentioned in the last paragraph), but the >>>>>> > hard_event >>>>>> > is still associated with a pipe_fence synchronous with the GPU's >>>>>> > command >>>>>> > stream, so it won't get signalled until the GPU catches up. >>>>> >>>>> Hi, >>>>> sorry for the delay, last week was submission week... >>>>> >>>> >>>> No worries. >>>> >>>>> The part that I'm still missing is what kind of GPU work needs to be >>>>> done after clEnqueueRead*(). I assume all necessary work is completed >>>>> before I can access the data. >>>>> Also CL_QUEUED status was surprising. since we performed at least some >>>>> of the work (we got the data), I'd expect CL_RUNNING or CL_SUBMITTED at >>>>> least. >>>>> >>>> >>>> The lag is not due to GPU work that needs to be performed after the >>>> clEnqueueRead call, but due to GPU work that may precede it in the >>>> command stream: Because clover doesn't know when the last time was that >>>> the buffer was referenced by GPU work, the event is associated with a >>>> fence synchronous with the current batch (which obviously won't signal >>>> before any of the GPU work that actually referenced the buffer >>>> completes). However the pipe driver has a more accurate idea of when >>>> the buffer was used last, so the transfer_map() operation is likely to >>>> complete more quickly than the CL event status changes to CL_COMPLETE. >>>> The reason you're seeing CL_QUEUED rather than CL_SUBMITTED is because >>>> the read operation didn't even need to flush the current batch, so >>>> there's no fence associated with the CL event object yet. >>> >>> Speaking of event status issues, I've been sitting on the attached >>> patch (and some others) until my current series dealing with language >>> versions is dealt with. >>> >>> Basically, our clSetEventCallback implementation is ignoring several >>> event statuses that *should* be triggering the callbacks, and is >>> instead generating errors which cause CTS failures. >>> >>> --Aaron >>> >>>> >>>>>> > >>>>>> > > The specs don't mention use of events with blocking read, but it does >>>>>> > > say that "When the read command has completed, the contents of the >>>>>> > > buffer that ptr points to can be used by the application." in the >>>>>> > > non- >>>>>> > > blocking section. I'd say that the expectation is for the event to be >>>>>> > > CL_COMPLETE after blocking read (at least beignet/pocl/intel-cpu-sdk >>>>>> > > follow this). >>>>>> > > >>>>>> > > > >>>>>> > > > The only reason why it might be useful to behave differently on >>>>>> > > > blocking >>>>>> > > > transfers is that the application may have specified a user event >>>>>> > > > in the >>>>>> > > > event dependency list, which may cause the soft_copy_op() call to >>>>>> > > > be >>>>>> > > > delayed until the application signals the user event. In order to >>>>>> > > > fix >>>>>> > > > that it should really only be necessary to wait for the event >>>>>> > > > action to >>>>>> > > > be executed, not necessarily its associated GPU work. >>>>>> > > >>>>>> > > I think that another use is that non-blocking writes do not create an >>>>>> > > extra copy of the buffer. Thus >>>>>> > > clEnqueueWriteBuffer(...,cl_false, ev, ...) >>>>>> > > clWaitForEvents(ev) >>>>>> > > is more memory efficient. >>>>>> > > >>>>>> > > > >>>>>> > > > Last time this issue came up (yeah it's not the first time) I >>>>>> > > > proposed >>>>>> > > > the patches below to add a mechanism to wait for the event action >>>>>> > > > only, >>>>>> > > > feel free to include it as PATCH 0.1 and 0.2 of this series (it's >>>>>> > > > been a >>>>>> > > > while so they may no longer apply cleanly). >>>>>> > > >>>>>> > > I think we can just add comments explaining why the blocking argument >>>>>> > > is ignored, until someone chooses to fix this problem >>>>>> > >>>>>> > I think the problem is definitely worth fixing, and it shouldn't really >>>>>> > take more effort than adding comments explaining the current behaviour >>>>>> > ;), basically just add a bunch of 'if (blocking) >>>>>> > hev().wait_signalled();' where the spec requires it, roughly as you had >>>>>> > been doing in this patch, but wait_signalled() should only stall on the >>>>>> > CPU work associated with the event, which should give you the same >>>>>> > performance as the current approach. >>>>> >>>>> I can send a patch that replaces wait() -> wait_signalled() >>>>> >>>> >>>> Thanks :) >>>> >>>>>> > >>>>>> > > and/or to >>>>>> > > implement proper non-blocking variants (would std::async work for >>>>>> > > trivial cases like ReadBuffer?) >>>>>> > > >>>>>> >>>>>> Hm, and to answer this question -- Yeah, std::async would probably work, >>>>>> but I'm not certain whether it would actually perform better than the >>>>>> current approach, because on the one hand the actual DMA-ing of the >>>>>> buffer is likely to happen quasi-asynchronously already assuming the >>>>>> driver is competent, and OTOH because spawning a new thread for the copy >>>>>> would introduce additional overhead that might defeat your purpose >>>>>> unless the copy is very large -- Only experimentation will tell whether >>>>>> it pays off. >>>>> >>>>> it was just a speculation. it looks like Vedran is interested in >>>>> implementing non-blocking reads/writes[0] so I'll leave it to him. r600 >>>>> has bigger problems elsewhere atm. >>>>> >>>> >>>> Yeah, I'm aware of his work, I suspect the above are the reasons why he >>>> got rather mixed performance results from his changes. >>>> >>>>> thanks, >>>>> Jan >>>>> >>>>> [0]https://bugs.freedesktop.org/show_bug.cgi?id=100199 >>>>> >>>>>> >>>>>> > > thanks, >>>>>> > > Jan >>>>>> > > >>>>>> > > > >>>>>> > > > Thank you. >>>>>> > > > >>>>>> > > > Jan Vesely <jan.ves...@rutgers.edu> writes: >>>>>> > > > >>>>>> > > > > v2: wait in map_buffer and map_image as well >>>>>> > > > > >>>>>> > > > > Signed-off-by: Jan Vesely <jan.ves...@rutgers.edu> >>>>>> > > > > --- >>>>>> > > > > Hi Aaron, >>>>>> > > > > >>>>>> > > > > yes, I think you're right, we should wait in Map* as well. >>>>>> > > > > If nothing else it's consistent, even if passing the flag to >>>>>> > > > > add_map might make it unnecessary (haven't actually checked). >>>>>> > > > > >>>>>> > > > > thanks, >>>>>> > > > > Jan >>>>>> > > > > >>>>>> > > > > src/gallium/state_trackers/clover/api/transfer.cpp | 30 >>>>>> > > > > ++++++++++++++++++++-- >>>>>> > > > > 1 file changed, 28 insertions(+), 2 deletions(-) >>>>>> > > > > >>>>>> > > > > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp >>>>>> > > > > b/src/gallium/state_trackers/clover/api/transfer.cpp >>>>>> > > > > index f7046253be..729a34590e 100644 >>>>>> > > > > --- a/src/gallium/state_trackers/clover/api/transfer.cpp >>>>>> > > > > +++ b/src/gallium/state_trackers/clover/api/transfer.cpp >>>>>> > > > > @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, >>>>>> > > > > cl_mem d_mem, cl_bool blocking, >>>>>> > > > > &mem, obj_origin, obj_pitch, >>>>>> > > > > region)); >>>>>> > > > > >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > ret_object(rd_ev, hev); >>>>>> > > > > return CL_SUCCESS; >>>>>> > > > > >>>>>> > > > > @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, >>>>>> > > > > cl_mem d_mem, cl_bool blocking, >>>>>> > > > > ptr, {}, obj_pitch, >>>>>> > > > > region)); >>>>>> > > > > >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > ret_object(rd_ev, hev); >>>>>> > > > > return CL_SUCCESS; >>>>>> > > > > >>>>>> > > > > @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue >>>>>> > > > > d_q, cl_mem d_mem, cl_bool blocking, >>>>>> > > > > &mem, obj_origin, obj_pitch, >>>>>> > > > > region)); >>>>>> > > > > >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > ret_object(rd_ev, hev); >>>>>> > > > > return CL_SUCCESS; >>>>>> > > > > >>>>>> > > > > @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue >>>>>> > > > > d_q, cl_mem d_mem, cl_bool blocking, >>>>>> > > > > ptr, host_origin, host_pitch, >>>>>> > > > > region)); >>>>>> > > > > >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > ret_object(rd_ev, hev); >>>>>> > > > > return CL_SUCCESS; >>>>>> > > > > >>>>>> > > > > @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, >>>>>> > > > > cl_mem d_mem, cl_bool blocking, >>>>>> > > > > &img, src_origin, src_pitch, >>>>>> > > > > region)); >>>>>> > > > > >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > ret_object(rd_ev, hev); >>>>>> > > > > return CL_SUCCESS; >>>>>> > > > > >>>>>> > > > > @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, >>>>>> > > > > cl_mem d_mem, cl_bool blocking, >>>>>> > > > > ptr, {}, src_pitch, >>>>>> > > > > region)); >>>>>> > > > > >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > ret_object(rd_ev, hev); >>>>>> > > > > return CL_SUCCESS; >>>>>> > > > > >>>>>> > > > > @@ -667,7 +685,11 @@ clEnqueueMapBuffer(cl_command_queue d_q, >>>>>> > > > > cl_mem d_mem, cl_bool blocking, >>>>>> > > > > >>>>>> > > > > void *map = mem.resource(q).add_map(q, flags, blocking, >>>>>> > > > > obj_origin, region); >>>>>> > > > > >>>>>> > > > > - ret_object(rd_ev, create<hard_event>(q, >>>>>> > > > > CL_COMMAND_MAP_BUFFER, deps)); >>>>>> > > > > + auto hev = create<hard_event>(q, CL_COMMAND_MAP_BUFFER, >>>>>> > > > > deps); >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > + ret_object(rd_ev, hev); >>>>>> > > > > ret_error(r_errcode, CL_SUCCESS); >>>>>> > > > > return map; >>>>>> > > > > >>>>>> > > > > @@ -695,7 +717,11 @@ clEnqueueMapImage(cl_command_queue d_q, >>>>>> > > > > cl_mem d_mem, cl_bool blocking, >>>>>> > > > > >>>>>> > > > > void *map = img.resource(q).add_map(q, flags, blocking, >>>>>> > > > > origin, region); >>>>>> > > > > >>>>>> > > > > - ret_object(rd_ev, create<hard_event>(q, >>>>>> > > > > CL_COMMAND_MAP_IMAGE, deps)); >>>>>> > > > > + auto hev = create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps); >>>>>> > > > > + if (blocking) >>>>>> > > > > + hev().wait(); >>>>>> > > > > + >>>>>> > > > > + ret_object(rd_ev, hev); >>>>>> > > > > ret_error(r_errcode, CL_SUCCESS); >>>>>> > > > > return map; >>>>>> > > > > >>>>>> > > > > -- >>>>>> > > > > 2.13.3 >>>>>> > > > >>>>>> > > > _______________________________________________ >>>>>> > > > mesa-dev mailing list >>>>>> > > > mesa-dev@lists.freedesktop.org >>>>>> > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>>> > > >>>>>> > > -- >>>>>> > > Jan Vesely <jan.ves...@rutgers.edu> >>>>> >>>>> -- >>>>> Jan Vesely <jan.ves...@rutgers.edu> >>>> >>>> _______________________________________________ >>>> mesa-dev mailing list >>>> mesa-dev@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>> >>> From ef827d9b06c2061d9eb198f202399d90ea261208 Mon Sep 17 00:00:00 2001 >>> From: Aaron Watry <awa...@gmail.com> >>> Date: Thu, 3 Aug 2017 20:55:18 -0500 >>> Subject: [PATCH] clover/event: Include additional event statuses for >>> clSetEventCallback >>> >>> From CL 1.2 Section 5.9: >>> The registered callback function will be called when the execution >>> status of command associated with event changes to an execution >>> status equal to or past the status specified by command_exec_status. >>> >>> CL_COMPLETE is equal to or past any of: submitted/queued/running. >>> >> >> That quotation doesn't really imply that other event status codes should >> be accepted. In fact the same section of the same CL spec claims: >> >> "clSetEventCallback returns CL_SUCCESS if the function is executed >> successfully. Otherwise, it returns one of the following errors: [..] >> CL_INVALID_VALUE if [..] command_exec_callback_type is not CL_COMPLETE." >> >> Is the spec contradicting itself? > > I think it might be. The quote that you have from above (page 184 of > the 1.2 spec) indicates that CL_COMPLETE is the only valid status in > this case, but if you check out the previous page (183):
Looks like they clarified this in the CL 2.0 spec. CL_COMPLETE, CL_SUBMITTED, CL_RUNNING are all valid values. CL_QUEUED is NOT in that list. --Aaron > > command_exec_callback_type is described as: > specifies the command execution status for which the callback is > registered. The command execution callback values for which a > callback can be registered are: > CL_SUBMITTED , CL_RUNNING or CL_COMPLETE[20] . There is no guarantee > that the callback > functions registered for various execution status values for an > event will be called in the exact > order that the execution status of a command changes. Furthermore, > it should be noted that > receiving a call back for an event with a status other than > CL_COMPLETE , in no way implies that > the memory model or execution model as defined by the OpenCL > specification has changed. For > example, it is not valid to assume that a corresponding memory > transfer has completed unless the > event is in a state CL_COMPLETE . > > Footnote 20 is: > The callback function registered for a command_exec_callback_type > value of CL_COMPLETE will be called > when the command has completed successfully or is abnormally terminated. > > > >> >>> Fixes: OpenCL CTS test_conformance/events/test_events callbacks >>> >>> Signed-off-by: Aaron Watry <awa...@gmail.com >>> --- >>> src/gallium/state_trackers/clover/api/event.cpp | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/src/gallium/state_trackers/clover/api/event.cpp >>> b/src/gallium/state_trackers/clover/api/event.cpp >>> index 5d1a0e52c5..bb7f6ed9f0 100644 >>> --- a/src/gallium/state_trackers/clover/api/event.cpp >>> +++ b/src/gallium/state_trackers/clover/api/event.cpp >>> @@ -126,7 +126,10 @@ clSetEventCallback(cl_event d_ev, cl_int type, >>> void *user_data) try { >>> auto &ev = obj(d_ev); >>> >>> - if (!pfn_notify || type != CL_COMPLETE) >>> + if (!pfn_notify || >>> + (type != CL_COMPLETE && type != CL_SUBMITTED && >>> + type != CL_QUEUED && type != CL_RUNNING >> >> Redundant line break. Also I don't think CL_QUEUED should be accepted. > > Yeah, you're right about CL_QUEUED. I'll remove that before submitting > to the ML. Just to note: The CTS for 1.2 does specifically test for > CL_SUBMITTED/CL_RUNNING/CL_COMPLETED. > > Regarding the line break, I can remove it. I just like to line up my > opening/closing parentheses that way, which also happens to be > consistent with the programmatically-enforced coding standards for > what I do in my day job. Some habits are hard to break. > > --Aaron > >> >>> + )) >>> throw error(CL_INVALID_VALUE); >>> >>> // Create a temporary soft event that depends on ev, with >>> -- >>> 2.11.0 _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev