On Thu, Nov 22, 2018 at 6:24 AM Nicolai Hähnle <nhaeh...@gmail.com> wrote:
> On 21.11.18 21:27, Marek Olšák wrote:
> > On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle <nhaeh...@gmail.com
> > <mailto:nhaeh...@gmail.com>> wrote:
> >
> >     From: Nicolai Hähnle <nicolai.haeh...@amd.com
> >     <mailto:nicolai.haeh...@amd.com>>
> >
> >     Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
> >     that specifies whether the caller will use buffer_unmap or not. The
> >     default behavior is set to permanent maps, because that's what drivers
> >     do for Gallium buffer maps.
> >
> >     This should eliminate the need for hacks in libdrm. Assertions are added
> >     to catch when the buffer_unmap calls don't match the (temporary)
> >     buffer_map calls.
> >
> >     I did my best to update r600 and r300 as well for completeness (yes,
> >     it's a no-op for r300 because it never calls buffer_unmap), even though
> >     the radeon winsys ignores the new flag.
> >
> > You didn't make any changes to r300.
>
> Yeah, that's what I wrote :)
>
> > You can also drop all r600 changes, because the radeon winsys doesn't care.
>
> I don't think it's a good idea, though. The interface of the two winsys
> is different, yes, but it's largely the same, and it makes sense to keep
> it that way conceptually. Not that it matters much for the code itself.
>
> [snip]
> >     +enum radeon_transfer_flags {
> >     +   /* Indicates that the caller will unmap the buffer.
> >     +    *
> >     +    * Not unmapping buffers is an important performance optimization for
> >     +    * OpenGL (avoids kernel overhead for frequently mapped buffers). However,
> >     +    * if you only map a buffer once and then use it indefinitely from the GPU,
> >     +    * it is much better to unmap it so that the kernel is free to move it to
> >     +    * non-visible VRAM.
> >
> > The second half of the comment is misleading. The kernel will move
> > buffers to invisible VRAM regardless of whether they're mapped, so CPU
> > mappings have no effect on the placement. Buffers are only moved back to
> > CPU-accessible memory on a CPU page fault. If a buffer is mapped and
> > there is no CPU access, it will stay in invisible VRAM forever. The general
> > recommendation is to keep those buffers mapped for CPU access just like
> > GTT buffers.
>
> Yeah, I'll change that.
>
> >     +    */
> >     +   RADEON_TRANSFER_TEMPORARY = (PIPE_TRANSFER_DRV_PRV << 0),
> >     +};
> >     +
> >      #define RADEON_SPARSE_PAGE_SIZE (64 * 1024)
> >
> >      enum ring_type {
> >         RING_GFX = 0,
> >         RING_COMPUTE,
> >         RING_DMA,
> >         RING_UVD,
> >         RING_VCE,
> >         RING_UVD_ENC,
> >         RING_VCN_DEC,
> >     @@ -287,23 +299,26 @@ struct radeon_winsys {
> >         struct pb_buffer *(*buffer_create)(struct radeon_winsys *ws,
> >                                            uint64_t size,
> >                                            unsigned alignment,
> >                                            enum radeon_bo_domain domain,
> >                                            enum radeon_bo_flag flags);
> >
> >         /**
> >          * Map the entire data store of a buffer object into the client's address
> >          * space.
> >          *
> >     +    * Callers are expected to unmap buffers again if and only if the
> >     +    * RADEON_TRANSFER_TEMPORARY flag is set in \p usage.
> >     +    *
> >          * \param buf       A winsys buffer object to map.
> >          * \param cs        A command stream to flush if the buffer is referenced by it.
> >     -    * \param usage     A bitmask of the PIPE_TRANSFER_* flags.
> >     +    * \param usage     A bitmask of the PIPE_TRANSFER_* and RADEON_TRANSFER_* flags.
> >          * \return          The pointer at the beginning of the buffer.
> >          */
> >         void *(*buffer_map)(struct pb_buffer *buf,
> >                             struct radeon_cmdbuf *cs,
> >                             enum pipe_transfer_usage usage);
> >
> >         /**
> >          * Unmap a buffer object from the client's address space.
> >          *
> >          * \param buf       A winsys buffer object to unmap.
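(For anyone skimming the thread, the contract documented above boils down
to the following. This is a hand-written sketch, not code from the patch;
the function names and the "ws"/"bo" arguments are placeholders, and the
usual mesa/winsys headers plus <string.h> are assumed.)

   /* Temporary map: the caller promises a matching buffer_unmap. */
   static void upload_once(struct radeon_winsys *ws, struct pb_buffer *bo,
                           const void *data, size_t size)
   {
      void *ptr = ws->buffer_map(bo, NULL,
                                 PIPE_TRANSFER_WRITE |
                                 PIPE_TRANSFER_UNSYNCHRONIZED |
                                 RADEON_TRANSFER_TEMPORARY);
      if (!ptr)
         return;
      memcpy(ptr, data, size);
      ws->buffer_unmap(bo);   /* required because TEMPORARY was set */
   }

   /* Default (permanent) map: no unmap. The mapping stays for the
    * buffer's lifetime and repeated maps return the same pointer;
    * calling buffer_unmap here is what the new assertions catch. */
   static void *map_persistently(struct radeon_winsys *ws, struct pb_buffer *bo)
   {
      return ws->buffer_map(bo, NULL, PIPE_TRANSFER_WRITE);
   }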
> >     diff --git a/src/gallium/drivers/radeonsi/si_shader.c
> >     b/src/gallium/drivers/radeonsi/si_shader.c
> >     index 19522cc97b1..d455fb5db6a 100644
> >     --- a/src/gallium/drivers/radeonsi/si_shader.c
> >     +++ b/src/gallium/drivers/radeonsi/si_shader.c
> >     @@ -5286,21 +5286,22 @@ int si_shader_binary_upload(struct si_screen *sscreen, struct si_shader *shader)
> >                                               0 : SI_RESOURCE_FLAG_READ_ONLY,
> >                                           PIPE_USAGE_IMMUTABLE,
> >                                           align(bo_size, SI_CPDMA_ALIGNMENT),
> >                                           256);
> >         if (!shader->bo)
> >            return -ENOMEM;
> >
> >         /* Upload. */
> >         ptr = sscreen->ws->buffer_map(shader->bo->buf, NULL,
> >                                       PIPE_TRANSFER_READ_WRITE |
> >     -                                 PIPE_TRANSFER_UNSYNCHRONIZED);
> >     +                                 PIPE_TRANSFER_UNSYNCHRONIZED |
> >     +                                 RADEON_TRANSFER_TEMPORARY);
> >
> >         /* Don't use util_memcpy_cpu_to_le32. LLVM binaries are
> >          * endian-independent. */
> >         if (prolog) {
> >            memcpy(ptr, prolog->code, prolog->code_size);
> >            ptr += prolog->code_size;
> >         }
> >         if (previous_stage) {
> >            memcpy(ptr, previous_stage->code,
> >                   previous_stage->code_size);
> >            ptr += previous_stage->code_size;
> >     diff --git a/src/gallium/include/pipe/p_defines.h
> >     b/src/gallium/include/pipe/p_defines.h
> >     index 693f041b1da..e99895d30d8 100644
> >     --- a/src/gallium/include/pipe/p_defines.h
> >     +++ b/src/gallium/include/pipe/p_defines.h
> >     @@ -334,21 +334,27 @@ enum pipe_transfer_usage
> >          */
> >         PIPE_TRANSFER_PERSISTENT = (1 << 13),
> >
> >         /**
> >          * If PERSISTENT is set, this ensures any writes done by the device are
> >          * immediately visible to the CPU and vice versa.
> >          *
> >          * PIPE_RESOURCE_FLAG_MAP_COHERENT must be set when creating
> >          * the resource.
> >          */
> >     -   PIPE_TRANSFER_COHERENT = (1 << 14)
> >     +   PIPE_TRANSFER_COHERENT = (1 << 14),
> >     +
> >     +   /**
> >     +    * This and higher bits are reserved for private use by drivers. Drivers
> >     +    * should use this as (PIPE_TRANSFER_DRV_PRV << i).
> >     +    */
> >     +   PIPE_TRANSFER_DRV_PRV = (1 << 24)
> >      };
> >
> >      /**
> >       * Flags for the flush function.
> >       */
> >      enum pipe_flush_flags
> >      {
> >         PIPE_FLUSH_END_OF_FRAME = (1 << 0),
> >         PIPE_FLUSH_DEFERRED = (1 << 1),
> >         PIPE_FLUSH_FENCE_FD = (1 << 2),
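(Side note on PIPE_TRANSFER_DRV_PRV, in case it isn't obvious how the
reserved range is meant to be consumed: the core flags currently stop at
bit 14, and drivers take bits from 24 upward. Sketch only; the second
flag below is invented for illustration and is not part of the patch.)

   enum radeon_transfer_flags {
      RADEON_TRANSFER_TEMPORARY      = (PIPE_TRANSFER_DRV_PRV << 0), /* bit 24, real */
      RADEON_TRANSFER_SOMETHING_ELSE = (PIPE_TRANSFER_DRV_PRV << 1), /* bit 25, invented */
   };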
> >     diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     index 9f0d4c12482..355018c76fc 100644
> >     --- a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     +++ b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     @@ -49,20 +49,21 @@
> >      struct amdgpu_sparse_backing_chunk {
> >         uint32_t begin, end;
> >      };
> >
> >      static struct pb_buffer *
> >      amdgpu_bo_create(struct radeon_winsys *rws,
> >                       uint64_t size,
> >                       unsigned alignment,
> >                       enum radeon_bo_domain domain,
> >                       enum radeon_bo_flag flags);
> >     +static void amdgpu_bo_unmap(struct pb_buffer *buf);
> >
> >      static bool amdgpu_bo_wait(struct pb_buffer *_buf, uint64_t timeout,
> >                                 enum radeon_bo_usage usage)
> >      {
> >         struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >         struct amdgpu_winsys *ws = bo->ws;
> >         int64_t abs_timeout;
> >
> >         if (timeout == 0) {
> >            if (p_atomic_read(&bo->num_active_ioctls))
> >     @@ -166,20 +167,26 @@ static void amdgpu_bo_remove_fences(struct amdgpu_winsys_bo *bo)
> >         bo->max_fences = 0;
> >      }
> >
> >      void amdgpu_bo_destroy(struct pb_buffer *_buf)
> >      {
> >         struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >         struct amdgpu_winsys *ws = bo->ws;
> >
> >         assert(bo->bo && "must not be called for slab entries");
> >
> >     +   if (!bo->is_user_ptr && bo->cpu_ptr) {
> >     +      bo->cpu_ptr = NULL;
> >     +      amdgpu_bo_unmap(&bo->base);
> >     +   }
> >     +   assert(bo->is_user_ptr || bo->u.real.map_count == 0);
> >     +
> >         if (ws->debug_all_bos) {
> >            simple_mtx_lock(&ws->global_bo_list_lock);
> >            LIST_DEL(&bo->u.real.global_list_item);
> >            ws->num_buffers--;
> >            simple_mtx_unlock(&ws->global_bo_list_lock);
> >         }
> >
> >         simple_mtx_lock(&ws->bo_export_table_lock);
> >         util_hash_table_remove(ws->bo_export_table, bo->bo);
> >         simple_mtx_unlock(&ws->bo_export_table_lock);
> >     @@ -188,54 +195,66 @@ void amdgpu_bo_destroy(struct pb_buffer *_buf)
> >         amdgpu_va_range_free(bo->u.real.va_handle);
> >         amdgpu_bo_free(bo->bo);
> >
> >         amdgpu_bo_remove_fences(bo);
> >
> >         if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> >            ws->allocated_vram -= align64(bo->base.size, ws->info.gart_page_size);
> >         else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> >            ws->allocated_gtt -= align64(bo->base.size, ws->info.gart_page_size);
> >
> >     -   if (bo->u.real.map_count >= 1) {
> >     -      if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> >     -         ws->mapped_vram -= bo->base.size;
> >     -      else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> >     -         ws->mapped_gtt -= bo->base.size;
> >     -      ws->num_mapped_buffers--;
> >     -   }
> >     -
> >         simple_mtx_destroy(&bo->lock);
> >         FREE(bo);
> >      }
> >
> >      static void amdgpu_bo_destroy_or_cache(struct pb_buffer *_buf)
> >      {
> >         struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >
> >         assert(bo->bo); /* slab buffers have a separate vtbl */
> >
> >         if (bo->u.real.use_reusable_pool)
> >            pb_cache_add_buffer(&bo->u.real.cache_entry);
> >         else
> >            amdgpu_bo_destroy(_buf);
> >      }
> >
> >     +static bool amdgpu_bo_do_map(struct amdgpu_winsys_bo *bo, void **cpu)
> >     +{
> >     +   assert(!bo->sparse && bo->bo && !bo->is_user_ptr);
> >     +   int r = amdgpu_bo_cpu_map(bo->bo, cpu);
> >     +   if (r) {
> >     +      /* Clear the cache and try again. */
> >     +      pb_cache_release_all_buffers(&bo->ws->bo_cache);
> >     +      r = amdgpu_bo_cpu_map(bo->bo, cpu);
> >     +      if (r)
> >     +         return false;
> >     +   }
> >     +
> >     +   if (p_atomic_inc_return(&bo->u.real.map_count) == 1) {
> >     +      if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> >     +         bo->ws->mapped_vram += bo->base.size;
> >     +      else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> >     +         bo->ws->mapped_gtt += bo->base.size;
> >     +      bo->ws->num_mapped_buffers++;
> >     +   }
> >     +
> >     +   return true;
> >     +}
> >     +
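(For context, the accounting in amdgpu_bo_do_map only makes sense
together with the unmap side, which isn't quoted here. Roughly it has to
look like this; a sketch of the idea only, not the actual amdgpu_bo_unmap
from the patch.)

   /* mapped_vram/mapped_gtt and num_mapped_buffers are only touched on
    * the first map and on the last unmap of a buffer. */
   static void sketch_unmap_accounting(struct amdgpu_winsys_bo *real)
   {
      assert(real->u.real.map_count > 0);
      if (p_atomic_dec_return(&real->u.real.map_count) == 0) {
         if (real->initial_domain & RADEON_DOMAIN_VRAM)
            real->ws->mapped_vram -= real->base.size;
         else if (real->initial_domain & RADEON_DOMAIN_GTT)
            real->ws->mapped_gtt -= real->base.size;
         real->ws->num_mapped_buffers--;
         amdgpu_bo_cpu_unmap(real->bo);
      }
   }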
> >      static void *amdgpu_bo_map(struct pb_buffer *buf,
> >                                 struct radeon_cmdbuf *rcs,
> >                                 enum pipe_transfer_usage usage)
> >      {
> >         struct amdgpu_winsys_bo *bo = (struct amdgpu_winsys_bo*)buf;
> >         struct amdgpu_winsys_bo *real;
> >         struct amdgpu_cs *cs = (struct amdgpu_cs*)rcs;
> >     -   int r;
> >     -   void *cpu = NULL;
> >     -   uint64_t offset = 0;
> >
> >         assert(!bo->sparse);
> >
> >         /* If it's not unsynchronized bo_map, flush CS if needed and then wait. */
> >         if (!(usage & PIPE_TRANSFER_UNSYNCHRONIZED)) {
> >            /* DONTBLOCK doesn't make sense with UNSYNCHRONIZED. */
> >            if (usage & PIPE_TRANSFER_DONTBLOCK) {
> >               if (!(usage & PIPE_TRANSFER_WRITE)) {
> >                  /* Mapping for read.
> >                   *
> >     @@ -306,63 +325,74 @@ static void *amdgpu_bo_map(struct pb_buffer *buf,
> >               }
> >
> >               amdgpu_bo_wait((struct pb_buffer*)bo, PIPE_TIMEOUT_INFINITE,
> >                              RADEON_USAGE_READWRITE);
> >            }
> >
> >            bo->ws->buffer_wait_time += os_time_get_nano() - time;
> >         }
> >      }
> >
> >     -   /* If the buffer is created from user memory, return the user pointer. */
> >     -   if (bo->user_ptr)
> >     -      return bo->user_ptr;
> >     +   /* Buffer synchronization has been checked, now actually map the buffer. */
> >     +   void *cpu = NULL;
> >     +   uint64_t offset = 0;
> >
> >         if (bo->bo) {
> >            real = bo;
> >         } else {
> >            real = bo->u.slab.real;
> >            offset = bo->va - real->va;
> >         }
> >
> >     -   r = amdgpu_bo_cpu_map(real->bo, &cpu);
> >     -   if (r) {
> >     -      /* Clear the cache and try again. */
> >     -      pb_cache_release_all_buffers(&real->ws->bo_cache);
> >     -      r = amdgpu_bo_cpu_map(real->bo, &cpu);
> >     -      if (r)
> >     -         return NULL;
> >     +   if (usage & RADEON_TRANSFER_TEMPORARY) {
> >     +      if (real->is_user_ptr) {
> >     +         cpu = real->cpu_ptr;
> >     +      } else {
> >     +         if (!amdgpu_bo_do_map(real, &cpu))
> >     +            return NULL;
> >     +      }
> >     +   } else {
> >     +      cpu = real->cpu_ptr;
> >     +      if (!cpu) {
> >
> > I think this is unsafe on 64-bit CPUs, because the lower 32 bits can be
> > initialized but not the upper 32 bits. You would either have to check
> > both 32-bit halves separately (if they are both expected to be non-zero)
> > or you would have to lock the mutex first.
>
> Well, there is actually a potential problem, but it's not that. A 64-bit
> CPU will do aligned 64-bit loads/stores atomically, but
> amdgpu_bo_cpu_map doesn't necessarily guarantee that it'll set the
> address atomically. I think it does that today, but it's a bad idea to
> depend on it.
>
> > If all my remarks are addressed, I'll ack this, but I still think that
> > this patch adds unnecessary cruft, complexity, and fragility in order to
> > fix an issue that is very trivial in nature. The overflow check in
> > libdrm is simple, clean, and robust.
>
> The overflow check, or actually the saturating behavior, is also very
> surprising and unusual. Reference counting doesn't usually work that way.

It's not reference counting. It's not even the map count. It's the map
call count. (I wrote the original code)

Marek
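P.S. For anyone who hasn't looked at the libdrm side, the distinction is
roughly this. Toy sketch only, not the actual libdrm code; the struct and
function names are made up.

   #include <limits.h>

   /* The counter is a map *call* count, not a refcount: every map call
    * bumps it even though the CPU mapping is created only once, so
    * callers that never unmap (Gallium's permanent maps) would
    * eventually overflow it without the saturating check. */
   struct toy_bo {
      void *cpu_ptr;       /* created on the first map call */
      int cpu_map_count;   /* number of map calls, saturating */
   };

   static void *toy_cpu_map(struct toy_bo *bo, void *(*do_mmap)(struct toy_bo *))
   {
      if (bo->cpu_map_count == 0)
         bo->cpu_ptr = do_mmap(bo);   /* the real mapping happens once */
      if (bo->cpu_map_count < INT_MAX)
         bo->cpu_map_count++;         /* saturate instead of wrapping */
      return bo->cpu_ptr;
   }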
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev