When multiview is enabled, queries must begin and end in the
same subpass and N consecutive queries are implicitly used,
where N is the number of views enabled in the subpass.
Implementations decide how results are split across queries.
In our case, only one query is really used, but we still need
to flag all N queries as available by the time we flag the one
we use so that the application doesn't receive unexpected errors
when trying to retrieve values from them.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12034>
The Vulkan spec states that when multiview is enabled the number of
layers in the framebuffer is set to one and that each attachment
must then have at least as many layers as referenced by view masks
in the subpasses in which is used.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12034>
With multilayered framebuffer we want to allocate enough tile state for
all layers involved, so te binner can handle layered rendering where
a geometry shader is used to redirect primitives to specific layers by
writing to gl_Layer.
However, we may also have layered framebuffers in cases where layered
rendering won't be used. Typically this will happen for meta copy/clear
operations, where we setup multilayered framebuffers but then we just
load and/or store the tile buffer without ever rendering a primitive,
let alone use a geometry shader to do layered rendering. In these cases
we can reduce the amount of tile state allocated to a sigle layer.
This patch allows us to specify if we should allocate tile state for all
layers when we start a new frame. We will take advantage of this in
later patches targetting the meta copy/clear code paths.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11923>
This was required to handle the case of secondary command buffers that
did not have framebuffer information available from the primary, since
we used to have an implementation that required this information to
be available for the fallback path of vkCmdClearAttachments. Now that
we can handle our our attachment clears in the current subpass by
emitting draw calls, we no longer need the framebuffer information to
be available and we can remove this.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11783>
Fix defect reported by Coverity Scan.
Side effect in assertion (ASSERT_SIDE_EFFECT)
assignment_where_comparison_intended: Assignment deviceMask = 1U
has a side effect. This code will work differently in a non-debug
build.
Fixes: 234e1b7356 ("v3dv: implement VK_KHR_device_group")
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11197>
This was added with VK_KHR_device_group and allows users to specify
a base offset that will be automatically added to gl_WorkGroupID.
Unfortunately, V3D doesn't support this natively, so we need to add
the base to the workgroup id generated by hardware manually. For this,
we inject add instructions that source from a QUNIFORM that will
retrieve the actual dispatch base from the compute job when it is
dispatched.
Since a compute shader can be dispatched with CmdDispatch and/or
CmdDispatchBase, we always need to add these additional add
instructions and use a base of (0,0,0) for regular dispatches.
Since we don't support any version of OpenGL with this dispatch
base functionality we can avoid the extra instructions there.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11037>
When a TSY barrier is hit, the entire supergroup will be synchronized.
If the supergoup is large and uses all available QPU threads it would
mean that we would sychronize and stall all running threads until all
of them reach the barrier, which may be inefficient.
This patch makes it so that if the compute shader has any such barriers
we limit the supergroup size so each supergroup only takes half of the
QPU threads available at most, so that if one supergroup hits a
barrier we have at least one other supergroup we can run, reducing
idle QPU time.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10541>
Each supergroup executes a number batches. Each batch has 16 elements
(one per QPU lane), except possibly the last batch which might be
incomplete. Until now, we packed a single workgroup in each supergroup,
which can lead to more incomplete batches and less efficient use
of the QPUs depending on the configuration of workgroups being dispatched.
This patch computes a number of workgroups per supergroup so that
we reduce or completely eliminate incomplete batches if possible.
It should be noted however, that TSY barriers act on supergroups,
so larger supergroups lead to larger syncpoints on barriers too.
A follow-up patch will try to find a good balance for compute shaders
that use such barriers.
This improves performance of the Sascha Willem's computecloth demo
by ~13%.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10541>
We were using the pipeline layout to discard uniform updates for
stages that don't use descriptors, but we can do better by keeping
track of the stages used by the specific dirty descriptor sets and
only update uniforms for stages that are included in those.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10283>
Dedicated BOs waste memory and are also a significant cause of CPU
overhead when applications use hundreds of them per frame due to
all the work the kernel has to do to page in all these BOs for a job.
The UE4 Vehicle demo was hitting this causing it to freeze and stutter
under 1fps.
The hardware allows us to setup groups of 16 queries in consecutive
4-byte addresses, requiring only that each group of 16 queries is
aligned to a 1024 byte boundary. With this change, we allocate all
the queries in a pool in a single BO and we assign them different
offsets based on the above restriction. This eliminates the freezes
and stutters in the Vehicle sample.
One caveat of this solution is that we can only wait or test for
completion of a query by testing if the GPU is still using its BO,
which basically means that we can only wait for all active queries
in a pool to complete and not just the ones being requested by the
API. Since the Vulkan recommendation is to use a different query
pool per frame this should not be a big issue though.
If this ever becomes a problem (for example if an application does't
follow the recommendation and instead allocates a single pool and
splits its queries between frames), we could try to group queries
in a pool into a number of BOs to try and find a balance, but for
now this should work fine in most cases.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10253>
This can be called outside a render pass so we should not expect to have
a job available. Also, we should not be emitting state here, instead we
should do in the pre-draw handler with all the other draw call state.
Fixes cases of crashes in RenderDoc when selecting elements in the
Event Browser.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10130>
We are providing a BO with the default attribute values for the
GL_SHADER_STATE_RECORD, that contains 16 vec4. Such default value for
each vec4 is (0, 0, 0, 1). As the attribute format could be int or
float, the "1" value needs to take into account the attribute format.
But in the practice, the most common case is all floats. So we create
one default attribute values BO assuming that all attributes will be
floats, and we store it at v3dv_device and only create a new one if a
int format type is defined. That allows to reduce the amount of BOs
needed.
Note that we could still try to reduce the amount of BOs used by the
pipelines if we create a bigger BO, and we just play with the
offsets. But as mentioned, that's not the usual, and would add an
extra complexity,so it is not a priority right now.
This makes the following test passing when disabling the pipeline
cache support:
dEQP-VK.api.object_management.max_concurrent.graphics_pipeline
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9845>
So for example, on v3dv_CmdDrawIndexed we can return early if
instanceCount is 0.
This fixes failures when using the simulator with tests with the
following pattern:
dEQP-VK.draw.instanced.draw_indexed_vk_primitive_topology*
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9820>
Until now we were always doing a two-step cache lookup, as we were
using the NIR shaders to fill up the key to lookup for the compiled
shaders. But since we were already generating the sha1 key with the
original SPIR-V shader (or its internal NIR representation) any info
we were collecting from from NIR is already implicit in the original
shader, so we can avoid using the NIR in most cases.
Because the v3d_key that is used to compile a shader is populated with
data coming directly from the NIR shader or produced during NIR
lowerings, we can't use it directly as part of the pipeline cache
entry. We could split them, but that would be confusing, so we add a
new struct, v3dv_pipeline_key used specifically to search for the
compiled shaders on the pipeline cache. v3d_key would be still used to
compile the shaders.
As we are using the same sha1 key for all compiled shaders in a
pipeline, we can also group all of them in the same cache entry, so we
don't need a lookup for each stage. This also allows to cache pipeline
data shared by all the stages (like the descriptor maps).
While we are here, we also create a single BO to store the assembly
for all the pipeline stages.
Finally, we remove the link to the variant on the pipeline stage
struct, to avoid the confusion of having two links to the same
data. This mostly means that we stop to use the pipeline stage
structures after the pipeline is created, so we can freed them.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9403>