For Tiger Lake and onward, we generally don't need to ambiguate the CCS
before accessing it. This is safe for two reasons:
- Tiger Lake and onward treat all CCS values as legal.
- We enable compression on all writable image layouts. The CCS will
receive all writes and will therefore always be valid.
When dealing with modifiers, we continue to allow ambiguates in some
instances.
Before this patch, I found ~19.5k ambiguates in Wolfenstein:
Youngblood's Riverside benchmark (note that this includes manually
entering the benchmark and exiting the app). With this patch, the number
of ambiguates goes down to zero.
Improves performance of Fallout 4 at 1080p/High settings on Arc A380 by
around 22%.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20118>
Fixes a number of CTS patterns on DG2 :
- dEQP-VK.dynamic_rendering.primary_cmd_buff.random*
- dEQP-VK.draw.*secondary_cmd*
- dEQP-VK.dynamic_rendering.*secondary_cmd*
- dEQP-VK.geometry.*secondary_cmd_buffer
- dEQP-VK.multiview.*secondary_cmd*
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 9c1c1888d9 ("intel/fs: put scratch surface in the surface state heap")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19946>
In 4ceaed7839 we made scratch surface state allocations part of the
internal heap (mapped to STATE_BASE_ADDRESS::SurfaceStateBaseAddress)
so that it doesn't uses slots in the application's expected 1M
descriptors (especially with vkd3d-proton).
But all our compiler code relies on BSS
(STATE_BASE_ADDRESS::BindlessSurfaceStateBaseAddress).
The additional issue is that there is only 26bits of surface offset
available in CS instruction (CFE_STATE, 3DSTATE_VS, etc...) for
scratch surfaces. So we need the drivers to put the scratch surfaces
in the first chunk of STATE_BASE_ADDRESS::SurfaceStateBaseAddress
(hence all the driver changes).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 4ceaed7839 ("anv: split internal surface states from descriptors")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7687
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19727>
On Intel HW we use the same mechanism for internal operations surfaces
as well as application surfaces (VkDescriptor).
This change splits the surface pool in 2, one part dedicated to
internal allocations, the other to application VkDescriptors.
To do so, the STATE_BASE_ADDRESS::SurfaceStateBaseAddress points to a
4Gb area, with the following layout :
- 1Gb of binding table pool
- 2Gb of internal surface states
- 1Gb of bindless surface states
That way any entry from the binding table can refer to both internal &
bindless surface states but none of the driver allocations interfere
with the allocation of the application.
Based off a change from Sviatoslav Peleshko.
v2: Allocate image view null surface state from bindless heap (Sviatoslav)
Removed debug stuff (Sviatoslav)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7110
Cc: mesa-stable
Tested-by: Sviatoslav Peleshko <sviatoslav.peleshko@globallogic.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19275>
Zink on Anv running Gfxbench gl_driver2 is significantly slower than
Iris.
The reason is simple, whereas Iris implements uniform updates using
push constants and only has to emit 3DSTATE_CONSTANT_* packets, Zink
uses push descriptors with a uniform buffer, which on our
implementation use both push constants & binding tables.
Anv ends up doing the following for each uniform update :
- allocate 2 surface states :
- one for the uniform buffer as the offset specify by zink
- one for the descriptor set buffer
- pack the 2 RENDER_SURFACE_STATE
- re-emit binding tables
- re-emit push constants
Of all of those operations, only the last one ends up being useful in
this benchmark because all the uniforms have been promoted to push
constants.
This change defers the 3 first operations at draw time and executes
them only if the pipeline needs them.
Vkoverhead before / after :
descriptor_template_1ubo_push: 40670 / 85786
descriptor_template_12ubo_push: 4050 / 13820
descriptor_template_1combined_sampler_push, 34410 / 34043
descriptor_template_16combined_sampler_push, 2746 / 2711
descriptor_template_1sampled_image_push, 34765 / 34089
descriptor_template_16sampled_image_push, 2794 / 2649
descriptor_template_1texelbuffer_push, 108537 / 111342
descriptor_template_16texelbuffer_push, 20619 / 20166
descriptor_template_1ssbo_push, 41506 / 85976
descriptor_template_8ssbo_push, 6036 / 18703
descriptor_template_1image_push, 88932 / 89610
descriptor_template_16image_push, 20937 / 20959
descriptor_template_1imagebuffer_push, 108407 / 113240
descriptor_template_16imagebuffer_push, 32661 / 34651
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19050>
This change implements 3 states in one go:
- depth clamp enable
- depth clip enable
- depth clip negative one to one
This affects following packets:
3DSTATE_CLIP
3DSTATE_VIEWPORT_STATE_POINTERS_CC
3DSTATE_RASTER
v2: remove clip enable bit check from viewport emit (Lionel)
v3: use helper function from runtime to get depth clip (Lionel)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18879>
On DG2 the HW will fetch the binding entries into the cache
for every single thread when a compute walker is dispatched,
wiping out the advantages of the cache prefetch.
The spec also advises to not do a cache prefetch when we have more than
31 binding table entries, but most real world applications will never
hit that limit.
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18498>
Things are a bit confusing because we use the term "scratch" for 2
different things :
* the buffer for register allocation spilling
* the buffer for storing live values between splitted shaders around shader calls
Here we're fixing the missing register allocation spilling buffer.
v2: update comments (Caio)
fix scratch bo size computation with pipeline libraries (Lionel)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16970>
Intel has different Z interpolation float point rounding
than other mesa gpus
For example gl_Position.z = 0.0 will be interpolated to
gl_FragCoord.z = 0.5 for all gpus
gl_FragCoord = -0.00000001 will be interpolated to
gl_FragCoord.z = 0.4999999702 for Intel
and rounded to gl_FragCoord.z = 0.5 for other gpus
Games with LEQUAL depth func will fail depth test on Intel
and will pass it on other gpus in such case
This workaround lowers translated depth range
and several gl_FragCoord.z coords with extra small difference
will be translated to the same UINT16\UINT24\UINT32
value of an integer depth buffer
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7199
Signed-off-by: Illia Polishchuk <illia.a.polishchuk@globallogic.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18412>