This enables a couple of TBIMR performance tunables in
CHICKEN_RASTER_2 that default to disabled. TBIMR fast clip appears to
help slightly with some geometry-bound workloads. TBIMR open batch
allows the rasterizer to start working immediately on the first tile
of the framebuffer, even before the batch has been closed, which helps
reduce the latency cost of the tile walk.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25493>
This sets up the basic parameters needed for tiled rendering based on
a back-of-the-envelope estimate of the amount of memory used by the
pixel pipeline during the tile pass. The actual cache footprint of a
tile can vary wildly based on runtime factors which aren't easily
predictable based on static analysis, so this is only intended to
provide a rough approximation within the right order of magnitude.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25493>
This may help debugging performance problems in the possible case that
TBIMR negatively impacts the performance of some application. It could
also allow applying application-specific band-aid fixes in the XML file
until a more general workaround is implemented.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25493>
L3 configurations with an ALL partition of 128 ways per bank or more
cannot be represented with the normal L3ALLOC partitioning mechanism
since the "All L3 client pool" field would overflow, instead the
L3FullWayAllocationEnable bit has to be set, which causes the whole L3
to be used in a unified cache configuration.
That's precisely the configuration we're currently using on recent
platforms, but previously we were relying on the L3 config tables
being empty and the selected L3 configuration being a NULL pointer to
detect this condition. This is about change, the L3 configuration
structure will be defined for gfx12.5+ platforms since they provide
useful information about the cache hierarchy to the drivers. Instead
of checking whether the pointer is NULL in order to apply a unified L3
cache configuration, use it when there is a single ALL partition
larger than can be represented via L3ALLOC.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25493>
Doing something like this is allowed :
vkCreateGraphicsPipeline(.., scissorState, &pipeline);
vkCmdBindPipeline(pipeline);
vkCmdSetScissor(...)
vkCmdBindPipeline(pipeline)
If we don't reapply the pipeline dynamic state, the command buffer
runtime state will keep the dynamic state set in between the 2 binds.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25915>
This helps fix a performance regression on games such as F1 22 and RDR2.
Turning on non zero fast clears causes additional partial resolves for
these games that degrades performance. Let's turn off non zero fast
clears till we can eliminate the partial resolves.
Signed-off-by: Rohan Garg <rohan.garg@intel.com>
Reviewed-by: Nanley Chery <nanley.g.chery@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25863>
It was never actually async as it was doing a DRM_IOCTL_SYNCOBJ_WAIT
right after DRM_IOCTL_XE_VM_BIND but it was required to allow the
partial binds required by sparse.
But it is now fixed and we can switch back to sync vm bind.
In future we will switch back to async vm bind to improve performance
but this time it will be properly implemented.
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25300>
Sync xe_drm.h with commit xxxxx ("drm/xe/uapi: Fix naming of XE_QUERY_CONFIG_MAX_EXEC_QUEUE_PRIORITY").
One not so straght forward change is that sync VM binds now don't
require a syncobj anymore, the uAPI will return as soon the VM bind
operations are done.
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25300>
At image bind time, we require BOs to meet aux-map alignment
requirements in order to enable CCS on images. This is a heuristic
controlled by anv_bo_allows_aux_map().
To improve the chances of getting a properly aligned BO, we make use of
the dedicated allocation extension. Firstly, we report to applications a
preference for dedicated memory if an image would like to use the aux
map. Secondly, we align the VMA for dedicated allocations to meet
aux-map requirements.
To make enabling modifiers much easier on integrated gfx12, report
dedicated allocations as a requirement for modifiers which specify CCS.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25003>
At image bind time, if an image's addresses can be placed into the
aux-map without causing conflicts with a pre-existing mapping, do so.
The code aux management code in the binding function operates on a
per-plane basis. So, use the per-plane CCS memory range from the image
rather than the CCS memory region for the entire BO.
Another way to avoid aux-map conflicts is to rely solely on having a
dedicated allocation for an image. Unfortunately, not all workloads
change their behavior when drivers report a preference for dedicated
allocations. In particular, 3DMark Wild Life Extreme does not make more
dedicated allocations and such a solution was measured to perform ~16%
worse than this solution. With this solution, I did not measure a loss
of CCS on that benchmark.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/6304
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25003>
Make intel_aux_map_add_mapping return false if a mapping is attempted
that would conflict with an existing one. If this function doesn't
return false, it will either fail to return or return true.
The Vulkan driver will make use of this feature to opportunistically
enable CCS if a BO's VMA range has not been already mapped.
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25003>
We need to shut down the runtime queue threads before tearing down
anything else.
Gets rid of helgrind errors like this :
==212772== Possible data race during write of size 4 at 0xADCBFB0 by thread #1
==212772== Locks held: 1, at address 0x6B8F260
==212772== at 0x8AC3EFF: simple_mtx_destroy (simple_mtx.h:97)
==212772== by 0x8ACB24D: intel_ds_device_fini (intel_driver_ds.cc:603)
==212772== by 0x6CBD4D4: anv_device_utrace_finish (anv_utrace.c:471)
==212772== by 0x6C71577: anv_DestroyDevice (anv_device.c:3679)
==212772== by 0x6B2F1E2: loader_layer_destroy_device (loader.c:4358)
==212772== by 0x6B3F10B: vkDestroyDevice (trampoline.c:983)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: cc5843a573 ("anv: implement u_trace support")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10010
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25805>
The following instructions :
- 3DSTATE_VIEWPORT_STATE_POINTERS_SF_CLIP
- 3DSTATE_VIEWPORT_STATE_POINTERS_CC
- 3DSTATE_SCISSOR_STATE_POINTERS
do not care about the content/format/count of the render targets, only
the size of the render area and count of viewport/scissor.
By tracking render targets & render area we can reduce the emission of
those instructions.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25778>
Heuristic-based optimization throttling CCS work (async compute).
Without throttling, background compute work consumes all threads,
deminishing performance gains by running dispatch in parallel with
3D work.
Optimization is heuristics based, meaning a workload might slow
down when using async compute.
Best value: PixelAsyncComputeThreadLimit = 4. On DG2, this
equates to a max CCS thread occupancy of 37.5%.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25508>