V3D 4.x allows more flexibility, so take advantage of that. Generally,
we can reorder any writes in the same sequence, so long as they are
not the sequence terminator (which must always be last, since it is
the one triggering the operation), and TMUD writes, since these must
be ordered with respect to each other.
total instructions in shared programs: 13735183 -> 13731927 (-0.02%)
instructions in affected programs: 903057 -> 899801 (-0.36%)
helped: 2358
HURT: 746
Instructions are helped.
total max-temps in shared programs: 2322020 -> 2322009 (<.01%)
max-temps in affected programs: 619 -> 608 (-1.78%)
helped: 19
HURT: 11
Inconclusive result (value mean confidence interval includes 0).
total sfu-stalls in shared programs: 31494 -> 31489 (-0.02%)
sfu-stalls in affected programs: 182 -> 177 (-2.75%)
helped: 40
HURT: 40
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 13766677 -> 13763416 (-0.02%)
inst-and-stalls in affected programs: 901343 -> 898082 (-0.36%)
helped: 2349
HURT: 746
Inst-and-stalls are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9555>
Instead of using a write depdency. We use last_tmu_config to ensure ordering
of instructions participating in different TMU sequences. To this end,
all sequence terminators flag a write dependency on last_tmu_config, but
wrtmuc is not a sequence terminator, so we can be more flexible by flagging
it as a read depedency. This would prevent it to be moved into a previous
sequence (since it cannot be moved past the previous sequence terminator due
to the read depedency), but it allows it to be reordered with instructions in
the same sequence, which allows us to pair it up more effectively. Particularly,
it allows to pair up a wrtmuc with the sequence terminator of the same sequence,
turning code like this:
nop ; mov tmut, r0 ; thrsw; wrtmuc (tex[0].p0 | 0x3)
nop ; nop ; wrtmuc (tex[0].p1 | 0x0)
nop ; mov tmus, r1
Into this:
nop ; mov tmut, r0 ; thrsw; wrtmuc (tex[0].p0 | 0x3)
nop ; mov tmus, r1 ; wrtmuc (tex[0].p1 | 0x0)
total instructions in shared programs: 13755738 -> 13735183 (-0.15%)
instructions in affected programs: 2510921 -> 2490366 (-0.82%)
helped: 10963
HURT: 485
Instructions are helped.
total max-temps in shared programs: 2322828 -> 2322020 (-0.03%)
max-temps in affected programs: 11303 -> 10495 (-7.15%)
helped: 608
HURT: 19
Max-temps are helped.
total sfu-stalls in shared programs: 31545 -> 31494 (-0.16%)
sfu-stalls in affected programs: 235 -> 184 (-21.70%)
helped: 62
HURT: 11
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 13787283 -> 13766677 (-0.15%)
inst-and-stalls in affected programs: 2525187 -> 2504581 (-0.82%)
helped: 10989
HURT: 477
Inst-and-stalls are helped.
v2: add a comment explaining the read depdency (Piñeiro).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9555>
we already have the batch usage info here for the resource, so if we know
the resource is already used on the batch then we don't need to also perform
a hash lookup to double check that it's really there
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9565>
this was only needed to cover up some other bugs:
* missing barriers for buffer sampler/image descriptors
* weirdness with first frame handling
there's better ways of handling both cases, and now they're handled better
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9566>
The disadvantages of the DYNAMIC path over the
non-dynamic path are minor.
The advantages are many.
As we don't know if bad behaving apps use
non-dynamic SYSTEMMEM in a dynamic fashion,
let's be safe and always be dynamic.
Signed-off-by: Axel Davy <davyaxel0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9451>
SW vertex processing buffers are supposed to be sorted in RAM
and to be immediately idle after use (thus you can write at the
same location again immediately).
DYNAMIC SYSTEMMEM is by far the best fit for now for these kind
of buffers, though it can be improved further. Indeed the use
pattern will cause a lot of syncs with csmt actived.
Thus disable csmt when full sw vertex processing is requested.
Signed-off-by: Axel Davy <davyaxel0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9451>
Some apps use DYNAMIC SYSTEMMEM buffers and fill them in a
dynamic fashion with discard and nooverwrite locking flags.
To prevent uploading the whole buffer every draw call,
track the region needed for the draw call, and
upload only that region (or a bit more in order
to ease valid region tracking).
Try to aggressively upload with discard/unsynchronized.
Signed-off-by: Axel Davy <davyaxel0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9451>
So far we did nothing on EndScene, but the API
doc says it flushes the GPU command queue.
The doc implies one can optimize CPU usage
by calling EndScene long before Present() is called.
Implementing the flush behaviour gives me +15-20%
on the CPU limited Halo. On the other hand, do limit
the flush to only once per frame.
3DMark03/3Mark05 get a 2% perf hit with the patch,
but 5% if I allow more flushes.
Signed-off-by: Axel Davy <davyaxel0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9451>
The enum is defined in EXT_framebuffer_object (available on desktop),
and was copied to OES_framebuffer_object (available in GLES1). The ES2
spec has no mention of such an enum. If the underlying implementation
does not support this, it will set a generic incomplete error (as is
done in st_cb_fbo.c if mixed_formats == false).
This should fix the following dEQP tests on ES2 drivers:
dEQP-GLES2.functional.fbo.completeness.attachment_combinations.rbo_tex_none_none
dEQP-GLES2.functional.fbo.completeness.attachment_combinations.tex_rbo_none_none
Fixes: fd017458bc (mesa: fix fbo attachment size check for RBs, make it trigger in ES2)
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/4444
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Eric Anholt <eric@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9572>
In compat GL, the gl_FrontColor/BackColor gets clamped unless you
explicitly turn it off with ARB_color_buffer_float, and most apps doing
floating point color rendering are going to be using non-compat varyings
anyway instead of hoping that ARB_color_buffer_float exists. Thus, guess
that the app will get clamping on the color outputs to reduce draw-time
recompiles.
Saves 60 VS recompiles on half-life-2.trace on freedreno (and similarly for
many other compat GL apps).
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9509>