nir_opt_algebraic is currently one of the most expensive NIR passes,
because of the many different patterns we've added over the years. Even
though patterns are already sorted by opcode, there are still way too
many patterns for common opcodes like bcsel and fadd, which means that
many patterns are tried but only a few actually match. One way to fix
this is to add a pre-pass over the code that scans it using an automaton
constructed beforehand, similar to the automatons produced by lex and
yacc for parsing source code. This automaton has to walk the SSA graph
and recognize possible pattern matches.
It turns out that the theory to do this is quite mature already, having
been developed for instruction selection as well as other non-compiler
things. I followed the presentation in the dissertation cited in the
code, "Tree algorithms: Two Taxonomies and a Toolkit," trying to keep
the naming similar. To create the automaton, we have to perform
something like the classical NFA to DFA subset construction used by lex,
but it turns out that actually computing the transition table for all
possible states would be way too expensive, with the dissertation
reporting times of almost half an hour for an example of size similar to
nir_opt_algebraic. Instead, we adopt one of the "filter" approaches
explained in the dissertation, which trade much faster table generation
and table size for a few more table lookups per instruction at runtime.
I chose the filter which resulted the fastest table generation time,
with medium table size. Right now, the table generation takes around .5
seconds, despite being implemented in pure Python, which I think is good
enough. Based on the numbers in the dissertation, the other choice might
make table compilation time 25x slower to get 4x smaller table size, but
I don't think that's worth it. As of now, we get the following binary
size before and after this patch:
text data bss dec hex filename
11979455 464720 730864 13175039 c908ff before i965_dri.so
text data bss dec hex filename
12037835 616244 791792 13445871 cd2aef after i965_dri.so
There are a number of places where I've simplified the automaton by
getting rid of details in the LHS patterns rather than complicate things
to deal with them. For example, right now the automaton doesn't
distinguish between constants with different values. This means that it
isn't as precise as it could be, but the decrease in compile time is
still worth it -- these are the compilation time numbers for a shader-db
run with my (admittedly old) database on Intel skylake:
Difference at 95.0% confidence
-42.3485 +/- 1.375
-7.20383% +/- 0.229926%
(Student's t, pooled s = 1.69843)
We can always experiment with making it more precise later.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
I'm not sure what triggered this, but building with
scons platform=windows toolchain=crossmingw machine=x86 build=profile
with MinGW g++ 7.3 or 7.4 causes an internal compiler error.
We can work around it by forcing -O1 optimization.
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
Reviewed-by: Neha Bhende <bhenden@vmware.com>
Use a different arrangement of constants to allow more ffma.
A vec4 backend will now use 3 fma for yuv_to_rgb. On freedreno/ir3, it is
down from 10 to 7 alu (4 fma, 3 mul, 3 add to 7 fma). Other backends
shouldn't be hurt.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Eric Anholt <eric@anholt.net>
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
The code was handling the Weak variant in some cases, but missing
others, e.g. the get_deref_nir_atomic_op. Add all the missing cases
with the same behavior of the non-Weak SpvOpAtomicCompareExchange.
Note that the Weak variant is basically an alias, as SPIR-V 1.3,
Revision 7 says
"OpAtomicCompareExchangeWeak
Deprecated (use OpAtomicCompareExchange).
Has the same semantics as OpAtomicCompareExchange."
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
One special case, `src/util/xmlpool/.gitignore` is not entirely deleted,
as `xmlpool.pot` still gets generated (eg. by `ninja xmlpool-pot`).
Signed-off-by: Eric Engestrom <eric.engestrom@intel.com>
Reviewed-by: Dylan Baker <dylan@pnwbakers.com>
From page 76 (page 80 of the PDF) of the GLSL 4.60 v.5 spec:
" No aliasing in output buffers is allowed: It is a compile-time or
link-time error to specify variables with overlapping transform
feedback offsets."
Currently, this is expected to fail, but it succeeds:
"
...
layout (xfb_offset = 0) out vec2 a;
layout (xfb_offset = 0) out vec4 b;
...
"
Fixes the following piglit test:
tests/spec/arb_enhanced_layouts/compiler/transform-feedback-layout-qualifiers/xfb_offset/invalid-overlap.vert
Fixes the following test:
KHR-GL44.enhanced_layouts.xfb_output_overlapping
v2:
- Use a data structure to track the used components instead of a
nested loop (Ilia).
v3:
- Take the BITSET_WORD array out from the
gl_transform_feedback_buffer struct and make it local to the
validation process (Timothy).
- Do not use a nested scope for the validation (Timothy).
v4:
- Add reference to the fixed piglit test in the commit log.
- Add reference to the fixed VK-GL-CTS test in the commit
log (Tapani).
- Empty initialize the BITSET_WORD pointers array (Tapani).
Cc: Timothy Arceri <tarceri@itsqueeze.com>
Cc: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Andres Gomez <agomez@igalia.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
On some hardware (e.g. Mali400) the shader needs to apply some
transformations for correct gl_FragCoord handling. The lowering
actions look like the following in pseudocode:
gl_FragCoord.xyz = gl_FragCoord_orig.xyz
gl_FragCoord.w = 1.0 / gl_FragCoord_orig.w
Add this lowering as a nir pass in preparation for using it in the driver.
Signed-off-by: Andreas Baierl <ichgeh@imkreisrum.de>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
fixes following warning with clang:
warning: suggest braces around initialization of subobject
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
Used same syntax as elsewhere with Mesa sources, verified result
against MSVC with godbolt.org.
fixes following warning with clang:
warning: suggest braces around initialization of subobject
v2: empty braces -> braces around subobject (Caio, Kristian)
Signed-off-by: Tapani Pälli <tapani.palli@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Reviewed-by: Kristian H. Kristensen <hoegsberg@google.com>
These have been popping up more and more with the OpenCL work and other
bits causing extra conversions to/from 64-bit.
Reviewed-by: Karol Herbst <kherbst@redhat.com>
This fixes a case where we are expecting 64-bit but generate
32-bit consts and validate gets angry.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
This fixes KHR-GL45.compute_shader.resources-max on radeonsi.
Fixes: 4e1e8f684b "glsl: remember which SSBOs are not read-only and pass it to gallium"
v2: use is_interface_array, protect again assertion failures in u_bit_consecutive
Reviewed-by: Dave Airlie <airlied@redhat.com>
Calculates i,j at specified offset within a pixel. A new load_size_ir3
intrinsic is used in conjunction with fddx/fddy to translate the offset
into primitive space and adjust the i,j from load_barycentric_pixel
accordingly.
Signed-off-by: Rob Clark <robdclark@chromium.org>
We already add the LOD src, so go ahead and update the texop as well
when this option is set.
v2: Make it an option. (Rob Clark)
v3: Use a more concise name suggested by Jason.
Reviewed-by: Rob Clark <robdclark@gmail.com>
We were only setting the used mask for the first component of a
varying. Since the linking opts split vectors into scalars this
has mostly worked ok.
However this causes an issue where for example if we split a
struct on one side of the interface but not the other, then we
can possibly end up removing the first components on the side
that was split and then incorrectly remove the whole struct
on the other side of the varying.
With this change we simply mark all 4 components for each slot
used by a struct. We could possibly make this more fine gained
but that would require a more complex change.
This fixes a bug in Strange Brigade on RADV when tessellation
is enabled, all credit goes to Samuel Pitoiset for tracking down
the cause of the bug.
Fixes: f1eb5e6399 ("nir: add component level support to remove_unused_io_vars()")
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Every file that included glsl/ir.h had a warning like:
src/compiler/glsl/ir.h: In member function ‘virtual bool ir_rvalue::is_lvalue(const _mesa_glsl_parse_state*) const’:
src/compiler/glsl/ir.h:236:64: warning: unused parameter ‘state’ [-Wunused-parameter]
virtual bool is_lvalue(const struct _mesa_glsl_parse_state *state = NULL) const
^
Cc: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Fixes: fa4ebf6b8d ("glsl: add _mesa_glsl_parse_state object to is_lvalue()")
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Commit 624789e370 moved the destruction of types out of atexit() and
made use of a ref count instead. This is useful for avoiding a crash
where drivers such as radeonsi are still compiling in a thread when the app
exits and has not called MakeCurrent to change from the current context.
While the above scenario is technically an app bug we shouldn't crash.
However that change caused another race condition between the shader
compilation tread in radeonsi and context teardown functions.
This patch makes two changes to fix this new problem:
First we explicitly call _mesa_destroy_shader_compiler_types() when destroying
the st context rather than calling it indirectly via _mesa_free_context_data().
We do this as we must call it after st_destroy_context_priv() so that we don't
destory the glsl types before the compilation threads finish.
Next wait for the shader threads to finish in si_destroy_context() this
also means we need to call context destroy before destroying the queues
in si_destroy_screen().
Fixes: 624789e370 ("compiler/glsl: handle case where we have multiple users for types")
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
This operation decorate with an Id instead of a Literal or String.
It is used by HlslCounterBufferGOOGLE (provided by
SPV_GOOGLE_hlsl_functionality1). Even if we don't do anything with
that decoration, we must be able to parse SPIR-V that uses it.
Fixes: 891886da2f "spirv: Add no-op support for VK_GOOGLE_hlsl_functionality1"
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Decorations (and ExecutionModes) can have not only literals, but also
Ids associated with them. So rename the field to the more general
name "Operand" used by the spec.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
This reverts commit 9e0c744f07, which
regressed dEQP-GLES2.functional.uniform_api.random.3. It turns out
that the newly produced location is meaningless and impossible to
consume by drivers that want to look at gl_uniform_storage, so it's
probably better to leave it unset (0) than a number that looks usable.
Leave a tombstone^Wcomment to discourage the next person from making
the obvious looking fix.
See the next commit for a longer description of the problem.
This breaks tests/spec/glsl-1.10/execution/samplers/uniform-struct
on i965, which was originally fixed by the revert. The next commit
will fix it again.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We have a macro for this now; no reason to hand-roll it for derefs.
While we're here, move the NIR_DEFINE_CAST for derefs down to where all
the other ones are.
Reviewed-by: Eric Anholt <eric@anholt.net>
When we have a bindless sampler, we need an instruction header. Even in
SIMD8, this pushes the instruction over the sampler message size maximum
of 11 registers. Instead, we have to lower TXD to TXL.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Marcelo de Oliveira Filho <caio.oliveira@intel.com>
Compiler warns about overflow when assigning UINT64_MAX to something
smaller than a uin64_t:
src/compiler/nir/nir_constant_expressions.c:16909:50: warning: implicit conversion from 'unsigned long long' to 'uint1_t' (aka 'unsigned char') changes value from 18446744073709551615 to 255 [-Wconstant-conversion]
uint1_t dst = (src0 + src1) < src0 ? UINT64_MAX : (src0 + src1);
~~~ ^~~~~~~~~~
Shift UINT64_MAX down to the appropriate maximum value for the type
being assigned to.
Signed-off-by: Kristian H. Kristensen <hoegsberg@google.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
If we want to assert on found == true when the loop exits early, we
need to initialize it to false.
Signed-off-by: Kristian H. Kristensen <hoegsberg@chromium.org>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
When looking at the dEQP nested_struct_array_dynamic_index_fragment code
after lowering, I was horrified at the amount of adding and multiplying by
0 we were doing. The builder _imm helpers handle that for you so that the
following optimization passes have less work to do. Plus, it's easier to
read.
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
We were calcuating the offset for the field within the struct, and just
dropping it on the floor. Fixes a regression in
KHR-GLES3.shaders.struct.local.nested_struct_array_dynamic_index_fragment
and a few of its friends since the scratch lowering commit.
Fixes: e8e159e9df ("nir/deref: Add helpers for getting offsets")
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The mali utgard pp doesn't support a sign instruction.
In the ARM offline shader compiler, the sign function is implemented
using sub(gt(0.0, a), lt(0.0, a)).
This is a generic optimization, so implement it in the nir level when
lower_fsign is set, alongside the lowering for isign.
Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
All of the affected shaders are in Mad Max. I noticed this while
looking at some other things. I tried a couple similar patterns, but
the affect on cycles was general negative. It may be worth revisiting
this later.
v2: Rebase on 1-bit Boolean changes.
All Gen7+ platforms had similar results. (Skylake shown)
total instructions in shared programs: 15282073 -> 15282053 (<.01%)
instructions in affected programs: 1192 -> 1172 (-1.68%)
helped: 14
HURT: 0
helped stats (abs) min: 1 max: 2 x̄: 1.43 x̃: 1
helped stats (rel) min: 1.16% max: 2.17% x̄: 1.65% x̃: 1.39%
95% mean confidence interval for instructions value: -1.73 -1.13
95% mean confidence interval for instructions %-change: -1.91% -1.38%
Instructions are helped.
total cycles in shared programs: 372595954 -> 372594532 (<.01%)
cycles in affected programs: 11477 -> 10055 (-12.39%)
helped: 14
HURT: 0
helped stats (abs) min: 76 max: 122 x̄: 101.57 x̃: 104
helped stats (rel) min: 7.76% max: 15.62% x̄: 12.94% x̃: 14.78%
95% mean confidence interval for cycles value: -111.05 -92.09
95% mean confidence interval for cycles %-change: -14.90% -10.98%
Cycles are helped.
No changes on any Gen6 or earlier platforms.
Reviewed-by: Matt Turner <mattst88@gmail.com>