For rgbx formats, there is no point in doing alpha conversion again (and
with different tranpose even, so llvm can't eliminate it).
Albeit it looks like there's some minimal changes needed in the blend code
(found by code inspection, no test seemed to complain) if we do this -
the blend factors are already sanitized if we have no destination alpha,
however for src_alpha_saturate it looks like it still might make a
difference (note that we forced has_alpha to true before for some formats
and nothing complained, but this seems safer).
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
llvm has _huge_ problems trying to load things like <4 x i8> vectors and
stitching such loads together to form 128bit vectors. My understanding
of the problem is that the type legalizer tries to extend that to
really a <4 x i32> vector and not a <16 x i8> vector with the 4 elements
first then followed by padding, so the shuffles for then combining things
together are more or less impossible - you can in fact see the pmovzxd
llvm generates. Pre-4.0 llvm just gives up on it completely and does a 30+
pextrb/pinsrb sequence instead.
It looks like current llvm has fixed this behavior (my guess would be
due to better shuffle combination and load/shuffle folds), but we can
avoid this by just loading as <1 x i32> values, combine that and only
cast at the end. (I suspect it might also work if we'd pad the loaded
vectors immediately before shuffling them together, instead of directly
stitching 2 such vectors together pairwise before combining the pair.
But this _might_ lose the ability to load the values directly into
their right place in the vector with pinsrd.). But using 32bit values
is probably easier for llvm as it will never give it funny ideas how
the vector should look like.
(This is possibly only a problem for 1x8bit formats, since 2x8bit will
end up fetching 64bit hence only two vectors are stitched together,
not 4, but we use the same strategy anyway.)
Reviewed-by: Jose Fonseca <jfonseca@vmware.com>
The only reason we didn't previously enable this was the dependency on
OpenGL ES 3.1. These should have been enabled as soon as HSW got
stencil texturing. We also needed to fixup setting MaxViewports.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Since 9d6ca7c3, there should be no performance hit for having
MaxViewports > 1. Always set this context state. This eliminates the
need to update this conditional as we add support for OES_viewport_array
on older GPUs.
Signed-off-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
The CS thread is needed to ensure proper ordering of operations and can't
be disabled (without complicating the code).
Discovered by Nine CSMT, which ended up in a deadlock.
Acked-by: Edward O'Callaghan <funfunctor@folklore1984.net>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Fixes a number of transform feedback tests when run with Linux 4.8,
which allows us to use the MI_LOAD_REGISTER_REG command, at which point
we started using this new broken path.
ES3-CTS.functional.transform_feedback.array_element.interleaved.lines.*
and Piglit's arb_transform_feedback2/draw-auto are both fixed by this
patch, for example.
Thanks to Chris Wilson for catching this mistake!
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99030
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>
We were failing to zero m0.2 of the sampler message header for TCS and
GS messages in the simple case. fs_generator has done this for about
a year now, but we missed it in vec4_generator.
Fixes ES31-CTS.core.texture_cube_map_array.sampling,
GL45-CTS.texture_cube_map_array.sampling, and many
dEQP-GLES31.functional.shaders.opaque_type_indexing.sampler subtests:
- dynamically_uniform.tessellation_control.isampler3d
- dynamically_uniform.tessellation_control.isamplercube
- dynamically_uniform.tessellation_control.sampler2d
- dynamically_uniform.tessellation_control.usamplercube
- dynamically_uniform.tessellation_control.sampler2darray
- dynamically_uniform.tessellation_control.isampler2darray
- dynamically_uniform.tessellation_control.usampler3d
- dynamically_uniform.tessellation_control.usampler2darray
- dynamically_uniform.tessellation_control.usampler2d
- dynamically_uniform.tessellation_control.sampler3d
- dynamically_uniform.tessellation_control.samplercube
- dynamically_uniform.tessellation_control.isampler2d
- uniform.tessellation_control.isampler3d
- uniform.tessellation_control.isamplercube
- uniform.tessellation_control.usampler2d
- uniform.tessellation_control.usampler3d
- uniform.tessellation_control.sampler2darray
- uniform.tessellation_control.isampler2darray
- uniform.tessellation_control.usampler2darray
- uniform.tessellation_control.sampler2d
- uniform.tessellation_control.usamplercube
- uniform.tessellation_control.sampler3d
- uniform.tessellation_control.samplercube
- uniform.tessellation_control.isampler2d
Cc: mesa-stable@lists.freedesktop.org
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Fix incorrect swizzling in SIMD16 Transpose_16_16 breaking the
two-channel 16-bpc formats like R16G16_FLOAT.
Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>
A while ago, we stopped using Luca's GLSL IR lower_jumps pass in favor
of nir_lower_returns(). Marek's commit d3cb79e043
put it in do_common_optimization, which resulted in us calling it again.
Dropping the EmitNoMainReturn setting makes us skip that pass again.
Apparently that pass doesn't work properly, because this fixes Piglit's
tests/spec/glsl-1.10/execution/vs-nested-return-sibling-loop.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99287
Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Reviewed-by: Timothy Arceri <timothy.arceri@collabora.com>
The T images are composed of effectively swizzled-around blocks of LT (4x4
utile) images, so we can reduce the t_utile_address() calls by 16x by
calling into the simpler LT loop.
This also adds support for calling down with non-utile-aligned
coordinates, which will be part of lifting the utile alignment requirement
on our callers and avoiding the RMW on non-utile-aligned stores.
Improves 1024x1024 TexSubImage by 2.55014% +/- 1.18584% (n=46)
Improves 1024x1024 GetTexImage by 2.242% +/- 0.880954% (n=32)
They now have less of a dependency on the cpp, and don't have to do a
divide.
Hacking up mesa-demos teximage to do only one subtest and not draw
points, I saw 1024x1024 glTexSubImage2D() improve by 4.86939% +/-
1.40408% (n=30) and glGetTexImage() by 2.18978% +/- 0.140268% (n=5).
If we get up toward 256MB (or whatever the CMA area size is),
VC4_GEM_CREATE will start throwing errors. Even if we don't trigger
that, when we flush the kernel's BO allocation for the CLs or bin
memory may end up throwing an error, at which point our job won't get
rendered at all.
Just flush early (half of maximum CMA size) so that hopefully we never
get to that point.
This will help allow us to simplify the handling of samplers by
storing them in a single location rather than duplicating them in
both gl_linked_shader and gl_program.
Reviewed-by: Eric Anholt <eric@anholt.net>
Now that we create gl_program earlier there is no need to mess about
copying things to gl_linked_shader then to gl_program.
Reviewed-by: Eric Anholt <eric@anholt.net>
There is no need to loop over active samplers the code above this
would have already exited if the sampler was inactive, or errored
if the count was larger than the uniforms array size.
Reviewed-by: Eric Anholt <eric@anholt.net>
Making this point to a gl_program struct rather than a gl_shader_program
struct will allow use to later also make the CurrentProgram array hold
gl_program structs which in turn will allow for code simpilifcation.
Reviewed-by: Eric Anholt <eric@anholt.net>
Now that we have the is_arb_asm flag we can just skip the
initialisation.
V2: remove hack from standalone compiler where it was never
needed since it only compiles glsl shaders.
Reviewed-by: Eric Anholt <eric@anholt.net>
Set the flag via the _mesa_init_gl_program() and NewProgram()
helpers.
In i965 we currently check for the existance of gl_shader_program
to decide if this is an ARB assembly style program or not.
Adding a flag makes the code clearer and will help removes a
dependency on gl_shader_program in the i965 codegen functions.
Also this will allow use to skip initialising sampler units for
linked shaders, we currently memset it to zero again during linking.
Reviewed-by: Eric Anholt <eric@anholt.net>
We can now just get the data needed from the gl_shader_program_data
pointer in gl_program.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Having it here rather than in gl_linked_shader allows us to simplify
the code.
Also it is error prone to depend on the gl_linked_shader for programs
in current use because a failed linking attempt will free infomation
about the current program. In i965 we could be trying to recompile
a shader variant but may have lost some required fields.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Here we also remove the duplicate field in gl_linked_shader and always
get the value from shader_info instead.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This will help allow us to store pointers to gl_program structs in the
CurrentProgram array resulting in a bunch of code simplifications.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
This also removes the duplicate field in gl_linked_shader, and
gets num_ubos from shader_info instead.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>