Address loading can often end up as shl + shr + shl combinations. The
latter two are equal shifts, which get converted into an and mask.
However if the previous shl is more than the mask is trying to remove
(in terms of low bits), we can just remove the and entirely. This
reduces some large shaders by as many as 3% of instructions (out of 2K).
total instructions in shared programs : 6495509 -> 6491076 (-0.07%)
total gprs used in shared programs : 954621 -> 954623 (0.00%)
local gpr inst bytes
helped 0 0 1014 1014
hurt 0 2 0 0
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
We don't need to support all the color buffers for advanced blend, just
cb0. For Fermi, we use the special binding slots so that we don't
overlap with user textures, while Kepler+ gets a dedicated position for
the fb handle in the driver constbuf.
This logic is only triggered when a FBFETCH is actually present so it
should be a no-op most of the time.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
This is so that we can differentiate between flushing any framebuffer
reading caches from regular sampler caches.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The existing lowering is in place to lower that to RCP + MUL, or fancier
things down the line if necessary.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Renaming data sources was added in
e8bb97ce30
It was possible to use a new name longer than
the name array in hud_graph of 128. This
patch truncates the name to fit the array.
CC: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Marek Olšák <marek.olsak@amd.com>
It should be close to the GPU load, but it can be much lower if something
is stalling shader execution (e.g. CP DMA).
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
This will flush the pipeline,which will allow to share dma-buf based
buffers.
Signed-off-by: Suresh Guttula <Suresh.Guttula@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Currently, we can store up to 256 immediates in a static array,
but this is not always enough. Instead, allocate a dynamic array
like what we currently do for temps.
This fixes a segfault with
dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.23
No regressions found with full piglit run.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
There's a levelZero flag which forces texturing to pick level zero (and
not consume an explicit LOD argument). This is set for MS targets, but
could also be set for any other incoming instruction. As that is what
determines whether a LOD argument is present, check that rather than the
more indirect isMS logic.
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Ever since a long time ago when I messed around with fences, I ensure
that after a PUSH_SPACE call there is enough space to write a fence out
into the pushbuf.
However the PUSH_SPACE macro is not all-knowing, and so sometimes we
have to invoke nouveau_pushbuf_space manually with the relocs/pushes
args set. If we don't take the extra allocation from PUSH_SPACE into
account, then we will end up accidentally flushing when the code was not
expecting a flush. This can lead to various runtime and rendering
failures.
The amount of extra allocation isn't that important - it has to be at
least 8 based on the current nouveau_winsys.h setting, but even more
won't hurt. I just rounded up to powers of 2.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99354
Cc: "12.0 13.0" <mesa-stable@lists.freedesktop.org>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Acked-by: Ben Skeggs <bskeggs@redhat.com>
Code is taken from a combination of radv (for the more basic functions,
to avoid gallivm dependencies) and radeonsi (for the new and improved
derivative calculations).
v2: add 0.5 offset to tex coords only after derivative calculation
v3:
- really only touch the first three coordinates
- rebase on the removal of the 1.5 --> 0.5 offset change
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> (v2)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Sourcing coords_arg[4] is actually never correct, since bias is handled
differently in tex_fetch_args anyway.
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
As remarked by the comment in the original code, the old algorithm fails when
(tc + deriv) points at a different cube face. Instead, simply project the
derivative directly to the plane of the selected cube face.
The new code is based on exactly differentiating (using the chain rule)
the projection onto a plane corresponding to a fixed cube map face (which
is still selected in the usual way based on the texture coordinate itself).
The computations end up fairly involved, but we do save two reciprocal
computations.
Fixes GL45-CTS.texture_cube_map_array.sampling.
v2: add 0.5 offset to tex coords only after derivative calculation
v3: go back to 1.5 offset
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> (v2)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
b838f642 "ac/debug: Move sid_tables.h generation to common code." moved
sid_tables.h but forgot the corresponding .gitignore.
Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
The generated file is correctly stored in the builddir as of earlier
commit. Yet the commit forgot to add the respective include flag thus
the compiler would error out failing to find sid_tables.h
Bugzila: https://bugs.freedesktop.org/show_bug.cgi?id=99389
Fixes: d1dc22eb46 "ac: automake: rework sid_tables.h generation"
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
When the flag D3DCREATE_MULTITHREAD is set, a global mutex is used
to protect nine calls.
However for performance reasons, AddRef and Release didn't hold the mutex,
and instead used atomics.
Unfortunately at item release, the item can be destroyed, and that
destruction path should be protected by a mutex (at least for
some objects).
Without this patch, it is possible an app thread is in a dtor
while another thread is making gallium nine calls. It is possible
that two threads are using the same gallium pipe, which is forbiden.
The problem has been made worse with csmt, because it can cause hang,
since nine_csmt_process is not threadsafe.
Fixes Hitman hang, and possibly others.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
Flush the queue to get refcounts right, and properly
release the items, instead of throwing away all pending
commands.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
nine_context uses NineSurface9 fields, thus we need to flush
pending commands using the surface before changing the fields.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
There is no need to check on csmt_active before
calling nine_csmt_process, because the function
checks already.
Signed-off-by: Axel Davy <axel.davy@ens.fr>
When dirty region is empty, u_box_union_* incorrectly expands
the new region.
This fixes broken font rendering issue in WOLF RPG Editor v2.10 games.
Signed-off-by: Masanori Kakura <kakurasan@gmail.com>
Reviewed-by: Axel Davy <axel.davy@ens.fr>
Changes from V1 -> V2:
- updated Copyright
- added $(top_srcdir)/src/gallium/winsys to include path (suggested by Emil)
- adapted driver to new renderonly API
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Acked-by: Emil Velikov <emil.velikov@collabora.com>
This driver supports a wide range of Vivante IP cores like GC880,
GC1000, GC2000 and GC3000.
Changes from V1 -> V2:
- added missing files to actually integrate the driver into build system.
- adapted driver to new renderonly API
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Wladimir J. van der Laan <laanwj@gmail.com>
Acked-by: Emil Velikov <emil.velikov@collabora.com>
This a very lightweight library to add basic support for renderonly
GPUs. A kms gallium driver must specify how a renderonly_scanout
objects gets created. Also it must provide file handles to the used
kms device and the used gpu device.
This could look like:
struct renderonly ro = {
.create_for_resource = renderonly_create_gpu_import_for_resource,
.kms_fd = fd,
.gpu_fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC)
};
The renderonly_scanout object exits for two reasons:
- Do any special treatment for a scanout resource like importing the
GPU resource into the scanout hw.
- Make it easier for a gallium driver to detect if anything special
needs to be done in flush_resource(..) like a resolve to linear.
A GPU gallium driver which gets used as renderonly GPU needs to be
aware of the renderonly library.
This library will likely break android support and hopefully will get
replaced with a better solution based on gbm2.
Changes from V1 -> V2:
- reworked the lifecycle of renderonly object (suggested by Nicolai Hähnle)
- killed the midlayer (suggested by Thierry Reding)
- made the API more explicit regarding gpu and kms fd's
- added some docs
Signed-off-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Acked-by: Emil Velikov <emil.velikov@collabora.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Defer delete on regular resources. This ensures that any work being done
on the resource is completed before freeing up the resource's memory.
Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>
Although, arb_shader_image_load_store-atomicity will most likely
hang your box, I think it's now quite reasonable to enable GL 4.3
on Maxwell/Pascal GPUs. I suspect that test to be wrong because
it doesn't even work on the NVIDIA blob.
I have tested a bunch of benchmarks (UE4 demos) and real games
like Shadow of Mordor and they all work fine.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>