We have job-level retry on failure now, and will continue to need to in
order to work around fd.o infrastructure flakes. If we stop doing retry
inside the job, then we can crank down the gitlab-level timeouts on test
jobs to be closer to our CI guidelines and avoid blocking a runner for an
hour when things go wrong (for example, cheza #16 failing to boot in a
recognized way and continuously looping due to the intra-job retry).
Plus, the job logs will be more readable when you don't have two boots in
one job, and we'll get the flakes surfaced in our monitoring dashboards.
If internal retries were really doing useful work we may see an increase
in flakes as a result of this. I'm committing to turning off boards or
reducing coverage as necessary to handle this.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25790>
This should help with "marge got stuck for an hour and all I got was this
failed job with no results/" when a system intermittently wedges.
This replaces the BM_POE_TIMEOUT ("did we get something on serial in the
last 3 minutes?") that rpi had, in favor of checking that the whole test
job gets through in 20 minutes.
Acked-by: Juan A. Suarez <jasuarez@igalia.com>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17096>
The a530s will occasionally fail to make it to the fastboot prompt,
with no other deltas between a working log and a log stalled waiting
for that line to show up.
So, add a serial timeout (like the rpi boards do for similar reasons),
and on timeout restart the run. We actually restart the whole serial
watching process, because the SerialBuffer finishes itself on timeout.
This should also help with the intermittent issue we've had where a
power cycle causes the python serial module to throw an exception.
Tested with the gitlab-disabled db820c that never makes it to the
fastboot prompt (I think it's one where we need a longer micro cable
to connect it!) and saw successful boot looping to retry.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11308>
Modeling after what I did for cros_servo_run.py, this gives us easy
support for restarting the test run a530 when we detect a spontaneous
reboot. I had to touch up serial_buffer.py to handle buffering in from a
file instead of a serial device, to support the upcoming etnaviv CI
(tested by running it against a serial log from db410c and seeing it step
to calling "fastboot")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6529>