Vectorizing a loop by making it branchless
September 10, 2014
My home server runs Motion for detecting movement on my webcam. Because it's an underpowered Atom machine, the image processing was responsible for roughly 12% constant load. That was a bit much for such a simple task, I thought, so I decided to profile Motion with Callgrind to get an idea of where the bottlenecks were. I hoped that maybe a bit of vectorized code would ease them out. This seemed promising, because it appeared unlikely that the functions were already being autovectorized. The loops were full of data-dependent branches, and didn't look very optimizer-friendly.
To skip straight to the ending, my attempts at tackling the main bottleneck eventually paid off with a 2× speedup and a correspondingly lower load of around 6%. This is the story, in lab journal form, of how I achieved that result. It's also a step-by-step derivation of, and elaborate comment on, the final algorithm the vectorized code uses.
(I'll definitely use this format again if I ever need to do something like this. My first attempts at tackling this function failed, because I wasn't documenting my steps and would make small, subtle mistakes that invalidated the whole thing. Also, a test rig that compares the SSE work-in-progress to the known good function for the whole range of inputs is absolutely invaluable. Add some benchmarking to keep the personal satisfaction level up.)
So, running Callgrind and looking at the cost table showed that the most expensive function was alg_update_reference_frame. Here's the whole thing for completeness. It may look intimidating, but exactly what is going on in image processing terms is not relevant to us here. Also, we're only going to look at the part in the for loop.
/**
* alg_update_reference_frame
*
* Called from 'motion_loop' to calculate the reference frame
* Moving objects are excluded from the reference frame for a certain
* amount of time to improve detection.
*
* Parameters:
*
* cnt - current thread's context struct
* action - UPDATE_REF_FRAME or RESET_REF_FRAME
*
*/
#define ACCEPT_STATIC_OBJECT_TIME 10 /* Seconds */
#define EXCLUDE_LEVEL_PERCENT 20
void alg_update_reference_frame(struct context *cnt, int action)
{
    int accept_timer = cnt->lastrate * ACCEPT_STATIC_OBJECT_TIME;
    int i, threshold_ref;
    int *ref_dyn = cnt->imgs.ref_dyn;
    unsigned char *image_virgin = cnt->imgs.image_virgin;
    unsigned char *ref = cnt->imgs.ref;
    unsigned char *smartmask = cnt->imgs.smartmask_final;
    unsigned char *out = cnt->imgs.out;

    if (cnt->lastrate > 5) /* Match rate limit */
        accept_timer /= (cnt->lastrate / 3);

    if (action == UPDATE_REF_FRAME) { /* Black&white only for better performance. */
        threshold_ref = cnt->noise * EXCLUDE_LEVEL_PERCENT / 100;

        for (i = cnt->imgs.motionsize; i > 0; i--) {
            /* Exclude pixels from ref frame well below noise level. */
            if (((int)(abs(*ref - *image_virgin)) > threshold_ref) && (*smartmask)) {
                if (*ref_dyn == 0) { /* Always give new pixels a chance. */
                    *ref_dyn = 1;
                } else if (*ref_dyn > accept_timer) { /* Include static Object after some time. */
                    *ref_dyn = 0;
                    *ref = *image_virgin;
                } else if (*out) {
                    (*ref_dyn)++; /* Motionpixel? Keep excluding from ref frame. */
                } else {
                    *ref_dyn = 0; /* Nothing special - release pixel. */
                    *ref = (*ref + *image_virgin) / 2;
                }
            } else { /* No motion: copy to ref frame. */
                *ref_dyn = 0; /* Reset pixel */
                *ref = *image_virgin;
            }

            ref++;
            image_virgin++;
            smartmask++;
            ref_dyn++;
            out++;
        } /* end for i */

    } else { /* action == RESET_REF_FRAME - also used to initialize the frame at startup. */
        /* Copy fresh image */
        memcpy(cnt->imgs.ref, cnt->imgs.image_virgin, cnt->imgs.size);
        /* Reset static objects */
        memset(cnt->imgs.ref_dyn, 0, cnt->imgs.motionsize * sizeof(*cnt->imgs.ref_dyn));
    }
}
The take-away is that this loops over all pixels individually, does a few checks against masks and timers, and writes some outputs to the frames at *ref_dyn and *ref. The problem with this loop is that it can't easily be vectorized (run in parallel over multiple pixels at the same time), because all the branches are data-dependent. Each pixel can take a potentially different path through the routine.
We tackle this by converting this into branchless code. Each branch becomes a masking operation. We calculate all outputs for all pixels in parallel, and composite the outputs into the result with bitmasks. We'll be doing much more work than we do for a single pixel, but still get a speedup because we knock out 16 pixels at a time.
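In scalar terms, the building block is the select-by-mask idiom: a condition becomes an all-ones or all-zeros mask, and the mask picks one of two candidate results with plain bitwise operations. Here's a minimal standalone sketch of the idea (not Motion code; the helper name is mine):

#include <stdio.h>

/* Branchless equivalent of: mask ? if_true : if_false.
 * 'mask' is assumed to be all-ones (0xFF) or all-zeros (0x00),
 * which is exactly what SSE byte comparisons produce. */
static unsigned char select_by_mask(unsigned char mask,
                                    unsigned char if_true,
                                    unsigned char if_false)
{
    return (unsigned char)((if_true & mask) | (if_false & ~mask));
}

int main(void)
{
    printf("%d\n", select_by_mask(0xFF, 10, 20)); /* prints 10 */
    printf("%d\n", select_by_mask(0x00, 10, 20)); /* prints 20 */
    return 0;
}

The rest of this post is about getting the loop into a shape where every output can be written this way.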
Create and apply bitmasks
Let's start by taking just the loop and stripping out the comments. Let's untangle data and logic by putting the conditions in the if statements into their own boolean variables:
for (i = cnt->imgs.motionsize; i > 0; i--)
{
    int thresholdmask = ((int)(abs(*ref - *image_virgin)) > threshold_ref);
    int includemask = (thresholdmask && !(*smartmask == 0));
    int refdynzero = (*ref_dyn == 0);
    int refdyntimer = (*ref_dyn > accept_timer);
    int outzero = (*out == 0);

    if (includemask) {
        if (refdynzero) {
            *ref_dyn = 1;
        } else if (refdyntimer) {
            *ref_dyn = 0;
            *ref = *image_virgin;
        } else if (!outzero) {
            (*ref_dyn)++;
        } else {
            *ref_dyn = 0;
            *ref = (*ref + *image_virgin) / 2;
        }
    } else {
        *ref_dyn = 0;
        *ref = *image_virgin;
    }

    ref++;
    image_virgin++;
    smartmask++;
    ref_dyn++;
    out++;
}
Make branch statements independent
Now let's turn the decision tree into a series of independent if statements. The main if/else becomes a part contingent on if (includemask), and a part for the inverse, if (!includemask). The other conditions get the same treatment. Since no two of the conditions can be true at the same time, the statements have now become independent of each other:
int thresholdmask = ((int)(abs(*ref - *image_virgin)) > threshold_ref);
int includemask = (thresholdmask && !(*smartmask == 0));
int refdynzero = (*ref_dyn == 0);
int refdyntimer = (*ref_dyn > accept_timer);
int outzero = (*out == 0);

if (includemask && refdynzero) {
    *ref_dyn = 1;
}
if (includemask && !refdynzero && refdyntimer) {
    *ref_dyn = 0;
    *ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
    (*ref_dyn)++;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
    *ref_dyn = 0;
    *ref = (*ref + *image_virgin) / 2;
}
if (!includemask) {
    *ref_dyn = 0;
    *ref = *image_virgin;
}
Split into ref and ref_dyn
Because the statements are independent, duplicate and split them into statements that set the value of *ref_dyn, and statements that set the value of *ref:
if (includemask && refdynzero) {
    *ref_dyn = 1;
}
if (includemask && !refdynzero && refdyntimer) {
    *ref_dyn = 0;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
    (*ref_dyn)++;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
    *ref_dyn = 0;
}
if (!includemask) {
    *ref_dyn = 0;
}

if (includemask && !refdynzero && refdyntimer) {
    *ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
    *ref = (*ref + *image_virgin) / 2;
}
if (!includemask) {
    *ref = *image_virgin;
}
Reorder steps for same values
Reorder the clauses so that like results are with like:
if (!includemask) {
    *ref_dyn = 0;
}
if (includemask && !refdynzero && refdyntimer) {
    *ref_dyn = 0;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
    *ref_dyn = 0;
}
if (includemask && refdynzero) {
    *ref_dyn = 1;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
    (*ref_dyn)++;
}

if (!includemask) {
    *ref = *image_virgin;
}
if (includemask && !refdynzero && refdyntimer) {
    *ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
    *ref = (*ref + *image_virgin) / 2;
}
Group if statements for similar outcomes with an OR
Any if conditions that lead to the same outcome can be OR'ed together:
if (!includemask || (includemask && !refdynzero && (refdyntimer || (!refdyntimer && outzero)))) {
    *ref_dyn = 0;
}
if (includemask && refdynzero) {
    *ref_dyn = 1;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
    (*ref_dyn)++;
}

if (!includemask || (includemask && !refdynzero && refdyntimer)) {
    *ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
    *ref = (*ref + *image_virgin) / 2;
}
Merge conditions
Observe that setting *ref_dyn to 1 is identical to setting it to 0 and incrementing it. Merge the condition in the if statement that sets *ref_dyn to 1 into both other statements. This does mean that our if branches are no longer independent: first we have to apply expression x1, then expression x2. I've labeled the expressions so I can refer to them later:
x1: if (!includemask || (includemask && (refdynzero || (!refdynzero && (refdyntimer || (!refdyntimer && outzero)))))) {
        *ref_dyn = 0;
    }
x2: if (includemask && (refdynzero || (!refdynzero && !refdyntimer && !outzero))) {
        (*ref_dyn)++;
    }
x3: if (!includemask || (includemask && !refdynzero && refdyntimer)) {
        *ref = *image_virgin;
    }
x4: if (includemask && !refdynzero && !refdyntimer && outzero) {
        *ref = (*ref + *image_virgin) / 2;
    }
Boolean simplification
It turns out, when you write out the truth tables, that (x | (~x & y)) == (x | y).
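If you'd rather not write out the table, a brute-force check over all boolean inputs confirms it (a standalone snippet, nothing Motion-specific):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* For 0/1 booleans, !x plays the role of ~x in the identity above. */
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++)
            assert((x | (!x & y)) == (x | y));
    puts("(x | (~x & y)) == (x | y) holds for all boolean x, y");
    return 0;
}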
Let's use this to simplify the expressions above.
Expression x1
x1: (!includemask || (includemask && (refdynzero || (!refdynzero && (refdyntimer || (!refdyntimer && outzero))))))
x1: (!includemask || (refdynzero || (!refdynzero && (refdyntimer || (!refdyntimer && outzero)))))
x1: (!includemask || (refdynzero || (refdyntimer || (!refdyntimer && outzero))))
x1: (!includemask || (refdynzero || refdyntimer || outzero))
Apply De Morgan's law: (~a | ~b) == ~(a & b):
x1: !(includemask && !(refdynzero || refdyntimer || outzero))
Expression x2
x2: (includemask && (refdynzero || (!refdynzero && (!refdyntimer && !outzero))))
x2: (includemask && (refdynzero || (!refdyntimer && !outzero)))
Apply De Morgan's other law: (~a & ~b) == ~(a | b)
x2: (includemask && (refdynzero || !(refdyntimer || outzero)))
With an eye on the SSE instruction set, we like the AND NOT construct, so apply De Morgan again: (~a | ~b) == ~(a & b):
x2: (includemask && !(!refdynzero && (refdyntimer || outzero)))
Reorder the terms to get a nice AND NOT statement:
x2: (includemask && !((refdyntimer || outzero) && !refdynzero))
Expression x3
x3: (!includemask || (includemask && (!refdynzero && refdyntimer)))
x3: (!includemask || (!refdynzero && refdyntimer))
Apply De Morgan to get an AND NOT form:
x3: !(includemask && !(!refdynzero && refdyntimer))
Reorder:
x3: !(includemask && !(refdyntimer && !refdynzero))
Expression x4
SSE does not have a NOT function, only AND NOT. It's tricky to concatenate AND NOT expressions when operations have an arity of 2, since (q && !x && !y && !z) is not the same as andnot(andnot(andnot(y, z), x), q). Avoid programmer headaches by replacing the multiple AND NOT expressions in x4 with:
x4: (includemask && !(refdynzero || refdyntimer) && outzero)
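A small standalone check, using a scalar stand-in for SSE's AND NOT (which computes (~a) & b), shows that the naive nesting really does diverge:

#include <stdio.h>

static unsigned andnot(unsigned a, unsigned b) { return ~a & b; }

int main(void)
{
    /* Compare q & !x & !y & !z against the naive left-to-right nesting. */
    for (unsigned q = 0; q <= 1; q++)
    for (unsigned x = 0; x <= 1; x++)
    for (unsigned y = 0; y <= 1; y++)
    for (unsigned z = 0; z <= 1; z++) {
        unsigned want  = q & !x & !y & !z;
        unsigned naive = andnot(andnot(andnot(y, z), x), q);
        if (want != naive)
            printf("q=%u x=%u y=%u z=%u: want %u, naive nesting gives %u\n",
                   q, x, y, z, want, naive);
    }
    return 0;
}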
Output so far
Here's where we stand:
x1: if (!(includemask && !(refdynzero || refdyntimer || outzero))) {
        *ref_dyn = 0;
    }
x2: if (includemask && !((refdyntimer || outzero) && !refdynzero)) {
        (*ref_dyn)++;
    }
x3: if (!(includemask && !(refdyntimer && !refdynzero))) {
        *ref = *image_virgin;
    }
x4: if (includemask && !(refdynzero || refdyntimer) && outzero) {
        *ref = (*ref + *image_virgin) / 2;
    }
Expression x1 is basically an AND NOT with *ref_dyn: where the condition is true, set *ref_dyn to 0. So it's an AND with the negation of the condition, which is convenient since the condition already starts with a negation. Simplify to:
x1: *ref_dyn &= (includemask && !(refdynzero || refdyntimer || outzero));
Note that the expressions in x1 and x4 share common parts. Reorder the statements so that these two are adjacent, to give the compiler a chance to optimize. Testing shows that this gives a very slight speedup. This leaves us with our final product:
x1: *ref_dyn &= (includemask && !(refdynzero || refdyntimer || outzero));
x4: if (includemask && !(refdynzero || refdyntimer) && outzero) {
        *ref = (*ref + *image_virgin) / 2;
    }
x2: if (includemask && !((refdyntimer || outzero) && !refdynzero)) {
        (*ref_dyn)++;
    }
x3: if (!(includemask && !(refdyntimer && !refdynzero))) {
        *ref = *image_virgin;
    }
This produces a function that should be equivalent to what we started with. Before starting work on the optimizations, I constructed a test harness in which I could hang both the original function and my replacements, and check if they gave the same outputs for all of the input range. I also added some timing code to get a handle on the exact speedups achieved. The benchmark suite shows that this function does give the same outcome as the original, and that its speed is virtually identical to the unoptimized version. That shouldn't be surprising since all we've done here is unroll the conditions, which is something any good compiler would also have done for the original code. The benefits come when we translate this to SSE intrinsics, which this code makes possible, and run the loop across 16 pixels at a time, using bitmasking instead of branching.
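Something along these lines is what I mean by a test rig. Below is a cut-down standalone sketch (the constants and function names are mine, and it compares the OR-grouped form derived above against the original branches rather than the SSE code), which exhaustively walks the per-pixel input space:

#include <stdio.h>
#include <stdlib.h>

#define THRESHOLD_REF 4
#define ACCEPT_TIMER  30

static void pixel_orig(unsigned char *ref, unsigned char virgin,
                       unsigned char smartmask, unsigned char out, int *ref_dyn)
{
    if ((abs(*ref - virgin) > THRESHOLD_REF) && smartmask) {
        if (*ref_dyn == 0)                  *ref_dyn = 1;
        else if (*ref_dyn > ACCEPT_TIMER) { *ref_dyn = 0; *ref = virgin; }
        else if (out)                       (*ref_dyn)++;
        else                              { *ref_dyn = 0; *ref = (*ref + virgin) / 2; }
    } else {
        *ref_dyn = 0;
        *ref = virgin;
    }
}

static void pixel_grouped(unsigned char *ref, unsigned char virgin,
                          unsigned char smartmask, unsigned char out, int *ref_dyn)
{
    int includemask = (abs(*ref - virgin) > THRESHOLD_REF) && smartmask;
    int refdynzero  = (*ref_dyn == 0);
    int refdyntimer = (*ref_dyn > ACCEPT_TIMER);
    int outzero     = (out == 0);

    if (!includemask || (includemask && !refdynzero &&
        (refdyntimer || (!refdyntimer && outzero))))                 *ref_dyn = 0;
    if (includemask && refdynzero)                                   *ref_dyn = 1;
    if (includemask && !refdynzero && !refdyntimer && !outzero)      (*ref_dyn)++;
    if (!includemask || (includemask && !refdynzero && refdyntimer)) *ref = virgin;
    if (includemask && !refdynzero && !refdyntimer && outzero)       *ref = (*ref + virgin) / 2;
}

int main(void)
{
    for (int ref0 = 0; ref0 < 256; ref0++)
    for (int virgin = 0; virgin < 256; virgin++)
    for (int sm = 0; sm <= 1; sm++)
    for (int out = 0; out <= 1; out++)
    for (int dyn0 = 0; dyn0 <= ACCEPT_TIMER + 1; dyn0++) {
        unsigned char ref_a = ref0, ref_b = ref0;
        int dyn_a = dyn0, dyn_b = dyn0;
        pixel_orig(&ref_a, virgin, sm ? 255 : 0, out ? 255 : 0, &dyn_a);
        pixel_grouped(&ref_b, virgin, sm ? 255 : 0, out ? 255 : 0, &dyn_b);
        if (ref_a != ref_b || dyn_a != dyn_b) {
            printf("mismatch: ref=%d virgin=%d sm=%d out=%d dyn=%d\n",
                   ref0, virgin, sm, out, dyn0);
            return 1;
        }
    }
    puts("all inputs match");
    return 0;
}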
The result above is still not branchless. What we're after is this, but with SSE instructions:
unsigned char mask, outp;

/* x1 */
*ref_dyn &= includemask & ~(refdynzero | refdyntimer | outzero);

/* x4 */
mask = includemask & ~(refdynzero | refdyntimer) & outzero;
outp = (*ref + *image_virgin) / 2;
*ref = (*ref & ~mask) | (outp & mask);

/* x2 */
mask = includemask & ~((refdyntimer | outzero) & ~refdynzero);
*ref_dyn = (*ref_dyn & ~mask) | ((*ref_dyn + 1) & mask);

/* x3 */
mask = includemask & ~(refdyntimer & ~refdynzero);
*ref = (*image_virgin & ~mask) | (*ref & mask);
Notice how we changed logical operations to bitwise operations here, on the assumption (true with SSE) that the masks will be either 0x00 or 0xFF. Running this in the test jig shows that it gives identical results to the original code, but is around 4× slower, depending on platform. I actually found that disappointing since I thought the autovectorizer would eat this for lunch.
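Writing the SSE version by hand, the same pattern maps almost one-to-one onto intrinsics: comparisons produce 0x00/0xFF byte masks, and an AND/ANDNOT pair plus an OR does the select. Here's a rough sketch of the building blocks (not the actual Motion code; the helper names and the particular blend shown are mine):

#include <emmintrin.h>  /* SSE2 */

/* Branchless per-byte select: where mask is 0xFF take a, else take b. */
static inline __m128i select_bytes(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

/* One 16-pixel step: compute |ref - virgin| > threshold as a byte mask,
 * then use it to blend between two candidate outputs. */
static void step16(unsigned char *ref, const unsigned char *virgin,
                   unsigned char threshold)
{
    __m128i vref    = _mm_loadu_si128((const __m128i *)ref);
    __m128i vvirgin = _mm_loadu_si128((const __m128i *)virgin);

    /* Absolute difference of unsigned bytes via two saturated subtractions. */
    __m128i absdiff = _mm_or_si128(_mm_subs_epu8(vref, vvirgin),
                                   _mm_subs_epu8(vvirgin, vref));

    /* absdiff > threshold  <=>  saturated (absdiff - threshold) != 0.
     * SSE has no NOT, so invert with AND NOT against all-ones. */
    __m128i thr = _mm_set1_epi8((char)threshold);
    __m128i le  = _mm_cmpeq_epi8(_mm_subs_epu8(absdiff, thr), _mm_setzero_si128());
    __m128i gt  = _mm_andnot_si128(le, _mm_set1_epi8((char)0xFF));

    /* Example blend: keep ref where the mask is set, take the rounded
     * average (ref + virgin + 1) / 2 elsewhere. */
    __m128i avg = _mm_avg_epu8(vref, vvirgin);
    _mm_storeu_si128((__m128i *)ref, select_bytes(gt, vref, avg));
}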
End result
The final results can be seen at my GitHub repository, specifically in this file.
Because this code is not portable beyond x86, while Motion is commonly deployed on small ARM servers like the Raspberry Pi, I did some tests to see whether I could rewrite the algorithms using GCC's architecture-agnostic vector extensions. My proof-of-concept shows that this can be done, but it's not a win: it's slower than the "plain" code by a factor of four or so, depending on platform.
The SSE intrinsics let you be clever and take advantage of things you can't express in plain C, like saturated subtraction, unpacking, built-in averaging, bitmask extraction and so on. With the vector extensions, the best you can do is spell these operations out in textbook fashion and hope the compiler catches the idiom. But at the time of writing, neither GCC nor LLVM seems very sharp at this, or perhaps they are not allowed to make these optimizations because they would violate the letter of the C standard. So that was a dead end performance-wise.
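For the curious, here's roughly what the vector-extension flavour of the select step looks like (a sketch in the spirit of my proof-of-concept, not the actual code; the function name is mine):

#include <string.h>

typedef unsigned char v16qu __attribute__((vector_size(16)));

static void step16_generic(unsigned char *ref, const unsigned char *virgin,
                           unsigned char threshold)
{
    v16qu vref, vvirgin;
    memcpy(&vref, ref, sizeof vref);
    memcpy(&vvirgin, virgin, sizeof vvirgin);

    v16qu thr = { threshold, threshold, threshold, threshold,
                  threshold, threshold, threshold, threshold,
                  threshold, threshold, threshold, threshold,
                  threshold, threshold, threshold, threshold };

    /* No saturated subtraction here, so spell out |ref - virgin| by hand and
     * hope the compiler recognizes the idiom. Vector comparisons give
     * 0x00 / 0xFF lanes, just like SSE. */
    v16qu bigger  = (v16qu)(vref > vvirgin);
    v16qu absdiff = (bigger & (vref - vvirgin)) | (~bigger & (vvirgin - vref));
    v16qu motion  = (v16qu)(absdiff > thr);

    /* Overflow-free average (a & b) + ((a ^ b) >> 1), then the usual blend. */
    v16qu avg = (vref & vvirgin) + ((vref ^ vvirgin) >> 1);
    v16qu res = (motion & vref) | (~motion & avg);
    memcpy(ref, &res, sizeof res);
}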