Vectorizing a loop by making it branchless

September 10, 2014

My home server runs Motion for detecting movement on my webcam. Because it's an underpowered Atom machine, the image processing was responsible for roughly 12% constant load. That was a bit much for such a simple task, I thought, so I decided to profile Motion with Callgrind to get an idea of where the bottlenecks were. I hoped that maybe a bit of vectorized code would ease them out. This seemed promising, because it appeared unlikely that the functions were already being autovectorized. The loops were full of data-dependent branches, and didn't look very optimizer-friendly.

To skip straight to the ending, my attempts at tackling the main bottleneck eventually paid off in a 2× speedup and a matching drop in load to around 6%. This is the story in lab journal form of how I achieved that result. It's also a step-by-step derivation of, and elaborate comment on, the final algorithm the vectorized code uses.

(I'll definitely use this format again if I ever need to do something like this. My first attempts at tackling this function failed, because I wasn't documenting my steps and would make small, subtle mistakes that invalidated the whole thing. Also, a test rig that compares the SSE work-in-progress to the known good function for the whole range of inputs is absolutely invaluable. Add some benchmarking to keep the personal satisfaction level up.)

So, running Callgrind and looking at the cost table showed that the most expensive function was alg_update_reference_frame. Here's the whole thing for completeness. It may look intimidating, but exactly what is going on in image processing terms is not relevant to us here. Also, we're only going to look at the part in the for loop.

/** 
 * alg_update_reference_frame
 *
 *   Called from 'motion_loop' to calculate the reference frame
 *   Moving objects are excluded from the reference frame for a certain
 *   amount of time to improve detection.
 * 
 * Parameters:
 *
 *   cnt    - current thread's context struct
 *   action - UPDATE_REF_FRAME or RESET_REF_FRAME
 *
 */
#define ACCEPT_STATIC_OBJECT_TIME 10  /* Seconds */
#define EXCLUDE_LEVEL_PERCENT 20
void alg_update_reference_frame(struct context *cnt, int action) 
{
    int accept_timer = cnt->lastrate * ACCEPT_STATIC_OBJECT_TIME;
    int i, threshold_ref;
    int *ref_dyn = cnt->imgs.ref_dyn;
    unsigned char *image_virgin = cnt->imgs.image_virgin;
    unsigned char *ref = cnt->imgs.ref;
    unsigned char *smartmask = cnt->imgs.smartmask_final;
    unsigned char *out = cnt->imgs.out;

    if (cnt->lastrate > 5)  /* Match rate limit */
        accept_timer /= (cnt->lastrate / 3);

    if (action == UPDATE_REF_FRAME) { /* Black&white only for better performance. */
        threshold_ref = cnt->noise * EXCLUDE_LEVEL_PERCENT / 100;

        for (i = cnt->imgs.motionsize; i > 0; i--) {
            /* Exclude pixels from ref frame well below noise level. */
            if (((int)(abs(*ref - *image_virgin)) > threshold_ref) && (*smartmask)) {
                if (*ref_dyn == 0) { /* Always give new pixels a chance. */
                    *ref_dyn = 1;
                } else if (*ref_dyn > accept_timer) { /* Include static Object after some time. */
                    *ref_dyn = 0;
                    *ref = *image_virgin;
                } else if (*out) {
                    (*ref_dyn)++; /* Motionpixel? Keep excluding from ref frame. */
                } else {
                    *ref_dyn = 0; /* Nothing special - release pixel. */
                    *ref = (*ref + *image_virgin) / 2;
                }

            } else {  /* No motion: copy to ref frame. */
                *ref_dyn = 0; /* Reset pixel */
                *ref = *image_virgin;
            }

            ref++;
            image_virgin++;
            smartmask++;
            ref_dyn++;
            out++;
        } /* end for i */

    } else {   /* action == RESET_REF_FRAME - also used to initialize the frame at startup. */
        /* Copy fresh image */
        memcpy(cnt->imgs.ref, cnt->imgs.image_virgin, cnt->imgs.size);
        /* Reset static objects */
        memset(cnt->imgs.ref_dyn, 0, cnt->imgs.motionsize * sizeof(*cnt->imgs.ref_dyn));
    }
}

The take-away is that this loops over all pixels individually, does a few checks against masks and timers, and writes some outputs to the frames at *ref_dyn and *ref. The problem with this loop is that it can't easily be vectorized (run in parallel over multiple pixels at the same time), because all the branches are data-dependent. Each pixel can take a potentially different path through the routine.

We tackle this by converting this into branchless code. Each branch becomes a masking operation. We calculate all outputs for all pixels in parallel, and composite the outputs into the result with bitmasks. We'll be doing much more work than we do for a single pixel, but still get a speedup because we knock out 16 pixels at a time.
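
To make the trick concrete before we start, here's the masking idea in scalar form. This is a toy sketch, not Motion code: a comparison is turned into an all-zeros or all-ones mask, and the mask then picks one of two unconditionally computed results, with no branch in sight.

/* Toy example of a branchless select. Both candidate results are
 * computed unconditionally; a 0x00/0xFF mask then picks one of them,
 * which is exactly what the SSE version will later do for 16 pixels
 * at once. */
unsigned char select_demo(unsigned char a, unsigned char b,
                          unsigned char threshold)
{
    /* (a > threshold) is 0 or 1; arithmetic negation turns that
     * into an all-zeros or all-ones byte. */
    unsigned char mask = -(unsigned char)(a > threshold);

    unsigned char if_true  = b;            /* used where mask is 0xFF */
    unsigned char if_false = (a + b) / 2;  /* used where mask is 0x00 */

    return (if_true & mask) | (if_false & ~mask);
}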

Create and apply bitmasks

Let's start by taking just the loop, stripping out the comments, and untangling data and logic by putting the conditions from the if statements into their own boolean variables:

for (i = cnt->imgs.motionsize; i > 0; i--)
{
	int thresholdmask = ((int)(abs(*ref - *image_virgin)) > threshold_ref);
	int includemask = (thresholdmask && !(*smartmask == 0));
	int refdynzero = (*ref_dyn == 0);
	int refdyntimer = (*ref_dyn > accept_timer);
	int outzero = (*out == 0);

	if (includemask) {
		if (refdynzero) {
			*ref_dyn = 1;
		} else if (refdyntimer) {
			*ref_dyn = 0;
			*ref = *image_virgin;
		} else if (!outzero) {
			(*ref_dyn)++;
		} else {
			*ref_dyn = 0;
			*ref = (*ref + *image_virgin) / 2;
		}

	} else {
		*ref_dyn = 0;
		*ref = *image_virgin;
	}

	ref++;
	image_virgin++;
	smartmask++;
	ref_dyn++;
	out++;
}

Make branch statements independent

Now let's turn the decision tree into a series of independent if statements. The main if/else becomes a part contingent on if (includemask), and a part for the inverse, if (!includemask). The other conditions get the same treatment. Since no two of the conditions can be true at the same time, the statements have now become independent of each other:

int thresholdmask = ((int)(abs(*ref - *image_virgin)) > threshold_ref);
int includemask = (thresholdmask && !(*smartmask == 0));
int refdynzero = (*ref_dyn == 0);
int refdyntimer = (*ref_dyn > accept_timer);
int outzero = (*out == 0);

if (includemask && refdynzero) {
	*ref_dyn = 1;
}
if (includemask && !refdynzero && refdyntimer) {
	*ref_dyn = 0;
	*ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
	(*ref_dyn)++;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
	*ref_dyn = 0;
	*ref = (*ref + *image_virgin) / 2;
}
if (!includemask) {
	*ref_dyn = 0;
	*ref = *image_virgin;
}

Split into ref and ref_dyn

Because the statements are independent, duplicate and split them into statements that set the value of *ref_dyn, and statements that set the value of *ref:

if (includemask && refdynzero) {
	*ref_dyn = 1;
}
if (includemask && !refdynzero && refdyntimer) {
	*ref_dyn = 0;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
	(*ref_dyn)++;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
	*ref_dyn = 0;
}
if (!includemask) {
	*ref_dyn = 0;
}

if (includemask && !refdynzero && refdyntimer) {
	*ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
	*ref = (*ref + *image_virgin) / 2;
}
if (!includemask) {
	*ref = *image_virgin;
}

Reorder steps for same values

Reorder the clauses so that like results are with like:

if (!includemask) {
	*ref_dyn = 0;
}
if (includemask && !refdynzero && refdyntimer) {
	*ref_dyn = 0;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
	*ref_dyn = 0;
}
if (includemask && refdynzero) {
	*ref_dyn = 1;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
	(*ref_dyn)++;
}

if (!includemask) {
	*ref = *image_virgin;
}
if (includemask && !refdynzero && refdyntimer) {
	*ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
	*ref = (*ref + *image_virgin) / 2;
}

Group if statements for similar outcomes with an OR

Any if conditions that lead to the same outcome can be OR'ed together:

if (!includemask || (includemask && !refdynzero && (refdyntimer || (!refdyntimer && outzero)))) {
	*ref_dyn = 0;
}
if (includemask && refdynzero) {
	*ref_dyn = 1;
}
if (includemask && !refdynzero && !refdyntimer && !outzero) {
	(*ref_dyn)++;
}

if (!includemask || (includemask && !refdynzero && refdyntimer)) {
	*ref = *image_virgin;
}
if (includemask && !refdynzero && !refdyntimer && outzero) {
	*ref = (*ref + *image_virgin) / 2;
}

Merge conditions

Observe that setting *ref_dyn to 1 is identical to setting it to 0 and incrementing it. Merge the condition in the if statement that sets *ref_dyn to 1 into both other statements. This does mean that our if branches are no longer independent: first we have to apply expression x1, then expression x2. I've labeled the expressions so I can refer to them later:

x1:	if (!includemask || (includemask && (refdynzero || (!refdynzero && (refdyntimer || (!refdyntimer && outzero)))))) {
		*ref_dyn = 0;
	}
x2:	if (includemask && (refdynzero || (!refdynzero && !refdyntimer && !outzero))) {
		(*ref_dyn)++;
	}

x3:	if (!includemask || (includemask && !refdynzero && refdyntimer)) {
		*ref = *image_virgin;
	}
x4:	if (includemask && !refdynzero && !refdyntimer && outzero) {
		*ref = (*ref + *image_virgin) / 2;
	}

Boolean simplification

It turns out, when you write out the truth tables, that (x | (~x & y)) == (x | y): the second term can simply drop its ~x. Let's use this rule to simplify the expressions above.
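
If you'd rather let the computer write out the truth tables, a brute-force check of this rule (and of the De Morgan rules we'll lean on in a moment) is tiny:

#include <assert.h>

/* Brute-force check of the boolean identities used in this section,
 * over every combination of the two inputs. */
int main(void)
{
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++) {
            assert((x || (!x && y)) == (x || y));   /* absorption    */
            assert((!x || !y) == !(x && y));        /* De Morgan, #1 */
            assert((!x && !y) == !(x || y));        /* De Morgan, #2 */
        }
    return 0;
}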

Expression x1

x1:	(!includemask || (includemask && (refdynzero || (!refdynzero && (refdyntimer || (!refdyntimer && outzero))))))
x1:	(!includemask || (refdynzero || (!refdynzero && (refdyntimer || (!refdyntimer && outzero)))))
x1:	(!includemask || (refdynzero || (refdyntimer || (!refdyntimer && outzero))))
x1:	(!includemask || (refdynzero || refdyntimer || outzero))

Apply De Morgan's law: (~a | ~b) == ~(a & b):

x1:	!(includemask && !(refdynzero || refdyntimer || outzero))

Expression x2

x2:	(includemask && (refdynzero || (!refdynzero && (!refdyntimer && !outzero))))
x2:	(includemask && (refdynzero || (!refdyntimer && !outzero)))

Apply De Morgan's other law: (~a & ~b) == ~(a | b):

x2:	(includemask && (refdynzero || !(refdyntimer || outzero)))

With an eye on the SSE instruction set, we like the AND NOT construct, so apply De Morgan again: (~a | ~b) == ~(a & b):

x2:	(includemask && !(!refdynzero && (refdyntimer || outzero)))

Reorder the terms to get a nice AND NOT statement:

x2:	(includemask && !((refdyntimer || outzero) && !refdynzero))

Expression x3

x3:	(!includemask || (includemask && (!refdynzero && refdyntimer)))
x3:	(!includemask || (!refdynzero && refdyntimer))

Apply De Morgan to get an AND NOT form:

x3:	!(includemask && !(!refdynzero && refdyntimer))

Reorder:

x3:	!(includemask && !(refdyntimer && !refdynzero))

Expression x4

SSE does not have a NOT function, only AND NOT. It's tricky to concatenate AND NOT expressions when operations have an arity of 2, since (q && !x && !y && !z) is not the same as andnot(andnot(andnot(y, z), x), q). Avoid programmer headaches by replacing the multiple AND NOT expressions in x4 with:

x4:	(includemask && !(refdynzero || refdyntimer) && outzero)
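
As an aside, this is what that form buys us later: the SSE AND NOT primitive, _mm_andnot_si128(a, b), computes (~a & b), so the rewritten x4 needs only a single AND NOT. Here's a sketch, assuming the boolean masks are already sitting in SSE registers:

#include <emmintrin.h>

/* x4 as it could look in SSE2:
 * includemask & ~(refdynzero | refdyntimer) & outzero
 * rewritten as andnot(refdynzero | refdyntimer, includemask & outzero). */
static inline __m128i x4_mask(__m128i includemask, __m128i refdynzero,
                              __m128i refdyntimer, __m128i outzero)
{
    return _mm_andnot_si128(_mm_or_si128(refdynzero, refdyntimer),
                            _mm_and_si128(includemask, outzero));
}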

Output so far

Here's where we stand:

x1:	if (!(includemask && !(refdynzero || refdyntimer || outzero))) {
		*ref_dyn = 0;
	}
x2:	if (includemask && !((refdyntimer || outzero) && !refdynzero)) {
		(*ref_dyn)++;
	}

x3:	if (!(includemask && !(refdyntimer && !refdynzero))) {
		*ref = *image_virgin;
	}
x4:	if (includemask && !(refdynzero || refdyntimer) && outzero) {
		*ref = (*ref + *image_virgin) / 2;
	}

Expression x1 is basically an AND NOT with *ref_dyn: where the condition is true, set *ref_dyn to 0. So it's an AND with the negation of the condition, which is convenient since the condition already starts with a negation. In plain C the 0-or-1 result of that boolean expression first has to be widened into an all-zeros or all-ones mask (arithmetic negation does the job; the SSE comparisons will later hand us such masks for free). Simplify to:

x1:	*ref_dyn &= -(includemask && !(refdynzero || refdyntimer || outzero));

Note that the expressions in x1 and x4 share common parts. Reorder the statements so that they are adjacent, to give the compiler a chance for optimization. Testing shows that this gives a very slight speedup. This leaves us with our final product:

x1:	*ref_dyn &= -(includemask && !(refdynzero || refdyntimer || outzero));

x4:	if (includemask && !(refdynzero || refdyntimer) && outzero) {
		*ref = (*ref + *image_virgin) / 2;
	}
x2:	if (includemask && !((refdyntimer || outzero) && !refdynzero)) {
		(*ref_dyn)++;
	}
x3:	if (!(includemask && !(refdyntimer && !refdynzero))) {
		*ref = *image_virgin;
	}

This produces a function that should be equivalent to what we started with. Before starting work on the optimizations, I constructed a test harness in which I could hang both the original function and my replacements, and check if they gave the same outputs for all of the input range. I also added some timing code to get a handle on the exact speedups achieved. The benchmark suite shows that this function does give the same outcome as the original, and that its speed is virtually identical to the unoptimized version. That shouldn't be surprising since all we've done here is unroll the conditions, which is something any good compiler would also have done for the original code. The benefits come when we translate this to SSE intrinsics, which this code makes possible, and run the loop across 16 pixels at a time, using bitmasking instead of branching.
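
Boiled down to a single pixel, the comparison half of that harness looks roughly like the code below. The function names and parameter values are made up for the example; the real rig runs both versions over whole frames of Motion data and also times them.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One pixel of the original, branchy logic. */
static void pixel_orig(unsigned char *ref, unsigned char virgin,
                       unsigned char smartmask, unsigned char out,
                       int *ref_dyn, int threshold_ref, int accept_timer)
{
    if ((abs(*ref - virgin) > threshold_ref) && smartmask) {
        if (*ref_dyn == 0) {
            *ref_dyn = 1;
        } else if (*ref_dyn > accept_timer) {
            *ref_dyn = 0;
            *ref = virgin;
        } else if (out) {
            (*ref_dyn)++;
        } else {
            *ref_dyn = 0;
            *ref = (*ref + virgin) / 2;
        }
    } else {
        *ref_dyn = 0;
        *ref = virgin;
    }
}

/* One pixel of the rewritten logic (the "final product" above). */
static void pixel_new(unsigned char *ref, unsigned char virgin,
                      unsigned char smartmask, unsigned char out,
                      int *ref_dyn, int threshold_ref, int accept_timer)
{
    int includemask = (abs(*ref - virgin) > threshold_ref) && smartmask;
    int refdynzero  = (*ref_dyn == 0);
    int refdyntimer = (*ref_dyn > accept_timer);
    int outzero     = (out == 0);

    *ref_dyn &= -(includemask && !(refdynzero || refdyntimer || outzero));

    if (includemask && !(refdynzero || refdyntimer) && outzero)
        *ref = (*ref + virgin) / 2;
    if (includemask && !((refdyntimer || outzero) && !refdynzero))
        (*ref_dyn)++;
    if (!(includemask && !(refdyntimer && !refdynzero)))
        *ref = virgin;
}

int main(void)
{
    /* Sweep a representative slice of the input space and compare. */
    for (int r = 0; r < 256; r += 3)
    for (int v = 0; v < 256; v += 3)
    for (int sm = 0; sm <= 1; sm++)
    for (int o = 0; o <= 1; o++)
    for (int dyn = 0; dyn <= 60; dyn += 5) {
        unsigned char ref_a = r, ref_b = r;
        int dyn_a = dyn, dyn_b = dyn;

        pixel_orig(&ref_a, v, sm, o, &dyn_a, 10, 50);
        pixel_new (&ref_b, v, sm, o, &dyn_b, 10, 50);

        assert(ref_a == ref_b && dyn_a == dyn_b);
    }
    puts("all good");
    return 0;
}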

The rewritten loop is still not branchless, though. What we're after is this, but with SSE instructions:

unsigned char mask, outp;

*ref_dyn &= includemask & ~(refdynzero | refdyntimer | outzero);

mask = includemask & ~(refdynzero | refdyntimer) & outzero;
outp = (*ref + *image_virgin) / 2;
*ref = (*ref & ~mask) | (outp & mask);

mask = includemask & ~((refdyntimer | outzero) & ~refdynzero);
*ref_dyn = (*ref_dyn & ~mask) | ((*ref_dyn + 1) & mask);

mask = includemask & ~(refdyntimer & ~refdynzero);
*ref = (*image_virgin & ~mask) | (*ref & mask);

Notice how we changed logical operations to bitwise operations here, on the assumption (true with SSE) that the masks will be either 0x00 or 0xFF. Running this in the test jig shows that it gives identical results to the original code, but is around 4× slower, depending on platform. I actually found that disappointing since I thought the autovectorizer would eat this for lunch.
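
To give a taste of where this ends up, here is roughly how the last of those selects (expression x3) comes out in SSE2 intrinsics, processing 16 pixels per iteration. This is a sketch under the same 0x00/0xFF mask assumption, not the code from the repository:

#include <emmintrin.h>

/* Sketch of expression x3 for 16 pixels at once:
 * *ref = (*image_virgin & ~mask) | (*ref & mask)
 * The mask register holds 0x00 or 0xFF per byte. */
static inline void x3_select_16(unsigned char *ref,
                                const unsigned char *image_virgin,
                                __m128i mask)
{
    __m128i r = _mm_loadu_si128((const __m128i *)ref);
    __m128i v = _mm_loadu_si128((const __m128i *)image_virgin);

    /* andnot(mask, v) = ~mask & v: take the new pixel where the mask
     * is clear, keep the old one where it is set. */
    __m128i sel = _mm_or_si128(_mm_andnot_si128(mask, v),
                               _mm_and_si128(r, mask));

    _mm_storeu_si128((__m128i *)ref, sel);
}

The other selects follow the same pattern; the thresholding and averaging steps are where the SSE-specific goodies mentioned below come in.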

End result

The final results can be seen at my GitHub repository, specifically in this file.

Because this code is not portable beyond x86, but Motion is commonly deployed on small ARM servers like the Raspberry Pi, I did some tests to see whether I could rewrite the algorithms using GCC's architecture-agnostic vector extensions. My proof-of-concept shows that this can be done, but it's not a win: it's slower than the "plain" code by a factor of four or so, with results varying by platform.

The SSE intrinsics let you be clever and take advantage of things you can't express in plain C, like saturated subtraction, unpacking, built-in averaging, bitmask extraction and so on. With the vector extensions, the best you can do is spell these operations out in textbook fashion and hope the compiler catches the idiom. But at the time of writing, neither GCC nor LLVM seems very sharp at this, or perhaps they aren't allowed to make these optimizations because doing so would violate the letter of the C standard. So that was a dead end performance-wise.
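
For completeness, this is roughly what that attempt looked like: the same x3 select, but written with the generic vector types. It's a sketch, assuming a GCC or Clang recent enough to support bitwise operators on vector types, and as described above it ended up far slower in my tests.

/* The x3 select using GCC's vector extensions instead of SSE
 * intrinsics. Comparisons on vector types produce 0 / all-ones lanes,
 * so masks can be built and applied portably. */
typedef unsigned char v16qu __attribute__ ((vector_size (16)));

v16qu x3_select_vec(v16qu ref, v16qu image_virgin, v16qu mask)
{
    /* (~mask & image_virgin) | (ref & mask) */
    return (~mask & image_virgin) | (ref & mask);
}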