Fast Base64 encoding/decoding with SSE vectorization

October 29, 2014

In this installment of "SSE vector programming for fun and profit", we're going to write a Base64 encoder that's about four times faster than everything else available. The results of this process can be found in my Base64 library at GitHub, which is licensed under the BSD 2-clause.

Before we get to Base64, we have to take a closer look at how exactly strings are stored inside an SSE register.

Normally when working with arrays of integers or floats, you load values into an SSE register, do your arithmetic, and write the result back out to memory. You don't care in which particular lane your variables land, as long as you put four floats in and get four floats back. However, when talking about strings, and specifically about shuffling and shifting bytes in certain directions, it becomes critical to know how those strings are stored. What does it mean to left shift a string? You need to know how "left" is defined to answer that question.

SSE, strings and endianness

Like much in the Intel world, SSE registers are little-endian. No problem so far; this still means that the lowest location in memory becomes the lowest byte element in the vector. If you have the string ABCDEFGHIJKLMNOP, and you load it into a register with _mm_loadu_si128(), the A will be in byte lane 0 and P will be in 15. (Lane numbers come into play when shuffling elements.) You can check that this is the way it works by running the following demo program, which extracts and prints the lowest 4 bytes from the vector:

#include <stdio.h>
#include <emmintrin.h>

union {
	int low;
	char str[5];
}
shared;

int main ()
{
	// Create vector from string:
	__m128i vec = _mm_loadu_si128((__m128i *)"ABCDEFGHIJKLMNOP");

	// Move lowest 32 bits to int:
	shared.low = _mm_cvtsi128_si32(vec);

	// Print as character array:
	shared.str[4] = '\0';
	puts(shared.str);

	// This prints the string 'ABCD'.
}

But here is the confusing (initially) part: inside the vector, when talking about directions for the purposes of shifting, it makes sense to think of the byte lanes as ordered left to right from highest to lowest lane. The A in our example is the rightmost byte. The string is conceptually stored in reverse order:

PONM LKJI HGFE DCBA
   |    |    |    |
  12    8    4    0

To demonstrate that this is actually how it works, here's what happens when you right shift by one byte:

#include <stdio.h>
#include <emmintrin.h>

int main ()
{
	char str[17] = { 0 };

	// Create vector from string:
	__m128i vec = _mm_loadu_si128((__m128i *)"ABCDEFGHIJKLMNOP");

	// Right-shift by 1 byte, store result:
	_mm_storeu_si128((__m128i *)str, _mm_srli_si128(vec, 1));

	// Print the result:
	puts(str);

	// This prints the string 'BCDEFGHIJKLMNOP'.
}

So a logical right shift inside your vector becomes a logical left shift of your string when written back to memory. Start thinking of your strings as stored in reverse for what we're going to do next.

The Base64 algorithm

The core of the Base64 algorithm is really simple. Take three adjacent bytes and redivide their bits over four bytes, with six bits to each byte. The upper two bits of the output bytes are zeroed. The bytes in this story are in "proper" string order, so A shifts over into B, B into C, and most of C is shifted eight places to the right into a new, fourth carry byte. Move your mouse over the illustration below to see this in action.

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
A7
A6
A5
A4
A3
A2
A1
A0
B7
B6
B5
B4
B3
B2
B1
B0
C7
C6
C5
C4
C3
C2
C1
C0

You now have four bytes with values in the range from 0 to 63. The characters with those values are not printable ASCII, so to make the final output, the values are replaced with printable characters. 0 to 25 map to A–Z, 26 to 51 map to a–z, 52 to 61 map to 0–9, 62 maps to the character +, and 63 maps to the character /. If the input is not a multiple of three bytes, the output is padded with = characters to make four-byte output blocks. The details of that won't concern us, that's for RFC 4648. We're strictly going to encode full three-byte blocks into four-byte blocks.

With one 128-bit vector operation, we can turn 12 input bytes into 16 output bytes. First we load the string. Then we shuffle the bytes in the vector so that we are left with the twelve that we need. Note above that the third byte in every block of four is shifted over by eight bits. Duplicating every third byte into every fourth will place a copy of the third byte in its final resting place. The original fourth byte bumps up one position to the fifth place. In the resulting string below, the last four bytes have all been pushed out to make room for the carry bytes:

LLKJ IIHG FFED CCBA

From here on, we're treating each block of four as separate, parallel 32-bit integers, and we will be using the 32-bit wide shifting operations on them.

If we right-shift now, the low bits from B will be shifted into the high bits of A instead of the high bits of C. If we "solve" this by left-shifting instead, we're not moving the lower bits, but the upper bits, which is incorrect. Can't go without right shifting. So to get the bits in each block of four to flow into the correct neighbour, we must reorder the bytes once more into a big-endian format:

JKLL GHII DEFF ABCC

Now we can right-shift and mask normally. A flows into B, B flows into C. All we need to do in the output stage is reverse the bytes again.

Putting this all together, you get the following code, which is the meat of my SSE accelerated Base64 encoding routine without all the boilerplate:

while (srclen >= 12)
{
	__m128i str, mask, res, blockmask;
	__m128i s1, s2, s3, s4, s5;
	__m128i s1mask, s2mask, s3mask, s4mask;

	// Load string:
	str = _mm_loadu_si128((__m128i *)ipos);

	// Reorder to 32-bit big-endian, duplicating the third byte in every block of four.
	// This copies the third byte to its final destination, so we can include it later
	// by just masking instead of shifting and masking.
	// The workset must be in big-endian, otherwise the shifted bits do not carry over
	// properly among adjacent bytes:
	str = _mm_shuffle_epi8(str,
	      _mm_setr_epi8(2, 2, 1, 0, 5, 5, 4, 3, 8, 8, 7, 6, 11, 11, 10, 9));

	// Mask to pass through only the lower 6 bits of one byte;
	mask = _mm_set1_epi32(0x3F000000);

	// Shift bits by 2, mask in only the first byte:
	res = _mm_srli_epi32(str, 2) & mask;
	mask = _mm_srli_epi32(mask, 8);

	// Shift bits by 4, mask in only the second byte:
	res |= _mm_srli_epi32(str, 4) & mask;
	mask = _mm_srli_epi32(mask, 8);

	// Shift bits by 6, mask in only the third byte:
	res |= _mm_srli_epi32(str, 6) & mask;
	mask = _mm_srli_epi32(mask, 8);

	// No shift necessary for the fourth byte because we duplicated
	// the third byte to this position; just mask:
	res |= str & mask;

	// Reorder to 32-bit little-endian:
	res = _mm_shuffle_epi8(res,
	      _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12));

	// The bits have now been shifted to the right locations;
	// translate their values 0..63 to the Base64 alphabet:

	// set 1: 0..25, "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	s1mask = _mm_cmplt_epi8(res, _mm_set1_epi8(26));
	blockmask = s1mask;

	// set 2: 26..51, "abcdefghijklmnopqrstuvwxyz"
	s2mask = _mm_andnot_si128(blockmask, _mm_cmplt_epi8(res, _mm_set1_epi8(52)));
	blockmask |= s2mask;

	// set 3: 52..61, "0123456789"
	s3mask = _mm_andnot_si128(blockmask, _mm_cmplt_epi8(res, _mm_set1_epi8(62)));
	blockmask |= s3mask;

	// set 4: 62, "+"
	s4mask = _mm_andnot_si128(blockmask, _mm_cmplt_epi8(res, _mm_set1_epi8(63)));
	blockmask |= s4mask;

	// set 5: 63, "/"
	// Everything that is not blockmasked

	// Create the masked character sets:
	s1 = s1mask & _mm_add_epi8(res, _mm_set1_epi8('A'));
	s2 = s2mask & _mm_add_epi8(res, _mm_set1_epi8('a' - 26));
	s3 = s3mask & _mm_add_epi8(res, _mm_set1_epi8('0' - 52));
	s4 = s4mask & _mm_set1_epi8('+');
	s5 = _mm_andnot_si128(blockmask, _mm_set1_epi8('/'));

	// Blend all the sets together and store:
	_mm_storeu_si128((__m128i *)opos, s1 | s2 | s3 | s4 | s5);

	ipos += 12;	// 3 * 4 bytes of input
	opos += 16;	// 4 * 4 bytes of output
	srclen -= 12;
}
*outlen = opos - out;

Base64 decoding

Decoding a Base64 string is also possible using SSE instructions, but that's left as an exercise for the reader. My Base64 library includes an SSE accelerated decoder.