SSE string functions for x64 Linux

November 5, 2014

The story of this page is that I was nerd sniped. While optimizing a strcasecmp() bottleneck at work, I discovered the assembly language versions of basic C string routines at strchr.com, and they tickled me into getting back into assembly language. There's something about a problem that's just difficult enough to be interesting, and just easy enough to be tractable as a side project.

I'd programmed in x86 assembly language before, as a teenager, but that was on a 486 that ran 16-bit DOS (with segments and offsets), with only a 16-bit assembler available to me, so I had to splice in my own bloody 32-bit opcodes with db directives. Still, I have many fond memories of small moments of magic, like finally getting my screen13 scrolling routines to run properly inside my PowerBasic demo programs. I later went on to do some assembly work on the Z80 processor of my TI-83 calculator (a weird platform), and even some Win32 assembly with MASM, but then my interest faded. Assembly language made for fantastic rainy-afternoon puzzles, but lacked the power to build actually complex, useful things.

Now that the '90s are over, open-source tooling is easily available, and we have fast 64-bit multicore CPUs that run everything in a single clock cycle, playing around with assembly is a lot more fun. Let's get to work! I present on this page a selection of small string routines written for 64-bit Linux platforms. I doubt that these are competitive in speed with what common libraries use (they unroll their loops, worry about things like data dependencies and cache performance, and actually benchmark their code), but it's been a fun puzzle.

All routines shown are written for NASM in Intel assembly syntax (the One True Form) and assume 64-bit (System V AMD64 ABI) calling conventions. These are so much nicer than the 32-bit calling conventions. The first two arguments (that's all we need here) are passed in rdi and rsi, return values go in rax, and the only registers the callee needs to save are rbp, rbx and r12–r15 (source). The routines use SSE vector instructions where applicable, and consist of position-independent code, so that they can be included in shared libraries. Because this is the bright new x64 universe for which NASM has good defaults, we don't need to set up segments, but can get away with declaring our functions and our data global.

Assembling and linking

Let's say you have strlen.asm. Assemble it into an object file named strlen.o like so:

nasm -f elf64 strlen.asm

Add a function declaration to your C program so the compiler knows how to call the function. I've added these in separate blocks for the functions below:

size_t strlen (const char *s);

In the linker stage, simply add the object file to the list of object files on the compiler's command line to link it into the final binary, for example:

cc -o myprogram myprogram.c strlen.o

Caveat

These functions are all unaware of how long a string is, and will read 16 bytes of memory at a time until a zero byte is found. This means that they can read up to 15 bytes past the end of the string; if the string ends close to an unmapped page, that over-read can cause a memory access violation.

All these functions are only lightly tested. In production code, use the functions provided by your regular libc; they'll probably be faster and battle-hardened. This is a hobby project, no guarantees given or implied, etc.

License

Because code must have a license, I declare these functions to be licensed under the 2-clause BSD license.

strlen_sse2

Determine the length of a zero-terminated string, using only SSE2-level instructions. This code looks a lot like the code at strchr.com, but that's because this really is the most condensed, simplified way to write it. When you're working at this level, there is not much opportunity for stylistic flourish.

size_t strlen_sse2 (const char *s);
global strlen_sse2:function

strlen_sse2:

	mov       rax, -16		; 64-bit init to -16
	pxor      xmm0, xmm0		; zero the comparison register

.loop:	add       rax, 16		; add 16 to the offset
	movdqu    xmm1, [rdi + rax]	; unaligned string read
	pcmpeqb   xmm1, xmm0		; compare string against zeroes
	pmovmskb  ecx, xmm1		; create bitmask from result
	test      ecx, ecx		; set flags based on bitmask
	jz        .loop			; no bits set means no zeroes found

	bsf       ecx, ecx		; find position of first set bit
	add       rax, rcx		; 64-bit add position to offset
	ret

strlen_sse4

Determine the length of a zero-terminated string, using the SSE4.2 pcmpistri string instruction. With mode 0x08 (EQUAL_EACH) and an all-zero xmm0, the instruction sets ZF when the 16-byte memory operand contains a null byte, and leaves the index of the first null in ecx.

size_t strlen_sse4 (const char *s);
global strlen_sse4:function

strlen_sse4:

	mov        rax, -16
	pxor       xmm0, xmm0

.loop:	; Must perform addition before the string comparison;
	; doing it after would interfere with the status flags:

	add        rax, 16
	pcmpistri  xmm0, [rdi + rax], 0x08	; EQUAL_EACH
	jnz        .loop

	add        rax, rcx
	ret

strcmp_x64

Compare zero-terminated string a against string b. Returns zero if the strings are equal, less than zero if a < b, and greater than zero if a > b.

Actually, this function makes stronger guarantees about its return value, so that it is identical (and thus comparable) to glibc's strcmp: it returns the signed difference between the first pair of differing characters, taken as unsigned bytes. For example, if the first character of a is 0 and that of b is 0xFF, it returns -255.

int strcmp_x64 (const char *s1, const char *s2);
global strcmp_x64:function

strcmp_x64:

	xor	eax, eax
	xor	ecx, ecx
	xor	rdx, rdx

.loop:	mov	cl, [rsi + rdx]	; fetch bytes
	mov	al, [rdi + rdx]

	mov	r8b, cl		; move byte to temporary
	or	r8b, al		; if both bytes are null,
	jz	.ret		; strings are equal

	sub	ax, cx		; subtract bytes,
	jne	.ext		; result from -255 to 255

	inc	rdx		; ax was found to be zero,
	jmp	.loop		; go for next round

.ext:	cwde			; sign-extend ax to eax
	cdqe			; sign-extend eax to rax

.ret:	ret

strcmpeq_x64

Comparing strings for equality is one of the most popular uses for strcmp, but strcmp is slightly inefficient for that purpose because of the extra arithmetic performed on the differing bytes. If we're only interested in equality, we can take some shortcuts. The family of strcmpeq functions is a nonstandard variation of strcmp that only checks whether two strings are equal or not, and returns 0 or 1 (which can be presented to a compiler as a C99 bool).

Below is the tightest strcmpeq loop I could construct using only non-SIMD x64 instructions.

int strcmpeq_x64 (const char *s1, const char *s2);
global strcmpeq_x64:function

strcmpeq_x64:

	xor	rax, rax
	xor	rdx, rdx

.loop:	mov	cl, [rsi + rdx]	; fetch bytes
	mov	ch, [rdi + rdx]

	test	cx, cx		; if both bytes are null,
	jz	.eql		; strings are equal

	cmp	cl, ch		; compare bytes,
	jne	.dif		; quit if not equal

	inc	rdx		; else go for next round
	jmp	.loop

.eql:	inc	al
.dif:	ret

strcmpeq_sse2

Return 1 if zero-terminated strings a and b are equal, 0 if not. Uses SSE2-level vector instructions to do the least amount of work per byte. Bails out at the first sign that the strings must differ.

int strcmpeq_sse2 (const char *s1, const char *s2);
global strcmpeq_sse2:function

strcmpeq_sse2:

	xor	rdx, rdx
	pxor	xmm0, xmm0

.loop:	movdqu	xmm1, [rdi + rdx]	; load strings
	movdqu	xmm2, [rsi + rdx]

	movdqa	xmm6, xmm1	; copy strings
	movdqa	xmm7, xmm2

	pcmpeqb	xmm6, xmm0	; check copies for 0x00
	pcmpeqb	xmm7, xmm0

	pmovmskb  eax, xmm6	; fetch bitmasks
	pmovmskb  ecx, xmm7

	; If the masks have no set bit in common, then either neither block
	; contains a null byte, or a null byte in one faces a non-null byte in
	; the other; either way the plain equality check below gives the right
	; answer, so take the fast track:

	test	ax, cx
	jnz	.tail

	pcmpeqb   xmm1, xmm2	; compare strings directly
	pmovmskb  eax, xmm1	; fetch bitmask

	not	ax		; if the inverse of the bitmask is not zero,
	test	ax, ax		; the strings differ in at least one place
	jnz	.dif

	add	rdx, 16		; match so far; go in for the next loop round
	jmp	.loop

.tail:	; The test above means both masks are nonzero whenever we get here,
	; but check them defensively anyway, because bsf is not defined when
	; run on a zero input:

	test	ax, ax
	jz	.dif

	test	cx, cx
	jz	.dif

	; Neither mask is zero, so both vectors contain a null byte somewhere;
	; take a slow path where we compensate for the null bytes:

	bsf	ax, ax		; find bit positions of first null bytes
	bsf	cx, cx

	cmp	ax, cx		; if the nulls do not start at the same byte,
	jne	.dif		; the strings have different lengths; exit

	pcmpeqb   xmm1, xmm2	; compare strings
	pmovmskb  eax, xmm1	; fetch bitmask

	; If the inverse of the bitmask is zero, full match.
	; Must again do this check because bsf is not defined for zero input:

	not	ax		; does not affect the flags
	test	ax, ax		; test for zero
	jz	.eql

	bsf	ax, ax		; get first nonmatching position
	cmp	ax, cx		; if this is less than where the nulls start,
	jl	.dif		; the strings differ

	; The strings are equal up to the null byte; fallthrough:

.eql:	mov	ax, 1		; upper bits are known to be zero
	ret

.dif:	xor	rax, rax
	ret

strcmpeq_sse4

Compare two strings for equality using SSE4.2's pcmpistri instruction, which was virtually made for the job. It handles zero-termination and sets all the right flags for us. Compared to the SSE2 code above, this is extremely compact.

int strcmpeq_sse4 (const char *s1, const char *s2);
global strcmpeq_sse4:function

strcmpeq_sse4:

	xor        rax, rax
	xor        rdx, rdx

.loop:	movdqu     xmm1, [rdi + rdx]
	pcmpistri  xmm1, [rsi + rdx], 0x18	; EQUAL_EACH | NEGATIVE_POLARITY
	jc         .dif
	jz         .eql
	add        rdx, 16
	jmp        .loop

.eql:   inc        eax
.dif:   ret

strcasecmpeq_sse2

Compare zero-terminated strings s1 and s2 in case-insensitive fashion. Return 1 if they are equal and 0 if they are not. The case-insensitive part is a 1970s-style hack: it does no actual locale-dependent or language-aware conversion; all characters in the A–Z range are simply arithmetically transposed to characters in the a–z range. It's fast if you just care about comparing plain ASCII strings.

Something interesting about this code is that we need to load three SSE registers from memory. Because we want this code to be position-independent, we have to declare those memory locations global, and reference them through an offset from the Global Offset Table. Linking global variables into the binary is not the tidiest technique, but avoids a lot of offset lookup code.

int strcasecmpeq_sse2 (const char *s1, const char *s2);
extern _GLOBAL_OFFSET_TABLE_
global aldiff:data 16
global uppera:data 16
global upperz:data 16
global strcasecmpeq_sse2:function

align 16
aldiff:	times 16 db 'a' - 'A'	; difference between lower and uppercase
uppera:	times 16 db 'A' - 1	; we test for "greater than"; off by one
upperz:	times 16 db 'Z'

%macro lcase 1
	movdqa	  xmm6, %1	; xmm6 := xxx
	movdqa	  xmm7, %1	; xmm7 := xxx
	pcmpgtb	  xmm6, xmm3	; xmm6 := (xxx >= 'A')
	pcmpgtb	  xmm7, xmm4	; xmm7 := (xxx > 'Z')
	pandn	  xmm7, xmm6	; xmm7 := (xxx >= 'A' && !(xxx > 'Z'))
	pand	  xmm7, xmm5	; xmm7 &= aldiff
	paddusb   %1, xmm7	; xxx += xmm7
%endmacro

strcasecmpeq_sse2:

	xor	rdx, rdx
	pxor	xmm0, xmm0

	; A GOT entry holds the address of the data, not the data itself,
	; so load the pointer first and then dereference it:

	mov	rax, [rel uppera wrt ..got]
	movdqa	xmm3, [rax]
	mov	rax, [rel upperz wrt ..got]
	movdqa	xmm4, [rax]
	mov	rax, [rel aldiff wrt ..got]
	movdqa	xmm5, [rax]

.loop:	movdqu	xmm1, [rdi + rdx]	; load strings
	movdqu	xmm2, [rsi + rdx]

	movdqa	xmm6, xmm1	; copy strings
	movdqa	xmm7, xmm2

	pcmpeqb	xmm6, xmm0	; check copies for 0x00
	pcmpeqb	xmm7, xmm0

	pmovmskb  eax, xmm6	; fetch bitmasks
	pmovmskb  ecx, xmm7

	; If the masks have no set bit in common, then either neither block
	; contains a null byte, or a null byte in one faces a non-null byte in
	; the other; either way the plain equality check below gives the right
	; answer, so take the fast track:

	test	ax, cx
	jnz	.tail

	lcase	xmm1		; lowercase the strings
	lcase	xmm2

	pcmpeqb   xmm1, xmm2	; compare strings directly
	pmovmskb  eax, xmm1	; fetch bitmask

	not	ax		; if the inverse of the bitmask is not zero,
	test	ax, ax		; the strings differ in at least one place
	jnz	.dif

	add	rdx, 16		; match so far; go in for the next loop round
	jmp	.loop

.tail:	; The test above means both masks are nonzero whenever we get here,
	; but check them defensively anyway, because bsf is not defined when
	; run on a zero input:

	test	ax, ax
	jz	.dif

	test	cx, cx
	jz	.dif

	; Neither mask is zero, so both vectors contain a null byte somewhere;
	; take a slow path where we compensate for the null bytes:

	bsf	ax, ax		; find bit positions of first null bytes
	bsf	cx, cx

	cmp	ax, cx		; if the nulls do not start at the same byte,
	jne	.dif		; the strings have different lengths; exit

	lcase	xmm1		; lowercase the strings
	lcase	xmm2

	pcmpeqb   xmm1, xmm2	; compare strings
	pmovmskb  eax, xmm1	; fetch bitmask

	; If the inverse of the bitmask is zero, full match.
	; Must again do this check because bsf is not defined for zero input:

	not	ax		; does not affect the flags
	test	ax, ax		; test for zero
	jz	.eql

	bsf	ax, ax		; get first nonmatching position
	cmp	ax, cx		; if this is less than where the nulls start,
	jl	.dif		; the strings differ

	; The strings are equal up to the null byte; fallthrough:

.eql:	mov	ax, 1		; upper bits are known to be zero
	ret

.dif:	xor	rax, rax
	ret