benchmark¶
op
naive C with openmp¶
for for for
unroll, first try¶
h
register allocation¶
kernels
unroll, second try¶
simd
neon intrinsics¶
optional
naive neon assembly with pld¶
asm
pipeline optimize, first try¶
more register load mla
pipeline optimize, second try¶
interleave load mla
pipeline optimize, third try¶
loop tail
usual practice, load/save¶
233
usual practice, unroll¶
233
usual practice, save register¶
233