向量化

Requirements

loop得有可以（ determinable (at run time)）确定的循环次数
loop里不能包含（inline不算）过程调用
loop里不能包含分支
迭代次数也得多
迭代之间最好也没有依赖性
理论上就算有依赖性也可以向量化但是不实用

现代向量处理器

依赖于大量的硬件单元而非流水线
有SIMD指令
大量依赖缓存，所以对齐很重要

几个x86 vec指令

MMX instructions

适用于 32,16 or 8-bit 整数类型的计算指令，有8 x 64-bit 的寄存器： MM0 … MM7

SSE instructions

支持：64 & 32-bit floating point，64, 32,16 or 8-bit integer types，有8 x 128-bit 寄存器： xmm0 … xmm7 (xmm8 .. xmm16 in 64-bit mode)

AVX instructions

ymm0 … ymm15 寄存器相比于SSE拓展到256bits

AVX2 AVX-512

AVX-512只在 Intel KNL and Skylake上有，更强力512-bits寄存器，zmm0 … zmm15，他分四部分：Foundation（扩展32，64指令），Conflict Detection Instructions（检测要vec的loop的冲突，尽可能让更多的loop能vec），Exponential and Reciprocal Instructions（KNL里的，能支持超越运算，咱也不懂。。）， Prefetch Instructions （KNL里的预取指支持）

AMD和ARM的

AMD：

256bits的AVX2已经支持，但是512的在下一代

ARM：

大多数没有vec支持，通过 NEON instruction set有一些现代HPC处理器支持128bits，但是有SVE: Scalable Vector Extensions这个东西能给芯片制造商提供支持不同长度vec的指令（A64FX 512-bit SVE ）

对齐

AVX在32字节对齐的地址上工作的最好

所以一般需要告诉编译器对齐，并且连续的去访问内存

这里说的都是intel上的，两步走：

分配

动态内存

_mm_malloc, _mm_free
float *a = _mm_malloc(1024*sizeof(float),64);

静态内存

float a[1024] __attribute__((aligned(64)));

多维数据对齐

使用_mm_malloc分配多维数组的话，可能出现问题：


float* a = _mm_malloc(16*15*sizeof(float), 64);

for(i=0;i<16;i++){
    #pragma vector aligned
        for(j=0;j<15;j++){
            a[i*15+j]++;
        }
}

比如这里实际上intel这个函数做了一些其他操作，尤其是在对齐和向量化的前提下，内层循环是15层，编译器知道内层要vectorize，对齐是64字节也就是8个float，内层的15次循环要做512bit向量化的话就会出问题的，因为512bit对应64字节，内层16次循环的话这个就比较完美。抛开这种当作二维用的情况，一维情况下从内存分配器那里就特殊处理了末尾，比如这个16*15的刚好是8的倍数不存在末尾，但是你这样用循环就存在了末尾：7个作为一个末尾，可是编译器又不知道你这样搞，他还是当你单层循环，向量化的程序就会出现问题，可能产生错误的结果

通知编译器

#pragma vector aligned 放在loop前，告诉编译器里面data是对齐的
__assume_aligned(a, 64); 放在循环内都可以，告诉编译器a这个array是对齐的
__assume(n1%16==0); 可能还需要告诉编译器这个loop scalars的属性，来个例子就知道了：


__declspec(align(64)) float X[1000], X2[1000];

void foo(float * restrict a, int n, int n1, int n2) {
  __assume_aligned(a, 64);
  __assume(n1%16==0);
  __assume(n2%16==0);

  for(int i=0;i<n;i++) { // Compiler vectorizes loop with all aligned accesses
    X[i] += a[i] + a[i+n1] + a[i-n1]+ a[i+n2] + a[i-n2];
  }
  
  for(int i=0;i<n;i++) { // Compiler vectorizes loop with all aligned accesses
    X2[i] += X[i]*a[i];
  }
}

openmp的指令： #pragma omp simd aligned(a:64)

这样每个线程就可以去均分并操作整齐的数据


__declspec(align(64)) float X[1000], X2[1000];

void foo(float * restrict a, int n, int n1, int n2) {
  int i;

  __assume(n1%16==0);
  __assume(n2%16==0);

#pragma omp simd aligned(X:64,a:64)
  for(i=0;i<n;i++) {
    X[i] += a[i] + a[i+n1] + a[i-n1]+ a[i+n2] + a[i-n2];
  }

#pragma omp simd aligned(a:64)
  for(i=0;i<n;i++) {
    X2[i] += X[i]*a[i];
  }

}

检查编译器是否向量化程序

CCE: -hlist=a

GNU: -fdump-tree-vect-all=

Intel: -opt-report3

AMD/Clang:-Rpass-analysis=.*

使用perf来找线索

充分配合编译器向量化

尽量消除迭代之间的依赖性（尽量多的使用循环的index而非一个外部变量来在loop中运算）：
将分支移出循环，if
#pragma ivdep 告诉编译器循环是独立的无依赖的
restrict 告诉编译器这个变量或者数组是唯一的
注意对齐
确保循环足够大，并确定循环大小

Gathers and Scatters

数据的访问模式不是均步的，他也有可能可以向量化。

首先说明，这功能是基于处理器指令的，处理器没实现的话，就没法用。KNL有专门的Gathers and Scatters指令，但是还是比对齐的数据损耗更大。

以网上找的一个人写的SIMD的库的教程来说明，我觉得有实际例子会比较好：

UME::SIMD

三种情况；

跨步访问，一次跨3个元素访问下一个
索引访问，以新计算的偏移量访问数组元素
改组访问，以不同的顺序访问元素

跨步访问

首先明确一点，gather是load的超集。

但是load效率高，速度快，所以还是多使用load来将连续内存搞到寄存器里。如果可以通过修改数据结构和代码的方式做到那就尽量去做，少使用gather。


float a[LARGE_DATA_SIZE];
uint32_t STRIDE = 8;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;
 
  // Note that we have to scale the loop index.
  int offset = i*STRIDE;
 
  // 'load' the data to vec.
  vec.gather(&a[offset], STRIDE);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[offset], STRIDE);
}

索引访问

float a[LARGE_DATA_SIZE];
int indices[PROBLEM_SIZE];
uint32_t STRIDE = 4;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;
 
  // Here we are using precomputed indices,
  // but they can be computed on-the-fly if necessary.
  SIMDVec<uint32_t, 8> indices_vec(&indices[i];
 
  // 'load' the data to vec.
  vec.gather(&a[0], indices_vec);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[0], indices_vec);
}

Masking and Blending

Mask 可以被认为是是一个描述分支的掩码数组，通过它和特殊的指令：blend，可以将内含分支的loop向量化掉：

before:

for (i = 0; i < N; i++) {
    if (Trigger[i] < Val) {
        A[i] = B[i] + 0.5;
    }else{
        A[i] = B[i] - 0.5;
    }
}

after:

for (i = 0; i < N; i+=16) {
    TmpB= B[i:i+15];
    Mask = Trigger[i:i+15] < Val
    TmpA1 = TmpB + 0.5;
    TmpA2 = TmpB - 0.5;
    TmpA = BLEND Mask, TmpA1, TmpA2
    A[i:i+15] = TmpA;
}

看代码就知道啥意思了

再记一下arm的一些支持

ARM的向量化指令功能叫做SVE:

SVE is vector length independent
- Allows hardware to be created and used between 128-bits and 2048-bits
- Current processors using it have 512-bit vectors
Programming approach allows executable to scale dynamically to available vector length
Designed to help improve auto-vectorization
- Instructions to support speculative vectorization to allow uncounted loops to be vectorized.
- Instructions to make it easier to vectorise outer loops, working with dependencies
Gather-load and scatter-store
- Loads a single vector register from non-contiguous memory locations.
Per-lane predication
- Operate on individual lanes of vector controlled by of a governing predicate register.
Predicate-driven loop control and management
- Eliminate loop heads and tails and other overhead by processing partial vectors.
Vector partitioning for software-managed speculation
- First-fault vector load instructions allow vector accesses to cross into invalid pages.
Extended floating-point and bitwise horizontal reductions
- In-order or tree-based floating-point sum, trade-off repeatability vs performance.