clflush to invalidate a cache line via a C function

I'm trying to use clflush to manually evict a cache line in order to determine the cache and line sizes. I couldn't find any guide on how to use that instruction. All I see is code that uses higher-level functions for that purpose.

There is a kernel function void clflush_cache_range(void *vaddr, unsigned int size), but I still don't know what to include in my code and how to use it. I don't know what size means in that function.

More than that, how can I be sure that the line is evicted, so that I can verify the correctness of my code?

UPDATE:

Here is some initial code for what I'm trying to do.

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main()
    {
      int array[ 100 ];
      /* will bring array in the cache */
      for ( int i = 0; i < 100; i++ )
        array[ i ] = i;

      /* FLUSH A LINE */
      /* each element is 4 bytes */
      /* assuming that cache line size is 64 bytes */
      /* array[0] till array[15] is flushed */
      /* even if line size is less than 64 bytes */
      /* we are sure that array[0] has been flushed */
      _mm_clflush( &array[ 0 ] );

      int tm = 0;
      register uint64_t time1, time2, time3;

      time1 = __rdtscp( &tm ); /* set timer */
      time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
      printf( "miss latency = %lu \n", time2 );

      time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
      printf( "hit latency = %lu \n", time3 );
      return 0;
    }

Before running the code, I would like to manually verify that it is correct. Am I on the right track? Did I use _mm_clflush correctly?

UPDATE:

Thanks to Peter's comment, I fixed the code as follows

      time1 = __rdtscp( &tm ); /* set timer */
      time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
      printf( "miss latency = %lu \n", time2 );

      time1 = __rdtscp( &tm ); /* set timer */
      time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
      printf( "hit latency = %lu \n", time1 );

Running the code several times, I get the following output

    $ ./flush
    miss latency = 238
    hit latency = 168
    $ ./flush
    miss latency = 154
    hit latency = 140
    $ ./flush
    miss latency = 252
    hit latency = 140
    $ ./flush
    miss latency = 266
    hit latency = 252

The first run seems reasonable, but the second run looks odd. Each time I run the code from the command line, the array is initialized with the values and then I explicitly evict the first line.

UPDATE4:

I tried Hadi-Brais's code and here are the outputs

    naderan@webshub:~$ ./flush3
    address = 0x7ffec7a92220
    array[ 0 ] = 0
    miss section latency = 378
    array[ 0 ] = 0
    hit section latency = 175
    overhead latency = 161
    Measured L1 hit latency = 14 TSC cycles
    Measured main memory latency = 217 TSC cycles
    naderan@webshub:~$ ./flush3
    address = 0x7ffedbe0af40
    array[ 0 ] = 0
    miss section latency = 392
    array[ 0 ] = 0
    hit section latency = 231
    overhead latency = 168
    Measured L1 hit latency = 63 TSC cycles
    Measured main memory latency = 224 TSC cycles
    naderan@webshub:~$ ./flush3
    address = 0x7ffead7fdc90
    array[ 0 ] = 0
    miss section latency = 399
    array[ 0 ] = 0
    hit section latency = 161
    overhead latency = 147
    Measured L1 hit latency = 14 TSC cycles
    Measured main memory latency = 252 TSC cycles
    naderan@webshub:~$ ./flush3
    address = 0x7ffe51a77310
    array[ 0 ] = 0
    miss section latency = 364
    array[ 0 ] = 0
    hit section latency = 182
    overhead latency = 161
    Measured L1 hit latency = 21 TSC cycles
    Measured main memory latency = 203 TSC cycles

Slightly different latencies are acceptable. However, a hit latency of 63, compared to 21 and 14, also shows up.

UPDATE5:

When I checked Ubuntu, no power-saving feature is enabled. Maybe frequency scaling is disabled in the BIOS, or there is a misconfiguration

    $ cat /proc/cpuinfo | grep -E "(model|MHz)"
    model : 79
    model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    cpu MHz : 2097.571
    model : 79
    model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    cpu MHz : 2097.571
    $ lscpu | grep MHz
    CPU MHz: 2097.571

Anyway, that means the frequency is set to its maximum value, which is what I need to care about. Running the code multiple times, I see some different values. Are these normal?

    $ taskset -c 0 ./flush3
    address = 0x7ffe30c57dd0
    array[ 0 ] = 0
    miss section latency = 602
    array[ 0 ] = 0
    hit section latency = 161
    overhead latency = 147
    Measured L1 hit latency = 14 TSC cycles
    Measured main memory latency = 455 TSC cycles
    $ taskset -c 0 ./flush3
    address = 0x7ffd16932fd0
    array[ 0 ] = 0
    miss section latency = 399
    array[ 0 ] = 0
    hit section latency = 168
    overhead latency = 147
    Measured L1 hit latency = 21 TSC cycles
    Measured main memory latency = 252 TSC cycles
    $ taskset -c 0 ./flush3
    address = 0x7ffeafb96580
    array[ 0 ] = 0
    miss section latency = 364
    array[ 0 ] = 0
    hit section latency = 161
    overhead latency = 140
    Measured L1 hit latency = 21 TSC cycles
    Measured main memory latency = 224 TSC cycles
    $ taskset -c 0 ./flush3
    address = 0x7ffe58291de0
    array[ 0 ] = 0
    miss section latency = 357
    array[ 0 ] = 0
    hit section latency = 168
    overhead latency = 140
    Measured L1 hit latency = 28 TSC cycles
    Measured main memory latency = 217 TSC cycles
    $ taskset -c 0 ./flush3
    address = 0x7fffa76d20b0
    array[ 0 ] = 0
    miss section latency = 371
    array[ 0 ] = 0
    hit section latency = 161
    overhead latency = 147
    Measured L1 hit latency = 14 TSC cycles
    Measured main memory latency = 224 TSC cycles
    $ taskset -c 0 ./flush3
    address = 0x7ffdec791580
    array[ 0 ] = 0
    miss section latency = 357
    array[ 0 ] = 0
    hit section latency = 189
    overhead latency = 147
    Measured L1 hit latency = 42 TSC cycles
    Measured main memory latency = 210 TSC cycles

You have multiple errors in the code that may lead to the nonsensical measurements you're seeing. I've fixed the errors, and you can find the explanation in the comments below.

    /* compile with gcc at optimization level -O3 */
    /* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */
    /* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
    /* you can optionally use a p-state scaling driver other than intel_pstate to get more reproducible results */
    /* This code still needs improvement to obtain more accurate measurements,
       and a lot of effort is required to do that, argh! */
    /* Specifically, there is no single constant latency for the L1 because of
       the way it's designed, and more so for main memory. */
    /* Things such as virtual addresses, physical addresses, TLB contents,
       code addresses, and interrupts may have an impact that needs to be
       investigated */
    /* The instructions that GCC puts unnecessarily in the timed section are annoying AF */
    /* This code is written to run on Intel processors! */

    #include <stdint.h>
    #include <x86intrin.h>
    #include <stdio.h>

    int main()
    {
      int array[ 100 ];

      /* this is optional */
      /* will bring array in the cache */
      for ( int i = 0; i < 100; i++ )
        array[ i ] = i;

      printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

      _mm_mfence();                 /* prevent clflush from being reordered by the CPU or the compiler in this direction */

      /* flush the line containing the element */
      _mm_clflush( &array[ 0 ] );

      //unsigned int aux;
      uint64_t time1, time2, msl, hsl, osl; /* initial values don't matter */

      /* rdtscp is not suitable for measuring very small sections of code because
         the write to its parameter occurs after sampling the TSC and it impacts
         compiler optimizations and code gen, thereby perturbing the measurement */

      _mm_mfence();                 /* this properly orders both clflush and rdtsc */
      _mm_lfence();                 /* mfence and lfence must be in this order + compiler barrier for rdtsc */
      time1 = __rdtsc();            /* set timer */
      _mm_lfence();                 /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
      int temp = array[ 0 ];        /* array[0] is a cache miss */
      /* measuring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
      /* no need for mfence because there are no stores in between */
      _mm_lfence();                 /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load */
      time2 = __rdtsc();
      _mm_lfence();                 /* serialize __rdtsc with respect to trailing instructions */
      msl = time2 - time1;

      printf( "array[ 0 ] = %i \n", temp );             /* prevent the compiler from optimizing the load */
      printf( "miss section latency = %lu \n", msl );   /* the latency of everything in between the two rdtsc */

      _mm_mfence();                 /* this properly orders both clflush and rdtsc */
      _mm_lfence();                 /* mfence and lfence must be in this order + compiler barrier for rdtsc */
      time1 = __rdtsc();            /* set timer */
      _mm_lfence();                 /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
      temp = array[ 0 ];            /* array[0] is a cache hit as long as the OS, a hardware prefetcher, or a speculative access to the L1D or lower level inclusive caches doesn't evict it */
      /* measuring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
      /* no need for mfence because there are no stores in between */
      _mm_lfence();                 /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load */
      time2 = __rdtsc();
      _mm_lfence();                 /* serialize __rdtsc with respect to trailing instructions */
      hsl = time2 - time1;

      printf( "array[ 0 ] = %i \n", temp );            /* prevent the compiler from optimizing the load */
      printf( "hit section latency = %lu \n", hsl );   /* the latency of everything in between the two rdtsc */

      _mm_mfence();                 /* this properly orders both clflush and rdtsc */
      _mm_lfence();                 /* mfence and lfence must be in this order + compiler barrier for rdtsc */
      time1 = __rdtsc();            /* set timer */
      _mm_lfence();                 /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
      /* no need for mfence because there are no stores in between */
      _mm_lfence();                 /* mfence and lfence must be in this order + compiler barrier for rdtsc */
      time2 = __rdtsc();
      _mm_lfence();                 /* serialize __rdtsc with respect to trailing instructions */
      osl = time2 - time1;

      printf( "overhead latency = %lu \n", osl ); /* the latency of everything in between the two rdtsc */

      printf( "Measured L1 hit latency = %lu TSC cycles\n", hsl - osl );      /* hsl is always larger than osl */
      printf( "Measured main memory latency = %lu TSC cycles\n", msl - osl ); /* msl is always larger than osl and hsl */

      return 0;
    }
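
The comments in the code above say that the measurement still needs improvement, and the question's runs show noticeable run-to-run noise. One common way to reduce that noise is to repeat each measurement many times and report the median. The following is only a rough sketch of that idea, not part of the original answer: the names `timed_load`, `median_load_time` and `TRIALS` are made up for illustration, and the fencing simply copies the pattern used above.

```c
#include <stdint.h>
#include <stdlib.h>      /* qsort */
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence, _mm_lfence */

#define TRIALS 1001      /* hypothetical sample count; odd, so the median is one sample */

/* Time a single load of *p in TSC ticks, fenced like the code above. */
static uint64_t timed_load(const volatile int *p)
{
    _mm_mfence();                 /* let any earlier clflush complete */
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    int v = *p;                   /* volatile read, so the load is not optimized away */
    _mm_lfence();
    uint64_t t2 = __rdtsc();
    _mm_lfence();
    (void)v;
    return t2 - t1;
}

/* qsort comparator for uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median load time over TRIALS runs.
   If flush is nonzero, the line holding *p is evicted before every load. */
static uint64_t median_load_time(const volatile int *p, int flush)
{
    static uint64_t samples[TRIALS];

    for (int i = 0; i < TRIALS; i++) {
        if (flush)
            _mm_clflush((const void *)p);
        samples[i] = timed_load(p);
    }
    qsort(samples, TRIALS, sizeof samples[0], cmp_u64);
    return samples[TRIALS / 2];
}
```

Calling `median_load_time(&array[0], 1)` and `median_load_time(&array[0], 0)` and comparing the two results gives a rough flushed-versus-cached picture. Both numbers still include the fence and rdtsc overhead, which would have to be measured and subtracted separately as in the code above.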

Highly recommended: Memory latency measurement with time stamp counter.

Related: How can I create a Spectre gadget in practice?

You know you can query the line size with cpuid, right? Do that if you really want to find it programmatically. (Otherwise, assume it's 64 bytes, because it is on everything after PIII.)
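
As a rough sketch of such a cpuid query, assuming GCC or clang and their `<cpuid.h>` helper `__get_cpuid` (MSVC exposes cpuid differently): CPUID leaf 1 reports the CLFLUSH line size in EBX bits 15:8, in units of 8 bytes. The function name `clflush_line_size` is made up for this example.

```c
#include <cpuid.h>   /* GCC/clang wrapper around the CPUID instruction */

/* Returns the CLFLUSH line size in bytes, or 0 if leaf 1 is unavailable
   or the CPU does not advertise CLFLUSH support. */
static unsigned clflush_line_size(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(edx & (1u << 19)))          /* CPUID.01H:EDX bit 19 = CLFSH feature flag */
        return 0;
    return ((ebx >> 8) & 0xffu) * 8u; /* EBX[15:8] is the line size in 8-byte units */
}
```

This reports the granularity that clflush operates on. The sizes of the individual cache levels would need other cpuid leaves, which this sketch does not cover.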

But sure, if you want to use clflush or clflushopt from C for whatever reason, use void _mm_clflush(void const *p) or void _mm_clflushopt(void const *p), from #include <immintrin.h>. (See Intel's insn set ref manual entry for clflush or clflushopt.)

GCC, clang, ICC, and MSVC all support Intel's intrinsics.


You could also have found this by searching Intel's intrinsics guide for clflush to find the definitions of the intrinsics for that instruction.
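
As a sketch of how _mm_clflush might be used on more than one line, here is a user-space helper in the spirit of the kernel's clflush_cache_range, which is kernel-only and takes its size in bytes. The names `flush_range` and `LINE_SIZE` are made up for this example, and the 64-byte line size is an assumption.

```c
#include <immintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumption: 64-byte cache lines */

/* Flush every cache line that overlaps [addr, addr + size). size is in bytes. */
static void flush_range(const void *addr, size_t size)
{
    if (size == 0)
        return;

    /* Round the start down to a line boundary so a partial first line is covered. */
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)addr + size;

    _mm_mfence();                       /* order earlier stores before the flushes */
    for (; p < end; p += LINE_SIZE)
        _mm_clflush((const void *)p);
    _mm_mfence();                       /* make sure the flushes are done before continuing */
}
```

For the array in the question, `flush_range(array, sizeof array)` would evict the whole array rather than just the line holding `array[0]`. The data itself is unchanged, since clflush writes dirty lines back to memory before invalidating them.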

See also https://stackoverflow.com/tags/x86/info for more links to guides, docs, and reference manuals.


More than that, how can I be sure that the line is evicted, so that I can verify the correctness of my code?

Look at the compiler's asm output, or single-step it in a debugger. If/when clflush executes, that cache line is evicted at that point in your program.