Description
Each Haswell CPU core has eight special-purpose execution units (ports 0 to 7) that can each execute some part of an instruction in parallel: for example, calculating an address, loading an operand from memory, or performing arithmetic.
I realized today that pmu-tools offers some visibility into CPU performance counters that track how much work each execution unit is doing:
$ ocperf.py stat -e cycles,uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_2,uops_executed_port.port_3,uops_executed_port.port_4,uops_executed_port.port_5,uops_executed_port.port_6,uops_executed_port.port_7 head -c 10000000 /dev/urandom > /dev/null
Performance counter stats for 'head -c 10000000 /dev/urandom':
2,065,534,404 cycles [44.69%]
705,149,766 uops_executed_port_port_0 [44.93%]
728,047,007 uops_executed_port_port_1 [44.94%]
405,801,626 uops_executed_port_port_2 [44.94%]
441,800,214 uops_executed_port_port_3 [44.50%]
289,902,540 uops_executed_port_port_4 [44.06%]
733,201,801 uops_executed_port_port_5 [44.05%]
786,927,002 uops_executed_port_port_6 [44.64%]
174,929,604 uops_executed_port_port_7 [44.44%]
0.908605822 seconds time elapsed
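To make the numbers above easier to compare, here is a minimal sketch that parses this kind of `perf stat` text and divides each port's count by the cycle count. The function names (`parse_perf_stat`, `uops_per_cycle`) are made up for illustration and are not part of pmu-tools. The sample text is the output pasted above, and the regex assumes the plain "count, event name" line layout shown there.

```python
import re

# The output from the run above, pasted verbatim.
SAMPLE = """\
Performance counter stats for 'head -c 10000000 /dev/urandom':
2,065,534,404 cycles [44.69%]
705,149,766 uops_executed_port_port_0 [44.93%]
728,047,007 uops_executed_port_port_1 [44.94%]
405,801,626 uops_executed_port_port_2 [44.94%]
441,800,214 uops_executed_port_port_3 [44.50%]
289,902,540 uops_executed_port_port_4 [44.06%]
733,201,801 uops_executed_port_port_5 [44.05%]
786,927,002 uops_executed_port_port_6 [44.64%]
174,929,604 uops_executed_port_port_7 [44.44%]
0.908605822 seconds time elapsed
"""

# A counter line is a comma-grouped integer, whitespace, then the event name.
# The "seconds time elapsed" line has a decimal point, so it does not match.
COUNTER_LINE = re.compile(r"^\s*([\d,]+)\s+(\S+)")


def parse_perf_stat(text):
    """Return {event_name: count} for every counter line in the text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


def uops_per_cycle(counters):
    """Divide every non-cycle counter by the cycle count."""
    cycles = counters["cycles"]
    return {
        name: count / cycles
        for name, count in counters.items()
        if name != "cycles"
    }


if __name__ == "__main__":
    counters = parse_perf_stat(SAMPLE)
    for name, ratio in sorted(uops_per_cycle(counters).items()):
        print(f"{name}: {ratio:.2f} uops/cycle")
```

A ratio near 1.0 would mean a port is issuing a uop almost every cycle. The bracketed percentages are ignored here; as far as I understand they show how long each counter was actually scheduled, since nine events were being multiplexed, so the counts are scaled estimates.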
This seems rather nifty. Recently I have needed more visibility into the CPU to debug difficult performance problems, such as collisions caused by cache associativity.
I would love to get better at reading performance counters. Tips welcome! ("Ten CPU Performance Counters You Won't Believe You Ever Lived Without?").
Activity
Title changed from "Haswell execution units and ocperf" to "Execution units and performance counters"
lukego commented on Jul 19, 2015
The output above makes sense. The workload is getting pseudo-random numbers from /dev/urandom and the busy execution units are 0,1,5,6 which are exactly the ones that can perform integer arithmetic. That is gratifying :-).
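As a rough check of that reading, the sketch below groups the per-port counts from the run above by role and computes what share of all executed uops landed on the integer-capable ports. The role table is my assumption based on the usual description of Haswell (ports 0, 1, 5, 6 with integer ALUs, 2 and 3 for loads and address generation, 4 for store data, 7 for store addresses). The names `PORT_ROLES` and `uops_by_role` are purely illustrative.

```python
# Per-port uop counts copied from the perf output in this issue.
PORT_UOPS = {
    0: 705_149_766,
    1: 728_047_007,
    2: 405_801_626,
    3: 441_800_214,
    4: 289_902_540,
    5: 733_201_801,
    6: 786_927_002,
    7: 174_929_604,
}

# Assumed Haswell port roles (an assumption, not something the counters report).
PORT_ROLES = {
    0: "integer ALU",
    1: "integer ALU",
    5: "integer ALU",
    6: "integer ALU",
    2: "load / address",
    3: "load / address",
    4: "store data",
    7: "store address",
}


def uops_by_role(port_uops, port_roles):
    """Sum the uop counts of all ports that share the same role."""
    totals = {}
    for port, count in port_uops.items():
        role = port_roles[port]
        totals[role] = totals.get(role, 0) + count
    return totals


if __name__ == "__main__":
    totals = uops_by_role(PORT_UOPS, PORT_ROLES)
    all_uops = sum(PORT_UOPS.values())
    for role, count in sorted(totals.items(), key=lambda item: -item[1]):
        print(f"{role:15s} {count:>13,d}  {count / all_uops:6.1%}")
```

If the integer ALU group takes the clear majority of uops, that matches the idea that the random number generation is mostly integer arithmetic with comparatively little memory traffic.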