ARM Cortex-A8 instruction timings

For various reasons, I have been investigating instruction scheduling on ARM's new-ish Cortex-A8 processor. The main public source of information on this is the Cortex-A8 TRM - at the time of writing, the latest release is r3p2, dated 07 May 2010. As previously established on the comp.sys.arm newsgroup (thread 1) (thread 2), the TRM description appears to be not only lacking much important information, but also to be quite wrong about much that it does describe.

I resolved to try to work out the correct timing tables experimentally, using the TI OMAP3530 ES3.0 chip on my revision B7 beagleboard as a reference. Because I'm a generous soul, I've decided to publish these for others to refer to. Needless to say, this is not official information from ARM, may contain errors, and ARM is free to change the chip's behaviour at any time - but hopefully it's useful anyway. Just don't hold your breath waiting for me to test all the VFP instructions as well!

In the tables below, opcodes are divided into timing classes, described using a modified regexp syntax which matches all opcodes in the class (as well as sometimes some nonexistent ones). Optional matches are indicated by {}. For these purposes, do not use UAL's inferred-Rd notation (e.g. ADD r0,r1 instead of ADD r0,r0,r1).
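As a rough illustration of how the class notation can be used mechanically, here is a minimal Python sketch that turns the mnemonic part of a class description into an ordinary regular expression. It assumes that {} is the only departure from standard regexp syntax and that the `\|` seen in the tables is simply an escaped `|`. The names `class_to_regex` and `in_class` are made up for this example, and condition-code suffixes and operands are not handled.

```python
import re

def class_to_regex(opcode_class):
    """Translate a mnemonic class such as '(MOV|MVN){S}' into a compiled regex.

    Assumption: {X} (an optional match) is the only non-standard construct,
    so it is rewritten as the non-capturing optional group (?:X)?.
    """
    body = opcode_class.replace("{", "(?:").replace("}", ")?")
    # Wrap the whole class so top-level alternatives stay grouped together.
    return re.compile("(?:" + body + ")")

def in_class(pattern, mnemonic):
    """True if the whole mnemonic is matched by the class pattern."""
    return pattern.fullmatch(mnemonic) is not None

# Mnemonic parts of two classes taken from the tables.
ALU = class_to_regex("(ADC|ADD|AND|BIC|EOR|ORR|RSB|RSC|SBC|SUB){S}|CMN|CMP|TEQ|TST")
LOAD = class_to_regex("LDR{{S}(B|H)}{T}")

print(in_class(ALU, "ADDS"))    # True
print(in_class(ALU, "CMPS"))    # False: CMP has no S variant in this class
print(in_class(LOAD, "LDRSB"))  # True
```

Only the mnemonic is translated here. The operand part of a class uses characters such as `#`, `[` and `+` literally, so it would need escaping before being treated the same way.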

Registers are listed in the order in which they appear in the class description - note this differs from the Cortex-A8 TRM, so for example Rd will normally be listed first. I find this easier to follow, and I don't have to keep checking whether Rm comes before or after Rn for each particular instruction.

Notation for timings is based on ARM's syntax:

[] on a source means it is needed if the instruction is conditional
[] on a result means it is output if the instruction is a flag-setting variant
() on a result means it is output if the instruction uses writeback
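For anyone wanting to process the tables by program, the following Python sketch parses a single timing cell. It covers the [] and () qualifiers above, the two-valued a/b entries explained in the notes under individual tables, and treats "-" or an empty cell as "no timing", which is my assumption about what those cells mean. `Timing` and `parse_cell` are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional, Tuple

class Timing(NamedTuple):
    stages: Tuple[int, ...]  # one stage number, or two for an "a/b" entry
    qualifier: str           # "always", "bracket" for [n], "paren" for (n)

def parse_cell(cell: str) -> Optional[Timing]:
    """Parse one table cell such as '2', '[2]', '(2)', '1/2' or '-'."""
    cell = cell.strip()
    if cell in ("", "-"):
        # Assumption: no register in this column, or no timing to report.
        return None
    qualifier = "always"
    if cell.startswith("[") and cell.endswith("]"):
        qualifier, cell = "bracket", cell[1:-1]
    elif cell.startswith("(") and cell.endswith(")"):
        qualifier, cell = "paren", cell[1:-1]
    stages = tuple(int(part) for part in cell.split("/"))
    return Timing(stages, qualifier)

# A tab-separated timing row: cycle count first, then flags and registers.
row = "1\t[2]\t[2]\t2\t2"
cycles, *cells = row.split("\t")
print(int(cycles), [parse_cell(c) for c in cells])
```

Whether a "bracket" entry means "only if conditional" or "only if flag-setting" depends on whether the cell is in the source or result half of the table, so the caller has to supply that context.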

Where a register is only present in some instructions in a given class (e.g. accumulator variants of multiplies, or registers which may be substituted with an immediate constant), I list timings for all possible registers. Which columns of the table are not applicable for particular instructions from that class should be unambiguous from the opcode class description, so I haven't used {} here where ARM do.

Processor flags (NZCV and GE[3:0]) act as a single item, and are listed under the "flags" column. The Q flag appears to be handled separately, and is never responsible for stalls.

As in the TRM, source timings are given relative to the first cycle of an instruction, and (apart from LDM and STM) result timings are relative to the last cycle.

When PC is used as a source and/or destination register, you can treat it as though it is always required and/or produced in E1 (or any other fixed number), overriding the timings below.

Scheduling is done statically - that is, allowing sufficient time to be able to handle each input register and flag having any possible value.

To reiterate and expand upon the TRM's description of the hazards:

Load/store resource hazard: only one load/store can be performed per cycle, though it can be in either pipeline. This applies to CLREX, LDM, LDR, PLD, STM and STR, but not SWP.
Multiply resource hazard: multiplies (including the single-cycle ones and USAD) cannot be issued to pipeline 1.
Branch resource hazard: this is a special case of the data source hazard, when you consider that the branch instructions have PC as an input and output.
Data output hazard: instructions with the same output register(s) cannot be issued in the same cycle. As a special case, this does not apply to the flags, so two flag-setting instructions can be dual-issued.
Data source hazard: sufficient stalls will be inserted before an instruction in order that by the time it reaches the pipeline stage at which it requires a given register, the preceding instruction that output to that same register has progressed beyond the pipeline stage at which it output it.
Multi-cycle instruction hazard: if an instruction takes more than one cycle, it cannot be issued to pipeline 1. Pipeline 1 cannot be used again until the last cycle of the instruction, and in some rare cases described below, not until the cycle after.
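The data source hazard can be turned into a small calculation. The Python sketch below is one reading of that rule, not a verified model of the hardware: it assumes an instruction issued in cycle t reaches stage n in cycle t+n, that a multi-cycle producer's result stage counts from its last cycle, and that the consumer must reach its source stage strictly after the producer reaches its result stage. `Instr` and `min_issue_gap` are hypothetical names, and the other hazards (resource, output, multi-cycle) are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Instr:
    name: str
    cycles: int
    # register -> stage at which it is needed, relative to the first cycle
    sources: Dict[str, int] = field(default_factory=dict)
    # register -> stage at which it is produced, relative to the last cycle
    results: Dict[str, int] = field(default_factory=dict)

def min_issue_gap(producer: Instr, consumer: Instr) -> int:
    """Fewest cycles between issuing producer and consumer (0 = same cycle).

    Assumed rule: consumer_issue + needed > producer_issue + (cycles - 1) + produced.
    """
    gap = 0
    for reg, needed in consumer.sources.items():
        if reg in producer.results:
            produced = producer.results[reg]
            gap = max(gap, (producer.cycles - 1) + produced - needed + 1)
    return gap

# Timings copied from the arithmetic and load tables.
add = Instr("ADD r0,r1,r2", 1, {"r1": 2, "r2": 2}, {"r0": 2})
use_plain = Instr("ADD r3,r0,r4", 1, {"r0": 2, "r4": 2}, {"r3": 2})
use_shifted = Instr("ADD r3,r4,r0,LSL #1", 1, {"r4": 2, "r0": 1}, {"r3": 2})
ldr = Instr("LDR r0,[r1,#4]", 1, {"r1": 1}, {"r0": 3})

print(min_issue_gap(add, use_plain))    # 1: next cycle, no stall
print(min_issue_gap(add, use_shifted))  # 2: one stall cycle
print(min_issue_gap(ldr, use_plain))    # 2: one stall cycle
```

A gap of 1 means the two instructions cannot dual-issue but need no stall when issued in consecutive cycles. The sketch does not apply to LDM and STM, whose result timings are relative to the first cycle.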

I have dropped the "E" prefixes from pipeline stages in the tables. This is because in multi-cycle instructions, it is impossible to distinguish empirically between (say) whether a register is needed by E2 in the first cycle or by E1 in the second cycle - nor is this an important distinction for scheduling. By breaking the direct link to pipeline stage, I can refer to numbers outside the range 1-5, which is useful sometimes.

"shift" is a shorthand for (LSL|LSR|ASR|ROR).

Cycles	Source										Result
Cycles	flags	r	r	r	r	r	r	r	r	r	flags	r	r	r	r	r	r	r	r	r
Arithmetic and logical instructions
(ADC\|ADD\|AND\|BIC\|EOR\|ORR\|RSB\|RSC\|SBC\|SUB){S}\|CMN\|CMP\|TEQ\|TST {r,}r,(#\|r)
1	[2]	[2]	2	2							[2]	2	-	-
(ADC\|ADD\|AND\|BIC\|EOR\|ORR\|RSB\|RSC\|SBC\|SUB){S}\|CMN\|CMP\|TEQ\|TST {r,}r,r,shift (#\|r)
1	[2]	[2]	2	1	1						[2]	2	-	-	-
(ADC\|ADD\|AND\|BIC\|EOR\|ORR\|RSB\|RSC\|SBC\|SUB){S}\|CMN\|CMP\|TEQ\|TST {r,}r,r,RRX
2	2	[3]	3	2							[2]	2	-	-
Flags are needed unconditionally as a source for ADC, SBC and RSC. ADR behaves like ADD|SUB rd,pc,# (unsurprisingly). If the destination is PC: flags are not needed until 3; a branch mispredict happens if the source is an immediate constant or if the S flag is set.
Move, NOT and bitfield instructions
(MOV\|MVN){S\|W}\|RBIT\|REV\|REV16\|REVSH\|SBFX\|UBFX r,(#\|r){,#}{,#}
1	[2]	[2]	1								[2]	1/2	-
BFC\|BFI\|MOVT r{,r},#{,#}
1	[2]	2	1								-	2	-
(MOV\|MVN){S} r,r,shift (#\|r)
1	[2]	[2]	1	1							[2]	1/2	-	-
(MOV\|MVN){S} r,r,RRX
2	2	[3]	2								[2]	1/2	-
1/2 means the output register is available at 2 or 1 if conditional or not, respectively. If the destination is PC: flags are not needed until 3; a branch mispredict happens if the source is an immediate constant or if the S flag is set.
Multiply instructions
{SM}(MUL\|MLA\|MLS){R} r,r,r{,r}
2	[2]	[3]	1	1	3/5						-	5	-	-	-
(MUL\|MLA)S r,r,r{,r}
6	[2]	[3]	1	1	3/5						2	1	-	-	-
SMUL(W\|x)y\|SMU(A\|S)D{X}\|USAD8 r,r,r
1	[2]	[2]	1	1							-	5	-	-
SMLA(W\|x)y\|SML(A\|S)D{X}\|USADA8 r,r,r,r
2	[2]	[3]	2	2	3/5						-	5	-	-	-
(S\|U)MULL r,r,r,r
3	[2]	[3]	[3]	1	1						-	4	5	-	-
(S\|U)MULLS r,r,r,r
7	[2]	[3]	[3]	1	1						2	0	1	-	-
(S\|U)MLAL\|UMAAL r,r,r,r
3	[2]	2	1	1	1						-	4	5	-	-
(S\|U)MLALS r,r,r,r
7	[2]	2	1	1	1						2	0	1	-	-
SMLALxy\|SML(A\|S)LD{X} r,r,r,r
2	[2]	1	2	1	1						-	4	5	-	-
3/5 means the register is not required until 5 if it was the result of a previous multiply instruction, otherwise it is needed at 3.
Parallel and saturating arithmetic instructions
(S\|U)(ASX\|SAX) r,r,r
1	[2]	[2]	2	1							2	2	-	-
(S\|U)(ADD\|SUB)(8\|16) r,r,r
1	[2]	[2]	2	2							2	2	-	-
(Q\|SH\|UQ\|UH)(ASX\|SAX)\|QDADD r,r,r
1	[2]	[2]	2	1							-	3	-	-
(Q\|SH\|UQ\|UH)(ADD\|SUB)(8\|16)\|QADD r,r,r
1	[2]	[2]	2	2							-	3	-	-
Integer extend instructions
(S\|U)XTA(B\|B16\|H) r,r,r
1	[2]	2	1	1							-	2	-	-
(S\|U)XT(B\|B16\|H) r,r
1	[2]	[2]	1								-	1/2	-
1/2 means the output register is available at 2 or 1 if conditional or not, respectively.
Saturation instructions
(S\|U)SAT r,#,r{,shift #}
1	[2]	2	1								-	3	-
(S\|U)SAT16 r,#,r
1	[2]	2	2								-	3	-
Count leading zeros instruction
CLZ r,r
1	[2]	[2]	2								-	2	-
Pack halfword instructions
PKH(BT\|TB) r,r,r
1	[2]	2	2	2							-	2	-	-
PKH(BT\|TB) r,r,r,shift #
1	[2]	2	2	1							-	2	-	-
Byte select instruction
SEL r,r,r
2	2	[3]	2	2							-	2	-	-
PSR transfer instructions
MRS r,CPSR
8	[2]	[5]									-	2
MRS r,SPSR
1	[2]	[2]									-	2
MSR CPSR_f,(#\|r)
1	[2]	1									2	-
MSR CPSR_other,(#\|r)
22	[2]	1									2	-
MSR SPSR_any,(#\|r)
11	[2]	1									-	-
MSR requires pipeline 0 even for the single-cycle case, and you can't dual-issue on its final cycle.
Load instructions
LDR{{S}(B\|H)}{T} r,[r],(#\|(+\|-)r{,shift #})
1	[2]	[2]	1	1							-	3	2	-
LDR{{S}(B\|H)}{T} r,[r],(+\|-)r,RRX
2	2	[3]	2	2							-	3	2	-
LDR{EX}{{S}(B\|H)}\|PLD {r,}[r,(#\|+r{,LSL #2})]{!}
1	[2]	[2]	1	1							-	3	(2)	-
LDR{{S}(B\|H)} r,[r,(+\|-)r,shift #]{!} (other than above)
2	[2]	[3]	2	1							-	3	(2)	-
LDR{{S}(B\|H)} r,[r,(+\|-)r,RRX]{!}
3	2	[4]	3	2							-	3	(2)	-
LDRD r,r,[r],(#\|(+\|-)r)
2	[2]	[2]	[3]	1	2						-	2	3	1	-
LDR{EX}D r,r,[r,(#\|+r)]{!}
2	[2]	[2]	[3]	1	1						-	2	3	(1)	-
LDRD r,r,[r,-r]{!}
3	[2]	[3]	[4]	2	1						-	2	3	(1)	-
If the destination is PC: flags are not needed until 3 (but this has no practical effect since it can't dual-issue with a flag-setting instruction anyway); the cycle count is increased by 1, which will also force the instruction to execute in pipeline 0 if it does not already; writeback still happens at the same time as before (so will be one stage earlier relative to the final cycle).
Store instructions
STR{B\|H}{T} r,[r],(#\|(+\|-)r{,shift #})
1	[2]	3	1	1							-	-	2	-
STR{B\|H}{T} r,[r],(+\|-)r,RRX
2	2	4	2	2							-	-	2	-
STR{EX}{B\|H} {r,}r,[r,(#\|+r{,LSL #2})]{!}
1	[2]	[2]	3	1	1						-	3	-	(2)	-
STR{B\|H} r,[r,(+\|-)r,shift #]{!} (other than above)
2	[2]	4	2	1							-	-	(2)	-
STR{B\|H} r,[r,(+\|-)r,RRX]{!}
3	2	5	3	2							-	-	(2)	-
STRD r,r,[r],(#\|(+\|-)r)
2	[2]	3	3	1	2						-	-	-	1	-
STR{EX}D {r,}r,r,[r,(#\|+r)]{!}
2	[2]	[3]	3	3	1	1					-	3	-	-	(1)	-
STRD r,r,[r,-r]{!}
3	[2]	4	4	2	1						-	-	-	(1)	-
Swap instructions
SWP{B} r,r,[r]
4	[2]	5	3	1							-	2	-	-
Clear-exclusive instruction
CLREX
1	[2]										-
Load multiple instructions
LDM r{!},{r}{^}
2	[2]	1	-								-	(2)	3
LDM r{!},{r,r}{^}
2	[2]	1	-	-							-	(2)	3	4
LDM r{!},{r,r,r}{^}
2	[2]	1	-	-	-						-	(2)	3	4	5
LDM r{!},{r,r,r,r}{^}
3	[2]	1	-	-	-	-					-	(2)	3	4	5	5
LDM r{!},{r,r,r,r,r}{^}
3	[2]	1	-	-	-	-	-				-	(2)	3	4	5	5	6
LDM r{!},{r,r,r,r,r,r}{^}
4	[2]	1	-	-	-	-	-	-			-	(2)	3	4	5	5	6	6
LDM r{!},{r,r,r,r,r,r,r}{^}
4	[2]	1	-	-	-	-	-	-	-		-	(2)	3	4	5	5	6	6	7
LDM r{!},{r,r,r,r,r,r,r,r}{^}
5	[2]	1	-	-	-	-	-	-	-	-	-	(2)	3	4	5	5	6	6	7	7
Result timings are given relative to the first cycle of the instruction. LDM can't dual-issue on its final cycle if the number of registers loaded is an odd number >= 3. If PC is in the output list: flags are not needed until 3 (but this has no practical effect since it can't dual-issue with a flag-setting instruction anyway); the cycle count is increased by 2. If the ^ flag is used to access USR mode registers, the instruction takes 16 cycles longer to execute, dual-issue on the final cycle is prohibited however many registers are loaded, and there can be no stalls due to register contention with instructions either before or after.
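The LDM rows follow a regular pattern, which the Python sketch below captures for 1 to 8 registers. The closed-form expressions are my own fit to the measured rows rather than anything documented, so lists longer than 8 registers are rejected instead of extrapolated. The function names are hypothetical. The combination of PC in the list with the ^ flag is not covered by the notes above, so it is rejected too.

```python
def _check(n):
    if not 1 <= n <= 8:
        raise ValueError("only 1 to 8 registers are covered by the table")

def ldm_cycles(n_regs, pc_in_list=False, user_bank=False):
    """Cycle count for LDM with n_regs registers, per the table and notes."""
    _check(n_regs)
    if pc_in_list and user_bank:
        raise ValueError("PC in the list together with ^ is not covered here")
    cycles = max(2, n_regs // 2 + 1)  # 2,2,2,3,3,4,4,5 for 1..8 registers
    if pc_in_list:
        cycles += 2
    if user_bank:
        cycles += 16
    return cycles

def ldm_result_stage(position):
    """Stage, relative to the first cycle, at which the register in the
    given 1-based position of the list becomes available."""
    _check(position)
    return 3 if position == 1 else 4 + (position - 1) // 2

def ldm_dual_issue_on_final_cycle(n_regs, user_bank=False):
    """False when the notes say the final cycle cannot dual-issue."""
    _check(n_regs)
    if user_bank:
        return False
    return not (n_regs >= 3 and n_regs % 2 == 1)

print([ldm_cycles(n) for n in range(1, 9)])
print([ldm_result_stage(i) for i in range(1, 9)])
```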
Store multiple instructions
STM r{!},{r}{^}
2	[2]	1	3								-	(2)	-
STM r{!},{r,r}{^}
2	[2]	1	3	3							-	(2)	-	-
STM r{!},{r,r,r}{^}
2	[2]	1	3	3	4						-	(2)	-	-	-
STM r{!},{r,r,r,r}{^}
3	[2]	1	3	3	4	4					-	(2)	-	-	-	-
STM r{!},{r,r,r,r,r}{^}
3	[2]	1	3	3	4	4	5				-	(2)	-	-	-	-	-
STM r{!},{r,r,r,r,r,r}{^}
4	[2]	1	3	3	4	4	5	5			-	(2)	-	-	-	-	-	-
STM r{!},{r,r,r,r,r,r,r}{^}
4	[2]	1	3	3	4	4	5	5	6		-	(2)	-	-	-	-	-	-	-
STM r{!},{r,r,r,r,r,r,r,r}{^}
5	[2]	1	3	3	4	4	5	5	6	6	-	(2)	-	-	-	-	-	-	-	-
Result timings are given relative to the first cycle of the instruction. STM can always dual-issue on its final cycle. If the ^ flag is used to access USR mode registers, the instruction takes 7 cycles longer to execute, and there can be no stalls due to register contention with instructions either before or after.
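The STM rows can be summarised in the same way. This Python sketch reproduces the cycle counts and the stages at which each stored register is needed, for 1 to 8 registers only. The formulas are a fit to the table rows, not documented behaviour, and the function names are hypothetical.

```python
def _check(n):
    if not 1 <= n <= 8:
        raise ValueError("only 1 to 8 registers are covered by the table")

def stm_cycles(n_regs, user_bank=False):
    """Cycle count for STM with n_regs registers, per the table and notes."""
    _check(n_regs)
    cycles = max(2, n_regs // 2 + 1)  # 2,2,2,3,3,4,4,5 for 1..8 registers
    if user_bank:
        cycles += 7  # ^ flag used to access USR mode registers
    return cycles

def stm_source_stage(position):
    """Stage at which the stored register in the given 1-based position
    of the list is needed (the base register is needed at 1)."""
    _check(position)
    return 3 + (position - 1) // 2

print([stm_cycles(n) for n in range(1, 9)])
print([stm_source_stage(i) for i in range(1, 9)])
```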
Branch instructions
B label (consider PC as an additional source and result register)
1	[3]	1									-	1
BX{J} r (consider PC as an additional source and result register)
1	[3]	2	1								-	-	1
BL{X} (label\|r) (consider LR and PC as additional source and result registers)
1	[3]	2	[2]	1							-	-	3	1
BXJ always takes a branch mispredict penalty. BL{X} only issues to pipeline 0 even though it's a 1-cycle instruction (so there is no practical effect of flags being required at 3). Predicted branch timings are highly variable. Tight loops in particular often have extra stalls. Branch mispredict results in 12 clear stall cycles after the instruction, with a quirk: if the branch was issued from pipeline 0 and the instruction after the stall could go in either pipeline, it goes to pipeline 1. So it's beneficial to schedule branches which are likely to mispredict in pipeline 1.
Endianness-setting instruction
SETEND BE\|LE
10
Debug and barrier instructions
DBG\|DMB\|DSB\|ISB #
1
DBG issues to pipeline 0, and pipeline 1 cannot be used during the cycle in which it executes. ISB is followed by 12 or 13 stall cycles if issued in pipeline 0 or 1 respectively, with the same quirk as branch mispredicts about issuing subsequent ALU instructions to pipeline 0 if the ISB is issued in pipeline 0. DMB and DSB are followed by 28 or 29 stall cycles if issued in pipeline 0 or 1 respectively, and also have the same quirk as above.
No-op instructions
NOP\|PLI\|SEV\|WFE\|YIELD
1	[2]										-