Barcelona Architecture: AMD on the Counterattack
by Anand Lal Shimpi on March 1, 2007 12:05 AM EST - Posted in CPUs
Sideband Stack Optimizer
Intel's very first Pentium M introduced a feature Intel called its dedicated stack manager. As its name implies, the dedicated stack manager was used to handle all x86 stack operations (i.e. push, pop, call, return). The purpose of the stack manager was to keep those stack operations, which are frequently used with function calls in code, separate from the rest of the x86 instruction stream sent to the CPU. The dedicated stack manager would handle decode and "execution" of these operations so that they wouldn't clog up the processor's decoders and execution units later in the pipeline. Intel essentially "widened" the core by offloading some operations to separate hardware.
With Barcelona, AMD is introducing a similar technology it is calling a Sideband Stack Optimizer. Stack instructions no longer go through the 3-way decoder and stack operations no longer go through the integer execution units, effectively widening Barcelona at minimal cost. The Sideband Stack Optimizer, like Intel's dedicated stack manager, features its own adder that handles all stack operations. It's a small tweak that can help overall performance, and it's simply one that made sense for AMD to implement.
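To make the "widening" argument concrete, here is a toy Python count of where the work lands with and without a dedicated stack adder. Everything in it is an assumption made for illustration: the function name count_adder_work, the sample instruction list, and the simplification that every instruction costs exactly one operation. It is a sketch of the idea, not a description of AMD's or Intel's actual hardware.

```python
# Toy model only: the instruction mix and the one-operation-per-instruction
# cost are illustrative assumptions, not AMD's or Intel's real design.
STACK_OPS = {"push", "pop", "call", "ret"}


def count_adder_work(instructions, has_stack_adder):
    """Return (integer_unit_ops, stack_adder_ops) for a list of mnemonics."""
    integer_unit_ops = 0
    stack_adder_ops = 0
    for op in instructions:
        if op in STACK_OPS and has_stack_adder:
            # The stack pointer update is handled by the side adder.
            stack_adder_ops += 1
        else:
            # Everything else (or every op, without the side adder)
            # occupies a slot in the regular integer units.
            integer_unit_ops += 1
    return integer_unit_ops, stack_adder_ops


# A hypothetical function call sequence.
function_call = ["push", "push", "call", "add", "mov", "pop", "pop", "ret"]

baseline = count_adder_work(function_call, has_stack_adder=False)
offloaded = count_adder_work(function_call, has_stack_adder=True)
print("without stack adder:", baseline)
print("with stack adder:   ", offloaded)
```

In this made-up sequence, six of the eight operations are stack operations, so the regular integer units are left with only two. Real code has a far lower share of stack operations, which is consistent with the article calling this a small tweak.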
Faster Loads
When looking at the performance of the Athlon 64 and Intel's Core 2 processors, it's easy to understand why Intel has a strong performance advantage in applications that make heavy use of SSE. But what about applications like gaming and business apps that should greatly benefit from AMD's on-die memory controller? Are the Core 2's larger L2 cache and aggressive prefetchers all that it needs to overcome AMD's on-die memory controller?
One major aspect of Intel's Core micro-architecture advantage is its ability to allow load instructions to bypass previous load and store instructions. On average, about 1/3 of all instructions in a program end up being loads, thus if you can improve load performance you can generally impact overall application performance pretty significantly. With Intel's Core micro-architecture, it's possible for loads to be re-ordered to ensure that instructions dependent on those loads get the data they need without waiting for costly memory accesses.
Core also allowed for loads to be moved ahead of stores, which was previously not allowed due to the possibility that an earlier store could invalidate the data that was just loaded. Intel figured that the possibility of a store writing over a load ends up being very small, on the order of 1 - 2%; therefore, with a reasonably accurate predictor you could correctly guess when re-ordering a load ahead of a store was possible. Intel's Core 2 based processors feature prediction logic to guess whether a store and a load share the same memory address; if the predictor determines that they won't, then it allows the load to be re-ordered ahead of the store. In the rare case that the predictor is incorrect, however, the load has to be redone at the cost of a pipeline flush (similar to what happens if the processor mispredicts a branch).
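The predict-then-verify idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the AliasPredictor class, its per-load saturating counter, the threshold of 3, and the trace below are all hypothetical and are not taken from Intel's documentation. The only thing borrowed from the article is the behavior: hoist a load when the predictor believes it will not alias, and pay for a flush when that guess is wrong.

```python
class AliasPredictor:
    """Hypothetical per-load confidence counter (not Intel's real logic).

    A load is hoisted ahead of older stores only after it has been seen
    not to alias `threshold` times in a row.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counters = {}  # load instruction address -> confidence

    def may_hoist(self, load_pc):
        return self.counters.get(load_pc, 0) >= self.threshold

    def update(self, load_pc, aliased):
        if aliased:
            # An aliasing store was observed: drop all confidence.
            self.counters[load_pc] = 0
        else:
            current = self.counters.get(load_pc, 0)
            self.counters[load_pc] = min(self.threshold, current + 1)


def run(trace, predictor):
    """Replay (load_pc, aliased) events; return (hoisted, flushes)."""
    hoisted = 0
    flushes = 0
    for load_pc, aliased in trace:
        if predictor.may_hoist(load_pc):
            hoisted += 1
            if aliased:
                # The guess was wrong: the load must be redone.
                flushes += 1
        predictor.update(load_pc, aliased)
    return hoisted, flushes


# Made-up trace: one load that never aliases, one that always does,
# then a single surprise alias on the first load.
trace = [(0x10, False)] * 10 + [(0x20, True)] * 10 + [(0x10, True)]
hoisted, flushes = run(trace, AliasPredictor())
print("hoisted:", hoisted, "flushes:", flushes)
```

The load that always aliases never gains confidence and is never hoisted, while the well-behaved load is hoisted once the counter saturates. The single flush at the end is the price of speculation, which is acceptable only because aliasing is as rare as the article describes.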
AMD's K8 architecture had no equivalent scheme for allowing the out of order execution of loads ahead of other loads and stores, so even without an on-die memory controller Intel was able to execute some memory operations faster than AMD. Barcelona fixes this problem through an almost identical scheme to what Intel implemented in its Core 2 processors.
Barcelona can now re-order loads ahead of other loads, just like Core 2 can. It can also execute loads ahead of other stores, assuming that the processor knows that the two don't share the same memory address. While Intel uses a predictor to determine whether or not the store aliases with the load, AMD takes a more conservative approach. Barcelona waits until the store address is calculated before determining whether or not the load can be processed ahead of it. By doing it this way, Barcelona is never wrong and there's no chance of a mispredict penalty. AMD's designers looked at using a predictor like Intel did but found that it offered no performance improvement on its architecture. AMD can generate up to three store addresses per clock as it has three AGUs (Address Generation Units) compared to Intel's one for stores, so it would make sense that AMD has a bit more execution power to calculate a store address before moving a load ahead of it.
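For contrast, the conservative rule described above can be written as a simple check. This is a sketch only: the function name can_hoist_load, the use of None for a store address that has not been calculated yet, and the exact-address comparison (real hardware has to consider overlapping byte ranges) are assumptions made for illustration, not details AMD has disclosed.

```python
def can_hoist_load(load_addr, older_store_addrs):
    """Decide whether a load may execute ahead of all older stores.

    older_store_addrs holds one entry per older, not yet completed store:
    an integer address once it has been calculated, or None while it is
    still unknown. (Hypothetical representation for this sketch.)
    """
    for store_addr in older_store_addrs:
        if store_addr is None:
            # Address not calculated yet: wait rather than guess.
            return False
        if store_addr == load_addr:
            # Known alias: the load must stay behind this store.
            return False
    # Every older store address is known and none of them match.
    return True


print(can_hoist_load(0x100, [0x200, 0x300]))  # all known, no match
print(can_hoist_load(0x100, [0x200, None]))   # one address still unknown
print(can_hoist_load(0x100, [0x100]))         # known alias
```

Because the function never returns True while any older store address is unknown, there is nothing to mispredict and nothing to flush. The cost is that the load waits whenever address generation is slow, which is why the number of available AGUs matters for this approach.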
The out of order load execution improvements to Barcelona should prove to be even more effective than they were in Core 2 given that AMD previously couldn't do any reordering of loads before the Int/FP schedulers whereas Core Duo could do a limited amount of re-ordering.
83 Comments
agaelebe - Friday, March 2, 2007 - link
Wow! A lot of discussion in here. And, by the way, very interesting article.
I'm a software engineer from Brazil and I'm planning to change my PC this year.
I've been using AMD processors since the K6.
Today I have an XP Mobile 2500+ (@ 2.2GHz), 1GB RAM, 200GB and an AGP 6600GT.
My PC is not very slow, but I'm thinking of going dual core to speed things up (office applications, web development and some games).
I can run some of the newest games, but not at high graphics settings.
I expect that my PC can run C&C 3 (I already ran the demo at 1024 on medium; it had some crashes, although it didn't run slowly).
So, today I'm thinking of 3 options:
1) Stay with this computer and wait until AMD launches its new architecture (I intend to go with an average-priced Kuma)
2) Go with Intel Core 2 Duo (e6300 or e6400). They're not expensive, and for games I can easily overclock and gain more performance.
3) Buy a good AM2 board and a cheap Athlon X2 (3600), wait for the new AMD processors, and then change only the processor.
Here in Brazil the taxes are too high, so I'm planning on buying a PC with these specs:
- CORE 2 Duo e6300/6400 or X2 3600/3800
- mid-tier motherboard (
- 2 x 1gb DDR 800 4-4-4-12
- 2 x 250 gb
- X1950pro 256 or 512
- 500watts power
So the prices are below:
e6300 box US$ 300 (same price for a X2 4200+ box)
X2 3800 box US$ 220
motherboard: US$ 220
ram: US$ 400
video: US$ 450
DVD: US$ 70
case: US$ 150
HDs : US$ 250
Power: US$ 180
So I plan to spend about 2000 dollars (sadly, I could buy this same PC in the US for half the price).
My new PC should not use too much power, so that I can leave it turned on all day long (max 150 watts at idle without the monitor); otherwise I'll keep my old computer turned on just for downloading stuff.
So, if someone has an opinion, I'd like to "hear" it. You can suggest other options too, or comment on the specs I'm choosing now.
I had a Pentium 75 and after that only AMD CPUs... Should I now surrender to the Core 2 Duo, or believe that AMD can really beat it by the end of 2008?
And thanks for the cooperation and patience.
Zebo - Saturday, March 3, 2007 - link
Athlon 64 AM2s aren't exactly slow, so if you're an AMD fan just get one, like a 3800+ or 3600+, and overclock it. It will be at least 4x faster than what you have now and will accept the K8L Agena core later. It will be cheaper than C2D by about $50 USD, and you'll also pay little for a GeForce 6100 motherboard, which is only $50 USD. Overall, expect the AM2 system to be about $100 USD cheaper.
Keep in mind that C2D is 20% faster clock for clock in most apps, so it's not exactly a quantum leap getting a C2D. The gap gets a lot larger when overclocking, since C2Ds overclock higher: 3.2GHz is common on air vs. only 2.8GHz for AM2. So, at the end of the day, a C2D setup is able to be about 40% faster across most benchmarks. That is getting significant, and it's why enthusiasts are buying C2Ds.
agaelebe - Friday, March 2, 2007 - link
And, as always, sorry for the errors and the not so good writing...
Kiijibari - Thursday, March 1, 2007 - link
Hi,
never heard of that before, does anybody know what it is?
So far I see 2 pad areas on the die photo, therefore I assume that there would also be 2 interfaces, e.g. x8 PCIe like Sun uses?
bb
Kiijibari
mino - Friday, March 2, 2007 - link
It should be some management/coordination stuff (can't remember the name of that bus).
Every northbridge and CPU has that.
davecason - Thursday, March 1, 2007 - link
Anand,
Great article! I know it took a lot of time and I wanted you to know I really appreciate your effort. It is the kind of article that keeps me coming back to your site.
-Dave
yyrkoon - Thursday, March 1, 2007 - link
Page 5, paragraph 4 'pretty significantly'. Well is it, or is it not it ?
http://www.wikihow.com/Avoid-Colloquial-%28Informa...
Aside from my gripe concerning writing style, good article :)
trisweb2 - Friday, March 16, 2007 - link
Usually we criticize writing style based on a whole experience... obviously Anand is one of the best technical review writers on the Internet; if you bother to read his articles more fully perhaps you'd realize that. The colloquial writing sometimes brings it to a more personal level that a reader can better relate to and understand -- it works especially well in this case, where it's a future design, we really don't know how it's going to perform. That he can guess and say "pretty significantly" tells me he understands the uncertainty of the situation, and the language communicates that fact perfectly well. It would be more confusing if he said it would impact performance "significantly" as you want him to, as that would imply that he was more certain than he might actually have been.
Masters are allowed to bend the rules, and Anand is one, so lay off.
yyrkoon - Thursday, March 1, 2007 - link
*Is it, or is it not*
/me hangs head in shame
baronzemo78 - Thursday, March 1, 2007 - link
Any rough guess as to how Barcelona will compete with Core2 in gaming? Many articles have shown how Core2 gets you a slight FPS boost in games that aren't graphics card limited. I'm curious how Barcelona will fit in with the overall picture of DX10 cards like G80 and R600.