

- Nehalem Design Philosophy
- Enhanced Processor Core
  - Performance Features
  - Simultaneous Multi-Threading
- New Platform
  - New Cache Hierarchy
  - New Platform Architecture
- Performance Acceleration
  - Virtualization
  - New Instructions






















## **Macrofusion Recap**

Introduced in Core 2

A TEST/CMP instruction followed by a conditional branch is treated as a single instruction

- Decode as one instruction
- Execute as one instruction
- Retire as one instruction

#### Higher *performance*

- Improves throughput
- Reduces execution latency

#### Improved *power efficiency*

- Less processing required to accomplish the same work

Intel Developer FORUM




## **Nehalem Macrofusion**

Goal: Identify more macrofusion opportunities for increased *performance* and *power efficiency* 

Supports all the Core 2 cases, plus:

- CMP+Jcc macrofusion added for the following branch conditions
  - JL/JNGE
  - JGE/JNL
  - JLE/JNG
  - JG/JNLE

Core 2 only supports macrofusion in 32-bit mode

- Nehalem supports macrofusion in both 32-bit and 64-bit modes

Increased macrofusion benefit on Nehalem
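
The added branch conditions (JL, JGE, JLE, JG) are the signed comparisons, so they correspond to ordinary signed integer tests in C. The sketch below is only an illustration: `count_below` is a hypothetical helper, not something from the presentation, and whether a compiler actually emits an adjacent CMP + Jcc pair (rather than a conditional move or vectorized code) depends on the compiler and its flags.

```c
#include <stddef.h>

/* Hypothetical helper: counts the elements strictly below `limit`.
 * The signed test `values[i] < limit` is the kind of comparison a compiler
 * may lower to CMP followed by JL or JGE, one of the pairs listed above. */
size_t count_below(const int *values, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (values[i] < limit) {
            count++;
        }
    }
    return count;
}
```

Inspecting the generated assembly (for example with the compiler's assembly listing option) is the only way to confirm that a given build produces a fusible pair.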







### **Branch Prediction Reminder**

Goal: Keep powerful compute engine fed

Options:

- Stall pipeline while determining branch direction/target
- Predict branch direction/target and correct if wrong

Minimize amount of time wasted correcting from incorrect branch predictions

- Performance:
  - Through higher branch prediction accuracy
  - Through faster correction when prediction is wrong
- **Power efficiency:** Minimize number of speculative/incorrect micro-ops that are executed

Continued focus on branch prediction improvements




## **L2 Branch Predictor**

Problem: Software with a large code footprint not able to fit well in existing branch predictors

- Example: Database applications

Solution: Use multi-level branch prediction scheme

Benefits:

- Higher *performance* through improved branch prediction accuracy
- Greater *power efficiency* through less mis-speculation



## Renamed Return Stack Buffer (RSB)

#### Instruction Reminder

- CALL: Entry into functions
- RET: Return from functions

#### Classical Solution

- Return Stack Buffer (RSB) used to predict RET
- RSB can be corrupted by speculative path

#### The **Renamed RSB**

- No RET mispredicts in the common case




# **Execution Engine**

Start with powerful Core 2 execution engine

- Dynamic 4-wide Execution
- Advanced Digital Media Boost
  - 128-bit wide SSE
- HD Boost (Penryn)
  - SSE4.1 instructions
- Super Shuffler (Penryn)

#### Add Nehalem enhancements

- Additional parallelism for higher performance







# **Enhanced Memory Subsystem**

#### Start with great Core 2 Features

- Memory Disambiguation
- Hardware Prefetchers
- Advanced Smart Cache

#### New Nehalem Features

- New TLB Hierarchy
- Fast 16-Byte unaligned accesses
- Faster Synchronization Primitives




# **New TLB Hierarchy**

Problem: Applications continue to grow in data size

Need to increase TLB size to keep the pace for performance

Nehalem adds new low-latency unified 2<sup>nd</sup> level TLB

| TLB                                     | # of Entries |
|-----------------------------------------|--------------|
| **1<sup>st</sup> Level Instruction TLBs** |              |
| Small Page (4k)                         | 128          |
| Large Page (2M/4M)                      | 7 per thread |
| **1<sup>st</sup> Level Data TLBs**      |              |
| Small Page (4k)                         | 64           |
| Large Page (2M/4M)                      | 32           |
| **New 2<sup>nd</sup> Level Unified TLB** |              |
| Small Page Only                         | 512          |
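
As a rough illustration (this arithmetic is not on the slide), the memory each TLB can map without a miss, often called its reach, is the entry count times the page size. Using the entry counts from the table, and assuming 2 MB pages for the large-page row:

$$
\begin{aligned}
\text{reach} &= \text{entries} \times \text{page size} \\
\text{1st level data TLB, 4 KB pages:} \quad & 64 \times 4\ \text{KB} = 256\ \text{KB} \\
\text{1st level data TLB, 2 MB pages:} \quad & 32 \times 2\ \text{MB} = 64\ \text{MB} \\
\text{2nd level unified TLB, 4 KB pages:} \quad & 512 \times 4\ \text{KB} = 2\ \text{MB}
\end{aligned}
$$

With small pages, the second-level TLB therefore covers roughly eight times the data footprint of the first-level data TLB.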



## **Fast Unaligned Cache Accesses**

Two flavors of 16-byte SSE loads/stores exist

- Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary
- Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement

#### Prior to Nehalem

- Optimized for Aligned instructions
- Unaligned instructions slower, lower throughput -- Even for aligned accesses!
  - Required multiple uops (not energy efficient)
- Compilers would largely avoid unaligned load
  - 2-instruction sequence (MOVSD+MOVHPD) was faster

#### Nehalem optimizes Unaligned instructions

- Same speed/throughput as Aligned instructions on aligned accesses
- Optimizations for making accesses that cross 64-byte boundaries fast
- Lower latency/higher throughput than Core 2
- Aligned instructions remain fast

No reason to use aligned instructions on Nehalem!

Benefits:

- Compiler can now use unaligned instructions without fear
- *Higher performance* on key media algorithms
- More *energy efficient* than prior implementations
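
In C, the unaligned instructions are usually reached through the SSE intrinsics in `<xmmintrin.h>`: `_mm_loadu_ps` corresponds to MOVUPS and `_mm_load_ps` to MOVAPS. The following is a minimal sketch, assuming an x86 target with SSE; `sum_unaligned` is a hypothetical helper written for this example.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: sums n floats from a pointer that carries no
 * 16-byte alignment guarantee. Every vector load uses the unaligned form. */
float sum_unaligned(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Four floats (16 bytes) per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }

    /* Reduce the four lanes, then add any leftover elements. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Because the same load is used whether or not the pointer happens to be aligned, the caller does not need a separate aligned code path, which is the simplification the slide describes.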




# **Faster Synchronization Primitives**

Multi-threaded software becoming more prevalent

Scalability of multi-thread applications can be limited by synchronization

Synchronization primitives: LOCK prefix, XCHG

Reduce synchronization latency for legacy software



Greater thread scalability with Nehalem
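
For context, these primitives are what portable atomics typically compile to on x86. The sketch below uses C11 `<stdatomic.h>`; the names `demo_spinlock`, `demo_lock`, `demo_unlock` and `demo_increment` are hypothetical, and the instruction mapping noted in the comments is the usual behavior of mainstream compilers, not a guarantee.

```c
#include <stdatomic.h>

/* Hypothetical spinlock built on an atomic exchange. */
typedef struct {
    atomic_int locked;  /* 0 = free, 1 = held */
} demo_spinlock;

static atomic_long shared_counter = 0;

/* atomic_exchange is typically emitted as XCHG on x86. */
void demo_lock(demo_spinlock *lock)
{
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) != 0) {
        /* spin until the previous value was 0 (lock was free) */
    }
}

void demo_unlock(demo_spinlock *lock)
{
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

/* atomic_fetch_add is typically emitted as LOCK XADD on x86.
 * Returns the counter value after the increment. */
long demo_increment(void)
{
    return atomic_fetch_add(&shared_counter, 1) + 1;
}
```

Code of this shape gains from lower LOCK/XCHG latency without being recompiled, which matches the slide's point about legacy software.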



## Simultaneous Multi-Threading (SMT)

#### **SMT**

- Run 2 threads at the same time per core

Take advantage of 4-wide execution engine

- Keep it fed with multiple threads
- Hide latency of a single thread

#### Most *power efficient* performance feature

- Very low die area cost
- Can provide significant performance benefit depending on application
- Much more efficient than adding an entire core

#### Nehalem advantages

- Larger caches
- Massive memory BW



Simultaneous multi-threading enhances performance and energy efficiency

## **SMT Implementation Details**

Multiple policies possible for implementation of SMT

Replicated – Duplicate state for SMT

- Register state
- Renamed RSB
- Large page ITLB

Partitioned – Statically allocated between threads

- Key buffers: Load, store, Reorder
- Small page ITLB

Competitively shared - Depends on thread's dynamic behavior

- Reservation station
- Caches
- Data TLBs, 2<sup>nd</sup> level TLB

#### Unaware

- Execution units


- Nehalem Design Philosophy
- Enhanced Processor Core
  - Performance Features
  - Simultaneous Multi-Threading

- Feeding the Engine
  - New Memory Hierarchy
  - New Platform Architecture
- Performance Acceleration
  - Virtualization
  - New Instructions




# **Feeding the Execution Engine**

Powerful 4-wide dynamic execution engine

Need to keep providing fuel to the execution engine

Nehalem Goals

- Low latency to retrieve data
  - Keep execution engine fed w/o stalling
- High data bandwidth
  - Handle requests from multiple cores/threads seamlessly
- Scalability
  - Design for increasing core counts

Combination of great *cache hierarchy* and *new platform* 

Nehalem designed to feed the execution engine























## **Nehalem-EP Platform Architecture**

**Integrated Memory Controller** 

- 3 DDR3 channels per socket
- Massive memory bandwidth
- Memory Bandwidth scales with # of processors
- Very *low memory latency*

QuickPath Interconnect (QPI)

- New point-to-point interconnect
- Socket to socket connections
- Socket to chipset connections
- Build scalable solutions



Significant performance leap from new platform




## **QuickPath Interconnect**

Nehalem introduces new QuickPath Interconnect (QPI)

High bandwidth, low latency point to point interconnect

Up to 6.4 GT/sec initially

- 6.4 GT/sec -> 12.8 GB/sec per direction
- Bi-directional link -> 25.6 GB/sec per link
- Future implementations at even higher speeds
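
The figures above follow from the link width. The slide does not state the width, so the 2 bytes of data per transfer in each direction used here is an assumption brought in from the QPI link definition (16 data bits per direction):

$$
\begin{aligned}
6.4\ \text{GT/s} \times 2\ \text{bytes/transfer} &= 12.8\ \text{GB/s per direction} \\
12.8\ \text{GB/s} \times 2\ \text{directions} &= 25.6\ \text{GB/s per link}
\end{aligned}
$$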

Highly *scalable* for systems with varying # of sockets






## **Integrated Memory Controller (IMC)**

Memory controller optimized per market segment

#### Initial Nehalem products

- Native DDR3 IMC
- Up to 3 channels per socket
- Speeds up to DDR3-1333
- Massive memory bandwidth
- Designed for low latency
- Support RDIMM and UDIMM
- RAS Features

#### Future products

- Scalability
  - Vary # of memory channels
  - Increase memory speeds
  - Buffered and Non-Buffered solutions
- Market specific needs
  - Higher memory capacity
  - Integrated graphics



Significant performance through new IMC





# **IMC Memory Bandwidth (BW)**

3 memory channels per socket

Up to DDR3-1333 at launch

- Massive memory BW
- HEDT: 32 GB/sec peak
- 2S server: 64 GB/sec peak
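
The peak figures can be reproduced with a short calculation. This is a sketch under stated assumptions: each DDR3 channel is taken to move 8 bytes (64 bits) per transfer, DDR3-1333 is taken as 1333.33 mega-transfers per second, and 1 GB means 10^9 bytes. The function `peak_dram_bw_gbs` is a hypothetical helper, not part of any Intel tool.

```c
/* Hypothetical helper: theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).
 * Real, sustained bandwidth is lower than this peak. */
double peak_dram_bw_gbs(int sockets, int channels_per_socket,
                        double mega_transfers_per_sec, int bytes_per_transfer)
{
    double bytes_per_sec = (double)sockets * channels_per_socket
                         * mega_transfers_per_sec * 1e6 * bytes_per_transfer;
    return bytes_per_sec / 1e9;
}
```

Under those assumptions, `peak_dram_bw_gbs(1, 3, 1333.33, 8)` comes to about 32 GB/sec for a single socket and `peak_dram_bw_gbs(2, 3, 1333.33, 8)` to about 64 GB/sec for a 2S server, in line with the slide.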

#### Scalability

- Design IMC and core to take advantage of BW
- Allow performance to scale with cores
  - Core enhancements
    - Support more cache misses per
    - Aggressive hardware prefetching w/ throttling enhancements
  - Example IMC Features
    - ✓ Independent memory channels
    - ✓ Aggressive Request Reordering



Massive memory BW provides performance and scalability



## Non-Uniform Memory Access (NUMA)

#### FSB architecture

- All memory in one location

#### Starting with Nehalem

- Memory located in multiple places

Latency to memory dependent on location

Local memory

- Highest BW
- Lowest latency

Remote Memory

- Higher latency



Ensure software is NUMA-optimized for best performance




# **Local Memory Access**

CPU0 requests cache line X, not present in any CPU0 cache

#### Step 1:

- CPU0 requests data from its DRAM
- CPU0 snoops CPU1 to check if data is present

#### Step 2:

- DRAM returns data
- CPU1 returns snoop response

Local memory latency is the maximum latency of the two responses

Nehalem optimized to keep key latencies close to each other











#### **Virtualization**

To get best virtualized performance

- Have best native performance
- Reduce:
  - # of transitions into/out of virtual machine
  - Latency of transitions

Nehalem virtualization features

- Reduced latency for transitions
- Virtual Processor ID (VPID) to reduce effective cost of transitions
- Extended Page Table (EPT) to reduce # of transitions

Great virtualization performance w/ Nehalem


## **Latency of Virtualization Transitions**

#### Microarchitectural

- Huge latency reduction generation over generation
- Nehalem continues the trend

#### Architectural

- Virtual Processor ID (VPID) added in Nehalem
- Removes need to flush TLBs on transitions



Higher Virtualization Performance Through Lower Transition Latencies




## **Extended Page Tables (EPT) Motivation**



- A VMM needs to protect physical memory
  - Multiple Guest OSs share the same physical memory
  - Protections are implemented through page-table virtualization
- Page table virtualization accounts for a significant portion of virtualization overheads
  - VM Exits / Entries
- The goal of EPT is to reduce these overheads

















fastest CRC32C software algorithm by a big margin

## **Tools Support of New Instructions**

- Intel Compiler 10.x supports the new instructions
  - SSE4.2 supported via intrinsics
  - Inline assembly supported on both IA-32 and Intel64 targets
  - Necessary to include required header files in order to access intrinsics
    - `<tmmintrin.h>` for Supplemental SSE3
    - `<smmintrin.h>` for SSE4.1
    - `<nmmintrin.h>` for SSE4.2
- Intel Library Support
  - XML Parser Library using string instructions will enter beta in Spring '08, with product release in Fall '08
  - IPP is investigating possible usages of new instructions
- Microsoft Visual Studio 2008 VC++
  - SSE4.2 supported via intrinsics
  - Inline assembly supported on IA-32 only
  - Necessary to include required header files in order to access intrinsics
    - `<tmmintrin.h>` for Supplemental SSE3
    - `<smmintrin.h>` for SSE4.1
    - `<nmmintrin.h>` for SSE4.2
  - VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions
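
As an illustration of the intrinsics route, the sketch below computes a CRC32C checksum with the SSE4.2 `_mm_crc32_u8` intrinsic from `<nmmintrin.h>`, next to a bitwise software version for comparison. The function names `crc32c_hw` and `crc32c_sw` and the `SSE42_TARGET` macro are hypothetical, and the initial value and final inversion follow the common CRC-32C convention rather than anything stated in the slides. Some compilers need SSE4.2 enabled on the command line instead of through the function attribute used here.

```c
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_crc32_u8 */

/* On GCC and Clang, enable SSE4.2 for this one function only. */
#if defined(__GNUC__)
#define SSE42_TARGET __attribute__((target("sse4.2")))
#else
#define SSE42_TARGET
#endif

/* Hardware CRC32C: one CRC32 instruction per input byte. */
SSE42_TARGET
uint32_t crc32c_hw(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;           /* conventional initial value */
    for (size_t i = 0; i < len; i++) {
        crc = _mm_crc32_u8(crc, p[i]);
    }
    return crc ^ 0xFFFFFFFFu;             /* conventional final inversion */
}

/* Bitwise software reference using the reflected Castagnoli polynomial. */
uint32_t crc32c_sw(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial when the low bit was set. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions should return the same value for any input. Wider variants such as `_mm_crc32_u32` process more bytes per instruction and are the usual next step for throughput.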





# **Software Optimization Guidelines**

Most optimizations for Core microarchitecture still hold

Examples of new optimization guidelines:

- 16-byte unaligned loads/stores
- Enhanced macrofusion rules
- NUMA optimizations

Nehalem SW Optimization Guide will be published

Intel Compiler will support settings for Nehalem optimizations




## **Summary**

Nehalem - The 45nm Tock

Designed for

- Power Efficiency
- Scalability
- Performance

**Enhanced Processor Core** 

Brand New Platform Architecture

Extending ISA Leadership



# Additional sources of information on this topic:

NGMS002-Upcoming Intel® 64 Instruction Set Architecture Extensions

- April 3 16:00 - 17:50; Auditorium, 3rd floor

NGMC001-Next Generation Intel® Microarchitecture (Nehalem) and New Instruction Extensions - Chalk Talk

- April 3 17:50 - 18:30; Auditorium, 3rd floor

Demos in the Advance Technology Zone on the 3rd floor

More web based info: http://www.intel.com/technology/architecture-silicon/next-gen/index.htm




## **Session Presentations - PDFs**

The PDF of this Session presentation is available from our IDF Content Catalog:

https://intel.wingateweb.com/SHchina/catalog/controller/catalog

These can also be found from links on www.intel.com/idf





## **Legal Disclaimer**

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL® TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

[ADD ANY CODE NAMES FROM FOILS] and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Intel, Intel Inside, [ADD ANY OTHER INTEL TRADEMARKS OR LOGO IN FOILS] and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2008 Intel Corporation.



#### **Risk Factors**


This presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today's date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Factors that could cause demand to be different from Intel's expectations include changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel's and competitors' products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel's results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Additionally, Intel is in the process of transitioning to its next generation of products on 45 nm process technology, and there could be execution issues associated with these changes, including product defects and errata along with lower than anticipated manufacturing yields.
Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel's response to such actions; Intel's ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient components from suppliers to meet demand. The

