

# Linux User Group of Davis

# Marc J. Miller

Strategic Alliance Manager, AMD

October 7, 2003

# x86 in High Performance Computing



- The Six System Challenges

- #6: Watt density
- #5: The I/O infrastructure
- #4: Addressable memory
- #3: Memory bandwidth

#### #2: Cost per processing node:

#### #1: Backward compatible to x86-32:

- There is an enormous investment in IA32 across all market segments. In many applications, porting code is not an option.
  - Provide a solution that is not only 100% backward compatible, but also designed to run IA32 code faster than any existing 32-bit architecture.
  - Provide a gradual and controlled migration path for porting to AMD64.
  - Make the total cost of ownership minimal.

# Advanced AMD Opteron<sup>™</sup> Processor System Architecture



- Integrated memory controller
  - Low latency memory access speeds processing
- Separate Memory and I/O pathways
  - Eliminates I/O and memory bus competition
- Each processor has more memory & I/O paths
  - Memory and I/O bandwidth scales well
- Modular glueless logic using HyperTransport<sup>™</sup> technology bus
  - Fewer chips and lower cost implementation

# **Typical System**



- Must access memory through Northbridge
  - Longer latency memory access
- Memory and I/O access on the same bus
  - I/O and Memory compete for bandwidth
- Memory or I/O paths originate from Northbridge
  - Bandwidth does not scale well with more CPUs
- System logic uses more chips and many buses
  - Systems cost more to design, build and test

## AMD Opteron<sup>™</sup> System



- Scalable memory and I/O bandwidth
  - Up to 8 processors without glue logic
  - Each processor adds more memory
  - Each processor adds additional HyperTransport<sup>™</sup> technology buses for more PCI-X and other I/O bridges
  - Fewer chips required

# **Typical MP System**



- System scalability limited by Northbridge
  - Maximum of 4 processors
  - Processors compete for FSB bandwidth
  - Memory size and bandwidth are limited
  - Maximum of 3 PCI-X bridges
  - Many more chips required

# **Typical Multiprocessing System**





## Intel Xeon – Light Load





## Intel Xeon – Heavy Load





# AMD Opteron<sup>™</sup> Processor 4P 800 Series Processor-based Server



Idle Latencies to First Data

- 1P System: <80ns
- 0-Hop in DP System: <80ns
- 0-Hop in 4P System: ~100ns
- 1-Hop in MP System: <115ns
- 2-Hop in MP System: <150ns
- 3-Hop in MP System: <190ns
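As a rough illustration of what these hop counts imply, the sketch below averages the idle latencies over all memory in a 4P system. It assumes a square topology in which each processor sees one local node (0 hops), two 1-hop neighbours, and one 2-hop node, with accesses spread uniformly across nodes. The topology, the uniform spread, and the use of the slide's upper bounds as point values are assumptions made for illustration, not measurements.

```python
# Idle latencies (ns) by hop count, taken from the figures above.
# The "<" bounds are treated as point values, which overstates latency.
LATENCY_NS_BY_HOPS = {0: 100, 1: 115, 2: 150, 3: 190}

# Assumed 4P square topology: number of nodes at each hop count from one CPU.
NODES_BY_HOPS_4P = {0: 1, 1: 2, 2: 1}


def average_idle_latency_ns(nodes_by_hops, latency_by_hops):
    """Average latency when accesses are spread uniformly over all nodes."""
    total_nodes = sum(nodes_by_hops.values())
    weighted = sum(latency_by_hops[hops] * count
                   for hops, count in nodes_by_hops.items())
    return weighted / total_nodes


print(average_idle_latency_ns(NODES_BY_HOPS_4P, LATENCY_NS_BY_HOPS))  # 120.0
```

A NUMA-aware operating system that keeps data on the local node would pull the average toward the 0-hop figure.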

## AMD Opteron<sup>™</sup> Heavy Load





# AMD64 Technology

An AMD64 PC can run both 32- and 64-bit operating systems



# **AMD64 Technology**




# AMD64 Technical Overview







- A natural evolution of the current 32-bit architecture
- Similar to the 16- to 32-bit conversion of the 386
- Designed to retain compatibility with the current installed base of x86 operating systems and applications
- Low-risk and low-cost path to high-performance computing

- There are many other 64-bit RISC solutions
- Each is a unique instruction set, all of which are incompatible with today's 32-bit code
- All require unique OS and applications

## AMD64 Technology


### Building a Bridge from the 32- to the 64-bit World

- Leverages the initial success of AMD Athlon<sup>™</sup> MP processor
- Adds 64-bit capabilities to the world's highest performing 32-bit core for 2P and 4P servers
- Current 32-bit applications will work on both 32-bit and 64-bit operating systems
- Doesn't require special hardware or investment in a proprietary infrastructure
- Developing a solid ecosystem of motherboards, operating systems, development tools, and device drivers



# AMD64 Computing Strategy (2)

- AMD64 Architecture:
  - 64-bit integer registers
  - 64-bit Virtual Address
  - 52-bit Physical Address
  - Sixteen 64-bit integer regs
  - Sixteen 128-bit SSE regs
  - SSE2 Instruction Set
    - Double precision scalar and vector operations
    - 16x8-, 8x16-way vector packed integer operations
    - SSE1 already added with AMD Athlon™ MP Processor
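The register and address widths above translate into lane counts and address-space sizes by simple arithmetic. The sketch below reads "16x8-, 8x16-way" as sixteen 8-bit or eight 16-bit integers packed into one 128-bit SSE register. That reading and the helper names are assumptions for illustration.

```python
SSE_REGISTER_BITS = 128  # width of each SSE register listed above


def packed_lanes(element_bits, register_bits=SSE_REGISTER_BITS):
    """Number of elements of a given width that fit in one register."""
    return register_bits // element_bits


def address_space_bytes(address_bits):
    """Size in bytes of a flat address space with the given address width."""
    return 2 ** address_bits


print(packed_lanes(8))    # 16 packed 8-bit integers
print(packed_lanes(16))   # 8 packed 16-bit integers
print(packed_lanes(64))   # 2 double-precision values
print(address_space_bytes(52) // 2 ** 50, "PiB physical")  # 4 PiB physical
print(address_space_bytes(64) // 2 ** 60, "EiB virtual")   # 16 EiB virtual
```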



# AMD64 Processor Overview

### Performance

- High-bandwidth integrated memory controller scales with processor frequency and number of processors
- 1MB L2 cache

### Compatibility

- Approximately 10,000 legacy applications at time of launch

### Scalability

- Can reduce costs for high-end systems
- Can remove I/O bottlenecks
- Easy multiprocessor scaling
- 16-bit HyperTransport<sup>™</sup> provides 6.4GB/s peak aggregate bandwidth





# AMD Opteron<sup>™</sup> Processor Integrated Memory Controller



- Designed to run memory controller at processor speeds - not FSB speeds
- Designed to dramatically decrease latency
  - AMD Athlon<sup>™</sup> processor 1P platforms achieve ~160 ns best-case latency
  - AMD64 architecture is designed to achieve ~80 ns best-case latency
  - Latency generally decreases further as the core frequency increases
- Designed to add intelligence without decreasing performance
- Designed to support multiple DDR memories
  - DDR200, DDR266, and DDR333
  - Registered DIMMs
  - Future processor cores planned to support DDR-II, etc.

# AMD64 Processors And Target Systems

#### AMD Opteron<sup>™</sup> Processor 200 Series:

- 2-way server & workstation processor
- 144-bit DDR interface per CPU: 200, 266, 333 MHz
- Three 16-bit HyperTransport<sup>™</sup> technology links per CPU. Typically, two are used to connect to another CPU and I/O

#### AMD Athlon<sup>™</sup> 64 Processor

- Performance desktop processor
- 72-bit DDR interface: 200, 266, 333, 400 MHz
- One 16-bit HyperTransport technology link

NOTE: The AMD Athlon 64 and AMD Opteron are processors based on AMD64 technology

#### AMD Opteron Processor 800 Series:

- Up to 8-way server processor
- 144-bit DDR interface per CPU: 200, 266, 333 MHz
- Three 16-bit HyperTransport technology links per CPU. Typically, all three are used to connect to other CPUs & I/O
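The DDR interface widths and speeds listed for these processors imply peak memory bandwidth figures. The sketch below assumes that the 144-bit and 72-bit interfaces carry 8 ECC bits per 64 data bits (so 128 and 64 data bits respectively), and it treats the nominal DDR speed grade as the transfer rate in MT/s. Both are assumptions, and the results are theoretical peaks, not measured throughput.

```python
def data_bits(interface_bits):
    """Data width of an ECC interface, assuming 8 ECC bits per 64 data bits."""
    return interface_bits * 64 // 72


def peak_memory_bandwidth_gbs(interface_bits, mega_transfers_per_s, cpus=1):
    """Theoretical peak in GB/s (10^9 bytes/s), summed over all CPUs."""
    bytes_per_transfer = data_bits(interface_bits) // 8
    return bytes_per_transfer * mega_transfers_per_s * cpus / 1000


for rate in (200, 266, 333):
    per_cpu = peak_memory_bandwidth_gbs(144, rate)
    print(f"Opteron, DDR{rate}: {per_cpu:.2f} GB/s per CPU")

# Each processor brings its own memory controller, so the peak adds up per CPU.
print(f"4P Opteron, DDR333: {peak_memory_bandwidth_gbs(144, 333, cpus=4):.2f} GB/s")
print(f"Athlon 64, DDR400: {peak_memory_bandwidth_gbs(72, 400):.2f} GB/s")
```

The `cpus` multiplier reflects the scaling argument made earlier in the deck: memory bandwidth grows with processor count because every processor has its own controller.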

16-bit HyperTransport links run at 1600MT/s, providing 6.4GB/s peak aggregate bandwidth

- In test after test, AMD64 technology beats the competition
  - 32-bit performance is superior to other 32-bit solutions on the market
  - 64-bit performance is superior in terms of performance per dollar spent. In most cases, it is more cost-effective to buy multiple AMD64 machines than a single competing 64-bit machine with the same performance.
- See <a href="http://www.amd.com">http://www.amd.com</a> for the most recent data

| Vendor       | Operating System                                            | Type            |
|--------------|-------------------------------------------------------------|-----------------|
| SuSE         | SuSE Linux Enterprise Server (SLES) 8                       | 32-bit & 64-bit |
| SuSE         | SuSE Linux 9.0 Personal & Professional                      | 32-bit & 64-bit |
|              | UnitedLinux Version 1.0 code base by UnitedLinux Consortium | 32-bit & 64-bit |
|              | Conectiva Linux Enterprise Edition                          | 32-bit          |
|              | Linux AMD64 kernel patches (www.x86-64.org)                 | 64-bit          |
| MandrakeSoft | Mandrake Linux 9.2 (coming soon)                            | 32-bit & 64-bit |
| MandrakeSoft | Mandrake Linux Corporate Server 2.1                         | 32-bit & 64-bit |
|              | NetBSD                                                      | 32-bit & 64-bit |
|              | Red Hat 9.0                                                 | 32-bit          |
|              | Red Hat Enterprise Linux 3 (coming soon)                    | 32-bit & 64-bit |
|              | Scyld Beowulf Cluster Operating System                      | 32-bit          |
|              | Solaris 9 for x86                                           | 32-bit          |
| Turbolinux   | Turbolinux 8 for AMD64                                      | 32-bit & 64-bit |
| Microsoft    | Windows® 2000 Server                                        | 32-bit          |
| Microsoft    | Windows Server 2003                                         | 32-bit & 64-bit |

## AMD Athlon<sup>™</sup> 64 Processor Technical Overview

## AMD64 Technology AMD64 Means... Dynamic Scaling

Large-scale simulations and games can be interacted with down to the lowest component level.

From the largest view down to the smallest bolt, designers can maintain accurate physics at all times.

(Figure: Grade 5 Plow Bolt)










## AMD64 Technology AMD64 Means... Additional Registers

### **Real-Time Special Effects**



### Higher Level of Realism





## HyperTransport<sup>™</sup> Technology Interface



### HyperTransport<sup>™</sup> Technology Interface Attributes

- Unidirectional
- DDR-like performance (800MHz = 1600MT/sec)
- 4 bytes wide in total (2 bytes in each direction): 6.4GB/sec aggregate bandwidth




#### **Reliability and Stability**

- ECC Protection
  - L1 data cache
  - L2 tags and data
  - Main memory DRAM (optional)
  - Hardware Scrubbing
- Thermal Protection
  - ThermTrip
    - Shuts down processor without motherboard intervention
  - Thermal Diode
    - Works with motherboard circuitry to monitor CPU temperature and drive thermal control hardware (e.g., temperature-controlled fans)

## AMD Athlon<sup>™</sup> 64 Infrastructure Support

| Chipsets                  |                                                                                                 |                                                                                      |                                                                                                          |                                                                                 |
|---------------------------|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| AMD<br>Launch<br>Chipsets | AMD-8151™<br>Graphics Tunnel<br>8x AGP                                                          | AMD-8111™<br>I/O Hub<br>ATA 133, USB 2.0<br>10/100 Ethernet                          |                                                                                                          |                                                                                 |
| Discrete<br>Graphics      | <b>VIA</b><br><b>K8T400M+VT8235</b><br>8x AGP, ATA 133<br>8x V-Link, USB 2.0<br>10/100 Ethernet | <b>SiS</b><br><b>755 + 963</b><br>8x AGP, ATA 133<br>USB 2.0, 1394A                  | NVIDIA<br>CrushK8<br>8x AGP, ATA 133<br>Two 10/100 Ethernet<br>USB 2.0                                   | NVIDIA<br>CrushK8S<br>8x AGP, S-ATA, RAID<br>Gigabit Ethernet<br>USB 2.0, 1394A |
| Integrated<br>Graphics    | VIA<br>K8M400+VT8235<br>8x AGP + integrated gfx<br>ATA 133, USB 2.0<br>10/100 Ethernet          | <b>SiS</b><br><b>760 + 963</b><br>8x AGP + Ultra256 gfx<br>ATA 133<br>USB 2.0, 1394A | NVIDIA<br>CrushK8G<br>8x AGP + GeForce4i gfx<br>ATA 133, SATA<br>Two 10/100 Ethernet<br>USB 2.0, 802.11b |                                                                                 |
| I/O Hub                   | <b>ALi<br/>1563</b><br>ATA 133, USB 2.0<br>10/100 Ethernet                                      |                                                                                      |                                                                                                          |                                                                                 |

Please contact the respective third-party vendors directly for the latest schedules and information.



# **Chipset Interfaces**





# HyperTransport<sup>™</sup> Technology and Server Chipset Highlights



## HyperTransport<sup>™</sup> Technology Basics

- HyperTransport<sup>™</sup> Technology buses have two unidirectional point-to-point links:
  - The links can be 2-, 4-, 8-, 16-, or 32-bits wide in each direction
  - HyperTransport<sup>™</sup> links have a data rate up to 1.6 Gigabits/second per pin-pair (800 MHz clock)
  - Total Aggregate Bandwidth = 12.8 Gbytes/second at 32 bits wide
  - AMD Opteron<sup>™</sup> supports three 16-bit HyperTransport<sup>™</sup> links
    - Provides 19.2 Gbytes/second of total data bandwidth
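The bandwidth figures above follow directly from link width, clock rate, and double pumping. The sketch below reproduces them. The function name is hypothetical, and "aggregate" is taken to mean both directions summed, which is an interpretation that matches the slide's numbers.

```python
def ht_link_bandwidth_gbs(width_bits, clock_mhz=800, directions=2):
    """Peak bandwidth of one HyperTransport link in GB/s (10^9 bytes/s).

    width_bits is the link width in each direction. Data is double pumped,
    so there are two transfers per clock cycle.
    """
    transfers_per_s = clock_mhz * 1e6 * 2
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * transfers_per_s * directions / 1e9


print(f"16-bit link: {ht_link_bandwidth_gbs(16):.1f} GB/s aggregate")
print(f"32-bit link: {ht_link_bandwidth_gbs(32):.1f} GB/s aggregate")
print(f"Three 16-bit links: {3 * ht_link_bandwidth_gbs(16):.1f} GB/s total")
```

Passing `directions=1` gives the per-direction figure, which is half the aggregate.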



# HyperTransport<sup>™</sup> Technology Clock and Control Signals



- Asynchronous clock forwarding
  - One clock is forwarded for each eight bits in each direction
  - Clocks are double pumped; an 800 MHz clock is used for a 1600 Mbit/s data rate

- Control line distinguishes command packets
  - De-asserted during data packets
- In-band system management & legacy signal transport
  - Eliminates sideband wires; interrupts use messages instead of wires
- Embedded code in back channel messages used for flow control
  - Code indicates how many buffers are available for each virtual channel
- Commands and interrupts are realized as a 32-bit command word
- Address and data are preceded by a 64-bit header
  - 6-bit type field: Write, Read, Read Response, Fence & Flush
  - 26-bit command-specific field
  - 32-bit address field (command specific: Byte or DWORD)



At 800MHz DDR it takes:

- 1.25ns to send a request (32 bits)
- 22.5ns to send a 64B block

A PCI-X write of one 64-byte block takes ~290ns + PCI-X I/O latency
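The two HyperTransport timings can be reproduced with a short calculation. The sketch below assumes a 16-bit link, and it assumes that the 64B block figure includes the 64-bit header. Neither is stated on the slide, but both are consistent with the quoted 1.25ns and 22.5ns.

```python
def ht_transfer_time_ns(payload_bits, link_width_bits=16, clock_mhz=800):
    """Time to move payload_bits over one direction of a double-pumped link."""
    transfers = payload_bits / link_width_bits
    ns_per_transfer = 1000 / (clock_mhz * 2)  # 0.625 ns at 800 MHz DDR
    return transfers * ns_per_transfer


REQUEST_BITS = 32       # 32-bit command word
HEADER_BITS = 64        # header that precedes address and data
BLOCK_BITS = 64 * 8     # one 64-byte block

print(ht_transfer_time_ns(REQUEST_BITS))              # 1.25
print(ht_transfer_time_ns(HEADER_BITS + BLOCK_BITS))  # 22.5
```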

# HyperTransport<sup>™</sup> Technology Pin count

### Additional control signals

- Power OK (PWROK)
- Reset (RESET\_L)
- Signal to ground ratio is conservatively 4:1
- Optional link power down signals for mobile systems
  - LDT\_Stop
  - DevReq
- Power per pin-pair is nil when a HyperTransport<sup>™</sup> technology device is stopped (LDT\_Stop)



PWROK, RESET\_L required for proper reset & init

$V_{\text{HT}}$ routed between devices is required for proper common mode range

| Bus Width (Both Ways) | 2      | 4      | 8      | 16      | 32      |
|-----------------------|--------|--------|--------|---------|---------|
| Data Pins (total)     | 8      | 16     | 32     | 64      | 128     |
| Clock Pins (total)    | 4      | 4      | 4      | 8       | 16      |
| Control Pins (total)  | 4      | 4      | 4      | 4       | 4       |
| Subtotal (high speed) | 16     | 24     | 40     | 76      | 148     |
| VHT                   | 2      | 2      | 3      | 6       | 10      |
| GND                   | 4      | 6      | 10     | 19      | 37      |
| PWROK                 | 1      | 1      | 1      | 1       | 1       |
| RESET_L               | 1      | 1      | 1      | 1       | 1       |
| Total                 | **24** | **34** | **55** | **103** | **197** |

- DC Power per Pin-Pair: 4 - 9 mW, 6 mW typical
- Signal to $V_{LDT}$/Gnd Ratio: 4:1
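The pin counts in the table can be derived from the link width. The sketch below assumes differential signaling (two wires per signal), one forwarded clock per eight data bits in each direction, one control signal per direction, and one ground pin per four high-speed pins. The VHT counts are copied from the table because no formula for them is given. The function and field names are hypothetical.

```python
import math

# VHT pin counts are taken from the table; no formula is assumed for them.
VHT_PINS = {2: 2, 4: 2, 8: 3, 16: 6, 32: 10}


def ht_pin_count(width_bits):
    """Pin budget for a link that is width_bits wide in each direction."""
    directions = 2         # one unidirectional link each way
    wires_per_signal = 2   # assumed differential pairs
    data = width_bits * directions * wires_per_signal
    # One forwarded clock per eight data bits (or part thereof) per direction.
    clock = math.ceil(width_bits / 8) * directions * wires_per_signal
    control = 1 * directions * wires_per_signal
    high_speed = data + clock + control
    ground = high_speed // 4  # 4:1 signal-to-ground ratio
    total = high_speed + VHT_PINS[width_bits] + ground + 2  # PWROK + RESET_L
    return {"data": data, "clock": clock, "control": control,
            "high_speed": high_speed, "ground": ground, "total": total}


for width in VHT_PINS:
    print(width, ht_pin_count(width))
```

Under these assumptions the totals come out as 24, 34, 55, 103, and 197 pins for 2-, 4-, 8-, 16-, and 32-bit links.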

# HyperTransport<sup>™</sup> Technology Intelligence

- Data movement over the HyperTransport<sup>™</sup> bus does not use any CPU machine cycles.
- An external device can write to any address within the processor's physical 40-bit address range without CPU intervention.
- In cases where there are multiple HyperTransport<sup>™</sup> technology ports, data can be passed between ports without CPU intervention.
- Because all devices reside within one physical 2^40 linear space, all I/O devices have access to all processors and their associated memory & I/O.











# Building Blocks





# Future Building Blocks



#### HyperTransport<sup>™</sup> Technology Consortium

www.hypertransport.org

Member logos shown include Hifn, LSI Logic, Toshiba, FuturePlus Systems, Cisco Systems, QuickLogic, NurLogic, Broadcom, GDA Technologies, Tektronix, Schlumberger, and Transmeta.

AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, 3DNow! and combinations thereof, AMD-8111, AMD-8131, AMD-8132, and AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Pentium and MMX are registered trademarks of Intel Corporation in the U.S. and/or other jurisdictions. SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation in the U.S. and/or other jurisdictions. Alpha is a trademark of Digital Equipment Corporation. MIPS is a registered trademark of MIPS Technologies, Inc. Other product and company names used in this presentation are for identification purposes only and may be trademarks of their respective companies.