《并行计算机体系结构 英文版·第2版》 (Parallel Computer Architecture, English Edition, 2nd Edition)

1 Introduction  1
1.1 Why Parallel Architecture  4
1.1.1 Application Trends  6
1.1.2 Technology Trends  12
1.1.3 Architectural Trends  14
1.1.4 Supercomputers  21
1.1.5 Summary  23
1.2 Convergence of Parallel Architectures  25
1.2.1 Communication Architecture  25
1.2.2 Shared Address Space  28
1.2.3 Message Passing  37
1.2.4 Convergence  42
1.2.5 Data Parallel Processing  44
1.2.6 Other Parallel Architectures  47
1.2.7 A Generic Parallel Architecture  50
1.3 Fundamental Design Issues  52
1.3.1 Communication Abstraction  53
1.3.2 Programming Model Requirements  53
1.3.3 Communication and Replication  58
1.3.4 Performance  59
1.3.5 Summary  63
1.4 Concluding Remarks  63
1.5 Historical References  66
1.6 Exercises  70

2 Parallel Programs  75
2.1 Parallel Application Case Studies  76
2.1.1 Simulating Ocean Currents  77
2.1.2 Simulating the Evolution of Galaxies  78
2.1.3 Visualizing Complex Scenes Using Ray Tracing  79
2.1.4 Mining Data for Associations  80
2.2 The Parallelization Process  81
2.2.1 Steps in the Process  82
2.2.2 Parallelizing Computation versus Data  90
2.2.3 Goals of the Parallelization Process  91
2.3 Parallelization of an Example Program  92
2.3.1 The Equation Solver Kernel  92
2.3.2 Decomposition  93
2.3.3 Assignment  98
2.3.4 Orchestration under the Data Parallel Model  99
2.3.5 Orchestration under the Shared Address Space Model  101
2.3.6 Orchestration under the Message-Passing Model  108
2.4 Concluding Remarks  116
2.5 Exercises  117

3 Programming for Performance  121
3.1 Partitioning for Performance  123
3.1.1 Load Balance and Synchronization Wait Time  123
3.1.2 Reducing Inherent Communication  131
3.1.3 Reducing the Extra Work  135
3.1.4 Summary  136
3.2 Data Access and Communication in a Multimemory System  137
3.2.1 A Multiprocessor as an Extended Memory Hierarchy  138
3.2.2 Artifactual Communication in the Extended Memory Hierarchy  139
3.2.3 Artifactual Communication and Replication: The Working Set Perspective  140
3.3 Orchestration for Performance  142
3.3.1 Reducing Artifactual Communication  142
3.3.2 Structuring Communication to Reduce Cost  150
3.4 Performance Factors from the Processor's Perspective  156
3.5 The Parallel Application Case Studies: An In-Depth Look  160
3.5.1 Ocean  161
3.5.2 Barnes-Hut  166
3.5.3 Raytrace  174
3.5.4 Data Mining  178
3.6 Implications for Programming Models  182
3.6.1 Naming  184
3.6.2 Replication  184
3.6.3 Overhead and Granularity of Communication  186
3.6.4 Block Data Transfer  187
3.6.5 Synchronization  188
3.6.6 Hardware Cost and Design Complexity  188
3.6.7 Performance Model  189
3.6.8 Summary  189
3.7 Concluding Remarks  190
3.8 Exercises  192

4 Workload-Driven Evaluation  199
4.1 Scaling Workloads and Machines  202
4.1.1 Basic Measures of Multiprocessor Performance  202
4.1.2 Why Worry about Scaling?  204
4.1.3 Key Issues in Scaling  206
4.1.4 Scaling Models and Speedup Measures  207
4.1.5 Impact of Scaling Models on the Equation Solver Kernel  211
4.1.6 Scaling Workload Parameters  213
4.2 Evaluating a Real Machine  215
4.2.1 Performance Isolation Using Microbenchmarks  215
4.2.2 Choosing Workloads  216
4.2.3 Evaluating a Fixed-Size Machine  221
4.2.4 Varying Machine Size  226
4.2.5 Choosing Performance Metrics  228
4.3 Evaluating an Architectural Idea or Trade-off  231
4.3.1 Multiprocessor Simulation  233
4.3.2 Scaling Down Problem and Machine Parameters for Simulation  234
4.3.3 Dealing with the Parameter Space: An Example Evaluation  238
4.3.4 Summary  243
4.4 Illustrating Workload Characterization  243
4.4.1 Workload Case Studies  244
4.4.2 Workload Characteristics  253
4.5 Concluding Remarks  262
4.6 Exercises  263

5 Shared Memory Multiprocessors  269
5.1 Cache Coherence  273
5.1.1 The Cache Coherence Problem  273
5.1.2 Cache Coherence through Bus Snooping  277
5.2 Memory Consistency  283
5.2.1 Sequential Consistency  286
5.2.2 Sufficient Conditions for Preserving Sequential Consistency  289
5.3 Design Space for Snooping Protocols  291
5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol  293
5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol  299
5.3.3 A Four-State (Dragon) Write-Back Update Protocol  301
5.4 Assessing Protocol Design Trade-offs  305
5.4.1 Methodology  306
5.4.2 Bandwidth Requirement under the MESI Protocol  307
5.4.3 Impact of Protocol Optimizations  311
5.4.4 Trade-Offs in Cache Block Size  313
5.4.5 Update-Based versus Invalidation-Based Protocols  329
5.5 Synchronization  334
5.5.1 Components of a Synchronization Event  335
5.5.2 Role of the User and System  336
5.5.3 Mutual Exclusion  337
5.5.4 Point-to-Point Event Synchronization  352
5.5.5 Global (Barrier) Event Synchronization  353
5.5.6 Synchronization Summary  358
5.6 Implications for Software  359
5.7 Concluding Remarks  366
5.8 Exercises  367

6 Snoop-Based Multiprocessor Design  377
6.1 Correctness Requirements  378
6.2 Base Design: Single-Level Caches with an Atomic Bus  380
6.2.1 Cache Controller and Tag Design  381
6.2.2 Reporting Snoop Results  382
6.2.3 Dealing with Write Backs  384
6.2.4 Base Organization  385
6.2.5 Nonatomic State Transitions  385
6.2.6 Serialization  388
6.2.7 Deadlock  390
6.2.8 Livelock and Starvation  390
6.2.9 Implementing Atomic Operations  391
6.3 Multilevel Cache Hierarchies  393
6.3.1 Maintaining Inclusion  394
6.3.2 Propagating Transactions for Coherence in the Hierarchy  396
6.4 Split-Transaction Bus  398
6.4.1 An Example Split-Transaction Design  400
6.4.2 Bus Design and Request-Response Matching  400
6.4.3 Snoop Results and Conflicting Requests  402
6.4.4 Flow Control  404
6.4.5 Path of a Cache Miss  404
6.4.6 Serialization and Sequential Consistency  406
6.4.7 Alternative Design Choices  409
6.4.8 Split-Transaction Bus with Multilevel Caches  410
6.4.9 Supporting Multiple Outstanding Misses from a Processor  413
6.5 Case Studies: SGI Challenge and Sun Enterprise 6000  415
6.5.1 SGI Powerpath-2 System Bus  417
6.5.2 SGI Processor and Memory Subsystems  420
6.5.3 SGI I/O Subsystem  422
6.5.4 SGI Challenge Memory System Performance  424
6.5.5 Sun Gigaplane System Bus  424
6.5.6 Sun Processor and Memory Subsystem  427
6.5.7 Sun I/O Subsystem  429
6.5.8 Sun Enterprise Memory System Performance  429
6.5.9 Application Performance  429
6.6 Extending Cache Coherence  433
6.6.1 Shared Cache Designs  434
6.6.2 Coherence for Virtually Indexed Caches  437
6.6.3 Translation Lookaside Buffer Coherence  439
6.6.4 Snoop-Based Cache Coherence on Rings  441
6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems  445
6.7 Concluding Remarks  446
6.8 Exercises  446

7 Scalable Multiprocessors  453
7.1 Scalability  456
7.1.1 Bandwidth Scaling  457
7.1.2 Latency Scaling  460
7.1.3 Cost Scaling  461
7.1.4 Physical Scaling  462
7.1.5 Scaling in a Generic Parallel Architecture  467
7.2 Realizing Programming Models  468
7.2.1 Primitive Network Transactions  470
7.2.2 Shared Address Space  473
7.2.3 Message Passing  476
7.2.4 Active Messages  481
7.2.5 Common Challenges  482
7.2.6 Communication Architecture Design Space  485
7.3 Physical DMA  486
7.3.1 Node-to-Network Interface  486
7.3.2 Implementing Communication Abstractions  488
7.3.3 A Case Study: nCUBE/2  488
7.3.4 Typical LAN Interfaces  490
7.4 User-Level Access  491
7.4.1 Node-to-Network Interface  491
7.4.2 Case Study: Thinking Machines CM-5  493
7.4.3 User-Level Handlers  494
7.5 Dedicated Message Processing  496
7.5.1 Case Study: Intel Paragon  499
7.5.2 Case Study: Meiko CS-2  503
7.6 Shared Physical Address Space  506
7.6.1 Case Study: CRAY T3D  508
7.6.2 Case Study: CRAY T3E  512
7.6.3 Summary  513
7.7 Clusters and Networks of Workstations  513
7.7.1 Case Study: Myrinet SBUS Lanai  516
7.7.2 Case Study: PCI Memory Channel  518
7.8 Implications for Parallel Software  522
7.8.1 Network Transaction Performance  522
7.8.2 Shared Address Space Operations  527
7.8.3 Message-Passing Operations  528
7.8.4 Application-Level Performance  531
7.9 Synchronization  538
7.9.1 Algorithms for Locks  538
7.9.2 Algorithms for Barriers  542
7.10 Concluding Remarks  548
7.11 Exercises  548

8 Directory-Based Cache Coherence  553
8.1 Scalable Cache Coherence  558
8.2 Overview of Directory-Based Approaches  559
8.2.1 Operation of a Simple Directory Scheme  560
8.2.2 Scaling  564
8.2.3 Alternatives for Organizing Directories  565
8.3 Assessing Directory Protocols and Trade-Offs  571
8.3.1 Data Sharing Patterns for Directory Schemes  571
8.3.2 Local versus Remote Traffic  578
8.3.3 Cache Block Size Effects  579
8.4 Design Challenges for Directory Protocols  579
8.4.1 Performance  584
8.4.2 Correctness  589
8.5 Memory-Based Directory Protocols: The SGI Origin System  596
8.5.1 Cache Coherence Protocol  597
8.5.2 Dealing with Correctness Issues  604
8.5.3 Details of Directory Structure  609
8.5.4 Protocol Extensions  610
8.5.5 Overview of the Origin2000 Hardware  612
8.5.6 Hub Implementation  614
8.5.7 Performance Characteristics  618
8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q  622
8.6.1 Cache Coherence Protocol  624
8.6.2 Dealing with Correctness Issues  632
8.6.3 Protocol Extensions  634
8.6.4 Overview of NUMA-Q Hardware  635
8.6.5 Protocol Interaction with SMP Node  637
8.6.6 IQ-Link Implementation  639
8.6.7 Performance Characteristics  641
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor  643
8.7 Performance Parameters and Protocol Performance  645
8.8 Synchronization  648
8.8.1 Performance of Synchronization Algorithms  649
8.8.2 Implementing Atomic Primitives  651
8.9 Implications for Parallel Software  652
8.10 Advanced Topics  655
8.10.1 Reducing Directory Storage Overhead  655
8.10.2 Hierarchical Coherence  659
8.11 Concluding Remarks  669
8.12 Exercises  672

9 Hardware/Software Trade-Offs  679
9.1 Relaxed Memory Consistency Models  681
9.1.1 The System Specification  686
9.1.2 The Programmer's Interface  694
9.1.3 The Translation Mechanism  698
9.1.4 Consistency Models in Real Multiprocessor Systems  698
9.2 Overcoming Capacity Limitations  700
9.2.1 Tertiary Caches  700
9.2.2 Cache-Only Memory Architectures (COMA)  701
9.3 Reducing Hardware Cost  705
9.3.1 Hardware Access Control with a Decoupled Assist  707
9.3.2 Access Control through Code Instrumentation  707
9.3.3 Page-Based Access Control: Shared Virtual Memory  709
9.3.4 Access Control through Language and Compiler Support  721
9.4 Putting It All Together: A Taxonomy and Simple COMA  724
9.4.1 Putting It All Together: Simple COMA and Stache  726
9.5 Implications for Parallel Software  729
9.6 Advanced Topics  730
9.6.1 Flexibility and Address Constraints in CC-NUMA Systems  730
9.6.2 Implementing Relaxed Memory Consistency in Software  732
9.7 Concluding Remarks  739
9.8 Exercises  740

10 Interconnection Network Design  749
10.1 Basic Definitions  750
10.2 Basic Communication Performance  755
10.2.1 Latency  755
10.2.2 Bandwidth  761
10.3 Organizational Structure  764
10.3.1 Links  764
10.3.2 Switches  767
10.3.3 Network Interfaces  768
10.4 Interconnection Topologies  768
10.4.1 Fully Connected Network  768
10.4.2 Linear Arrays and Rings  769
10.4.3 Multidimensional Meshes and Tori  769
10.4.4 Trees  772
10.4.5 Butterflies  774
10.4.6 Hypercubes  778
10.5 Evaluating Design Trade-Offs in Network Topology  779
10.5.1 Unloaded Latency  780
10.5.2 Latency under Load  785
10.6 Routing  789
10.6.1 Routing Mechanisms  789
10.6.2 Deterministic Routing  790
10.6.3 Deadlock Freedom  791
10.6.4 Virtual Channels  795
10.6.5 Up*-Down* Routing  796
10.6.6 Turn-Model Routing  797
10.6.7 Adaptive Routing  799
10.7 Switch Design  801
10.7.1 Ports  802
10.7.2 Internal Datapath  802
10.7.3 Channel Buffers  804
10.7.4 Output Scheduling  808
10.7.5 Stacked Dimension Switches  810
10.8 Flow Control  811
10.8.1 Parallel Computer Networks versus LANs and WANs  811
10.8.2 Link-Level Flow Control  813
10.8.3 End-to-End Flow Control  816
10.9 Case Studies  818
10.9.1 CRAY T3D Network  818
10.9.2 IBM SP-1, SP-2 Network  820
10.9.3 Scalable Coherent Interface  822
10.9.4 SGI Origin Network  825
10.9.5 Myricom Network  826
10.10 Concluding Remarks  827
10.11 Exercises  828

11 Latency Tolerance  831
11.1 Overview of Latency Tolerance  834
11.1.1 Latency Tolerance and the Communication Pipeline  836
11.1.2 Approaches  837
11.1.3 Fundamental Requirements, Benefits, and Limitations  840
11.2 Latency Tolerance in Explicit Message Passing  847
11.2.1 Structure of Communication  848
11.2.2 Block Data Transfer  848
11.2.3 Precommunication  848
11.2.4 Proceeding Past Communication in the Same Thread  850
11.2.5 Multithreading  850
11.3 Latency Tolerance in a Shared Address Space  851
11.3.1 Structure of Communication  852
11.4 Block Data Transfer in a Shared Address Space  853
11.4.1 Techniques and Mechanisms  853
11.4.2 Policy Issues and Trade-Offs  854
11.4.3 Performance Benefits  856
11.5 Proceeding Past Long-Latency Events  863
11.5.1 Proceeding Past Writes  864
11.5.2 Proceeding Past Reads  868
11.5.3 Summary  876
11.6 Precommunication in a Shared Address Space  877
11.6.1 Shared Address Space without Caching of Shared Data  877
11.6.2 Cache-Coherent Shared Address Space  879
11.6.3 Performance Benefits  891
11.6.4 Summary  896
11.7 Multithreading in a Shared Address Space  896
11.7.1 Techniques and Mechanisms  898
11.7.2 Performance Benefits  910
11.7.3 Implementation Issues for the Blocked Scheme  914
11.7.4 Implementation Issues for the Interleaved Scheme  917
11.7.5 Integrating Multithreading with Multiple-Issue Processors  920
11.8 Lockup-Free Cache Design  922
11.9 Concluding Remarks  926
11.10 Exercises  927

12 Future Directions  935
12.1 Technology and Architecture  936
12.1.1 Evolutionary Scenario  937
12.1.2 Hitting a Wall  940
12.1.3 Potential Breakthroughs  944
12.2 Applications and System Software  955
12.2.1 Evolutionary Scenario  955
12.2.2 Hitting a Wall  960
12.2.3 Potential Breakthroughs  961

Appendix: Parallel Benchmark Suites  963
A.1 ScaLapack  963
A.2 TPC  963
A.3 SPLASH  965
A.4 NAS Parallel Benchmarks  966
A.5 PARKBENCH  967
A.6 Other Ongoing Efforts  968
References  969
Index  993

The 1999 printing of 《并行计算机体系结构 英文版·第2版》 is an older title that has long been out of print, so a physical copy is practically impossible to buy. If you genuinely need it for study, you can ask the blog author for the electronic PDF (the edition by David E. Culler et al., published in 1999 by 机械工业出版社 / China Machine Press, Beijing). Legitimate and lawful requests will be handled promptly, and a download link will be sent to you.

Closely related materials

计算机系统结构 (1997 PDF edition), 重庆: 重庆大学出版社
计算机系统结构 (1992 PDF edition), 北京: 北京航空航天大学出版社
计算机系统结构 第2版 (1984 PDF edition), 西安: 西安电子科技大学出版社
并行处理计算机结构 (1982 PDF edition), 北京: 国防工业出版社
并行计算结构力学 (1993 PDF edition), 重庆: 重庆大学出版社
并行计算 结构·算法·编程 (1999 PDF edition), 北京: 高等教育出版社
计算机结构与并行处理 (1990 PDF edition), 北京: 科学出版社
实用计算机体系结构 (1982 PDF edition), 北京: 人民邮电出版社
高级计算机体系结构 英文 (1999 PDF edition), 北京: 机械工业出版社
计算机体系结构量化研究方法 英文版·第2版 (1999 PDF edition), 北京: 机械工业出版社
可扩展并行计算 技术、结构与编程 英文版 (1999 PDF edition), 北京: 机械工业出版社
砌体结构 第2版 (1992 PDF edition), 武汉: 武汉工业大学出版社
砌体结构 第2版 (1995 PDF edition), 北京: 中国建筑工业出版社
计算机体系结构 (1988 PDF edition), 北京: 中国铁道出版社
并行计算机体系结构、程序设计及算法 (1987 PDF edition), 北京: 清华大学出版社