
High Performance Computing

Download for free at http://cnx.org/contents/bb821554-7f76-44b1-89e7-8a2a759d1347@5.2

Table of contents
Chapter One. Modern Computer Architectures
1.1 - Memory
1.1.1 - Introduction
1.1.2 - Memory Technology
1.1.3 - Registers
1.1.4 - Caches
1.1.5 - Cache Organization
1.1.6 - Virtual Memory
1.1.7 - Improving Memory Performance
1.1.8 - Closing Notes
1.1.9 - Exercises
1.2 - Floating-Point Numbers
1.2.1 - Introduction
1.2.2 - Reality
1.2.3 - Representation
1.2.4 - Effects of Floating-Point Representation
1.2.5 - More Algebra That Doesn't Work
1.2.6 - Improving Accuracy Using Guard Digits
1.2.7 - History of IEEE Floating-Point Format
1.2.8 - IEEE Operations
1.2.9 - Special Values
1.2.10 - Exceptions and Traps
1.2.11 - Compiler Issues
1.2.12 - Closing Notes
1.2.13 - Exercises
Chapter Two. Programming and Tuning Software
2.1 - What a Compiler Does
2.1.1 - Introduction
2.1.2 - History of Compilers
2.1.3 - Which Language To Optimize
2.1.4 - Optimizing Compiler Tour
2.1.5 - Optimization Levels
2.1.6 - Classical Optimizations
2.1.7 - Closing Notes
2.1.8 - Exercises
2.2 - Timing and Profiling
2.2.1 - Introduction
2.2.2 - Timing
2.2.3 - Subroutine Profiling
2.2.4 - Basic Block Profilers
2.2.5 - Virtual Memory
2.2.6 - Closing Notes
2.2.7 - Exercises
2.3 - Eliminating Clutter
2.3.1 - Introduction
2.3.2 - Subroutine Calls
2.3.3 - Branches
2.3.4 - Branches With Loops
2.3.5 - Other Clutter
2.3.6 - Closing Notes
2.3.7 - Exercises
2.4 - Loop Optimizations
2.4.1 - Introduction
2.4.2 - Operation Counting
2.4.3 - Basic Loop Unrolling
2.4.4 - Qualifying Candidates for Loop Unrolling
2.4.5 - Nested Loops
2.4.6 - Loop Interchange
2.4.7 - Memory Access Patterns
2.4.8 - When Interchange Won't Work
2.4.9 - Blocking to Ease Memory Access Patterns
2.4.10 - Programs That Require More Memory Than You Have
2.4.11 - Closing Notes
2.4.12 - Exercises
Chapter Three. Shared-Memory Parallel Processors
3.1 - Understanding Parallelism
3.1.1 - Introduction
3.1.2 - Dependencies
3.1.3 - Loops
3.1.4 - Loop-Carried Dependencies
3.1.5 - Ambiguous References
3.1.6 - Closing Notes
3.1.7 - Exercises
3.2 - Shared-Memory Multiprocessors
3.2.1 - Introduction
3.2.2 - Symmetric Multiprocessing Hardware
3.2.3 - Multiprocessor Software Concepts
3.2.4 - Techniques for Multithreaded Programs
3.2.5 - A Real Example
3.2.6 - Closing Notes
3.2.7 - Exercises
3.3 - Programming Shared-Memory Multiprocessors
3.3.1 - Introduction
3.3.2 - Automatic Parallelization
3.3.3 - Assisting the Compiler
3.3.4 - Closing Notes
3.3.5 - Exercises
Chapter Four. Scalable Parallel Processing
4.1 - Language Support for Performance
4.1.1 - Introduction
4.1.2 - Data-Parallel Problem: Heat Flow
4.1.3 - Explicitly Parallel Languages
4.1.4 - FORTRAN 90
4.1.5 - Problem Decomposition
4.1.6 - High Performance FORTRAN (HPF)
4.1.7 - Closing Notes
4.2 - Message-Passing Environments
4.2.1 - Introduction
4.2.2 - Parallel Virtual Machine
4.2.3 - Message-Passing Interface
4.2.4 - Closing Notes
Chapter Five. Appendixes
5.1 - Appendix C: High Performance Microprocessors
5.1.1 - Introduction
5.1.2 - Why CISC?
5.1.3 - Fundamentals of RISC
5.1.4 - Second-Generation RISC Processors
5.1.5 - RISC Means Fast
5.1.6 - Out-of-Order Execution: The Post-RISC Architecture
5.1.7 - Closing Notes
5.1.8 - Exercises
5.2 - Appendix B: Looking at Assembly Language
5.2.1 - Assembly Language
Chapter Six. Attributions
High Performance Computing
1st Edition
Charles Severance, Kevin Dowd
© 25 Aug. 2010 Charles Severance, Kevin Dowd. Textbook content produced by Charles Severance, Kevin Dowd is licensed under a Creative Commons Attribution 3.0 license.
Introduction
High Performance Computing

By:

Charles Severance

Kevin Dowd

Online:

< http://cnx.org/content/col11136/1.5/ >

C O N N E X I O N S

Rice University, Houston, Texas

This selection and arrangement of content as a collection is copyrighted by Charles Severance, Kevin Dowd. It is licensed under the Creative Commons Attribution 3.0 license (http://creativecommons.org/licenses/by/3.0/).

Collection structure revised: August 25, 2010

For copyright and attribution information for the modules contained in this collection, see p. 244.

Introduction.1. Introduction to the Connexions Edition

Introduction to the Connexions Edition

The purpose of this book has always been to teach new programmers and scientists about the basics of High Performance Computing. Too many parallel and high performance computing books focus on the architecture, theory and computer science surrounding HPC. I wanted this book to speak to the practicing Chemistry student, Physicist, or Biologist who needs to write and run their programs as part of their research. I was using the first edition of the book written by Kevin Dowd in 1996 when I found out that the book was going out of print. I immediately sent an angry letter to O'Reilly customer support imploring them to keep the book going as it was the only book of its kind in the marketplace. That complaint letter triggered several conversations which led to me becoming the author of the second edition. In true "open-source" fashion - since I complained about it - I got to fix it. During Fall 1997, while I was using the book to teach my HPC course, I re-wrote the book one chapter at a time, fueled by multiple late-night lattes and the fear of not having anything ready for the week's lecture.

The second edition came out in July 1998, and was pretty well received. I got many good comments from teachers and scientists who felt that the book did a good job of teaching the practitioner - which made me very happy.

In 1998, this book was published at a crossroads in the history of High Performance Computing. In the late 1990s there was still a question as to whether the large vector supercomputers with their specialized memory systems could resist the assault from the increasing clock rates of the microprocessors. Also in the late 1990s there was a question whether the fast, expensive, and power-hungry RISC architectures would win over the commodity Intel microprocessors and commodity memory technologies.

By 2003, the market had decided that the commodity microprocessor was king - its performance and the performance of commodity memory subsystems kept increasing so rapidly. By 2006, the Intel architecture had eliminated all the RISC architecture processors by greatly increasing clock rate and truly winning the increasingly important Floating Point Operations per Watt competition. Once users figured out how to effectively use loosely coupled processors, the overall cost and improving energy consumption of commodity microprocessors became overriding factors in the marketplace.

These changes led to the book becoming less and less relevant to the common use cases in the HPC field and led to the book going out of print - much to the chagrin of its small but devoted fan base. I was reduced to buying used copies of the book from Amazon in order to have a few copies lying around the office to give as gifts to unsuspecting visitors.

Thanks to the forward-looking approach of O'Reilly and Associates in using Founder's Copyright and releasing out-of-print books under Creative Commons Attribution, this book once again rises from the ashes like the proverbial Phoenix. By bringing this book to Connexions and publishing it under a Creative Commons Attribution license we are ensuring that the book is never again obsolete. We can take the core elements of the book which are still relevant, and a new community of authors can add to and adapt the book as needed over time.

Publishing through Connexions also keeps the cost of printed books very low and so it will be a wise choice as a textbook for college courses in High Performance Computing. The Creative Commons Licensing and the ability to print locally can make this book available in any country and any school in the world. Like Wikipedia, those of us who use the book can become the volunteers who will help improve the book and become co-authors of the book.

I need to thank Kevin Dowd who wrote the first edition and graciously let me alter it from cover to cover in the second edition. Mike Loukides of O'Reilly was the editor of both the first and second editions and we talk from time to time about a possible future edition of the book. Mike was also instrumental in helping to release the book from O'Reilly under Creative Commons Attribution. The team at Connexions has been wonderful to work with. We share a passion for High Performance Computing and new forms of publishing so that the knowledge reaches as many people as possible. I want to thank Jan Odegard and Kathi Fletcher for encouraging, supporting and helping me through the re-publishing process. Daniel Williamson did an amazing job of converting the materials from the O'Reilly formats to the Connexions formats.

I truly look forward to seeing how far this book will go now that we can have an unlimited number of co-authors to invest in and then use the book. I look forward to working with you all.

Charles Severance - November 12, 2009

Introduction.2. Introduction to High Performance Computing
Introduction.2.1. Why Worry About Performance?

Over the last decade, the definition of what is called high performance computing has changed dramatically. In 1988, an article appeared in the Wall Street Journal titled “Attack of the Killer Micros” that described how computing systems made up of many small inexpensive processors would soon make large supercomputers obsolete. At that time, a “personal computer” costing $3000 could perform 0.25 million floating-point operations per second, a “workstation” costing $20,000 could perform 3 million floating-point operations per second, and a supercomputer costing $3 million could perform 100 million floating-point operations per second. Therefore, why couldn’t we simply connect 400 personal computers together to achieve the same performance as a supercomputer for $1.2 million?
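The arithmetic behind that question can be checked with a short sketch. It uses only the 1988 prices and speeds quoted above; the function names are illustrative, and the calculation assumes ideal scaling, with no cost or time charged for connecting the machines or for communication between them.

```python
import math

# 1988 figures quoted in the text: (price in dollars, speed in MFLOPS).
SYSTEMS = {
    "personal computer": (3_000, 0.25),
    "workstation": (20_000, 3.0),
    "supercomputer": (3_000_000, 100.0),
}


def dollars_per_mflops(price, mflops):
    """Price paid for each million floating-point operations per second."""
    return price / mflops


def cluster_cost(name, target="supercomputer"):
    """Machines of type `name` needed to match `target`, and their total price.

    Assumption: performance adds up perfectly across machines, which
    ignores interconnect hardware and communication overhead.
    """
    price, mflops = SYSTEMS[name]
    _, target_mflops = SYSTEMS[target]
    count = math.ceil(target_mflops / mflops)
    return count, count * price


if __name__ == "__main__":
    for name, (price, mflops) in SYSTEMS.items():
        count, total = cluster_cost(name)
        print(f"{name}: ${dollars_per_mflops(price, mflops):,.0f} per MFLOPS, "
              f"{count} needed, ${total:,} total")
```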

This vision has come true in some ways, but not in the way the original proponents of the “killer micro” theory envisioned. Instead, microprocessor performance has relentlessly gained on supercomputer performance. This has occurred for two reasons. First, there was much more technology “headroom” for improving performance in the personal computer area, whereas the supercomputers of the late 1980s were pushing the performance envelope. Also, once the supercomputer companies broke through some technical barrier, the microprocessor companies could quickly adopt the successful elements of the supercomputer designs a few short years later. The second and perhaps more important factor was the emergence of a thriving personal and business computer market with ever-increasing performance demands. Uses such as 3D graphics, graphical user interfaces, multimedia, and games were the driving factors in this market. With such a large market, available research dollars poured into developing inexpensive high performance processors for the home market. The result of this trend toward faster, smaller computers is directly evident as former supercomputer manufacturers are being purchased by workstation companies (Silicon Graphics purchased Cray, and Hewlett-Packard purchased Convex in 1996).

As a result, nearly every person with computer access has some “high performance” processing. As the peak speeds of these new personal computers increase, these computers encounter all the performance challenges typically found on supercomputers.

While not all users of personal workstations need to know the intimate details of high performance computing, those who program these systems for maximum performance will benefit from an understanding of the strengths and weaknesses of these newest high performance systems.

Introduction.2.2. Scope of High Performance Computing

High performance computing spans a broad range of systems, from our desktop computers through large parallel processing systems. Because most high performance systems are based on reduced instruction set computer (RISC) processors, many techniques learned on one type of system transfer to the other systems.

High performance RISC processors are designed to be easily inserted into a multiple-processor system with 2 to 64 CPUs accessing a single memory using symmetric multiprocessing (SMP). Programming multiple processors to solve a single problem adds its own set of challenges for the programmer. The programmer must be aware of how multiple processors operate together, and how work can be efficiently divided among those processors.

Even though each processor is very powerful, and small numbers of processors can be put into a single enclosure, often there will be applications that are so large they need to span multiple enclosures. In order to cooperate to solve the larger application, these enclosures are linked with a high-speed network to function as a network of workstations (NOW). A NOW can be used individually through a batch queuing system or can be used as a large multicomputer using a message passing tool such as parallel virtual machine (PVM) or message-passing interface (MPI).

For the largest problems with more data interactions and those users with compute budgets in the millions of dollars, there is still the top end of the high performance computing spectrum, the scalable parallel processing systems with hundreds to thousands of processors. These systems come in two flavors. One type is programmed using message passing. Instead of using a standard local area network, these systems are connected using a proprietary, scalable, high-bandwidth, low-latency interconnect (how is that for marketing speak?). Because of the high performance interconnect, these systems can scale to the thousands of processors while keeping the time spent (wasted) performing overhead communications to a minimum.

The second type of large parallel processing system is the scalable non-uniform memory access (NUMA) system. These systems also use a high performance interconnect to connect the processors, but instead of exchanging messages, these systems use the interconnect to implement a distributed shared memory that can be accessed from any processor using a load/store paradigm. This is similar to programming SMP systems except that some areas of memory have slower access than others.

Introduction.2.3. Studying High Performance Computing

The study of high performance computing is an excellent chance to revisit computer architecture. Once we set out on the quest to wring the last bit of performance from our computer systems, we become more motivated to fully understand the aspects of computer architecture that have a direct impact on the system’s performance.

Throughout all of computer history, salespeople have told us that their compiler will solve all of our problems, and that the compiler writers can get the absolute best performance from their hardware. This claim has never been, and probably never will be, completely true. The ability of the compiler to deliver the peak performance available in the hardware improves with each succeeding generation of hardware and software. However, as we move up the hierarchy of high performance computing architectures we can depend on the compiler less and less, and programmers must take responsibility for the performance of their code.

In the single processor and SMP systems with few CPUs, one of our goals as programmers should be to stay out of the way of the compiler. Often constructs used to improve performance on a particular architecture limit our ability to achieve performance on another architecture. Further, these “brilliant” (read obtuse) hand optimizations often confuse a compiler, limiting its ability to automatically transform our code to take advantage of the particular strengths of the computer architecture.

As programmers, it is important to know how the compiler works so we can know when to help it out and when to leave it alone. We also must be aware that as compilers improve (never as much as salespeople claim) it’s best to leave more and more to the compiler.

As we move up the hierarchy of high performance computers, we need to learn new techniques to map our programs onto these architectures, including language extensions, library calls, and compiler directives. As we use these features, our programs become less portable. Also, using these higher-level constructs, we must not make modifications that result in poor performance on the individual RISC microprocessors that often make up the parallel processing system.

Introduction.2.4. Measuring Performance

When a computer is being purchased for computationally intensive applications, it is important to determine how well the system will actually perform this function. One way to choose among a set of competing systems is to have each vendor loan you a system for a period of time to test your applications. At the end of the evaluation period, you could send back the systems that did not make the grade and pay for your favorite system. Unfortunately, most vendors won’t lend you a system for such an extended period of time unless there is some assurance you will eventually purchase the system.

More often we evaluate the system’s potential performance using benchmarks. There are industry benchmarks and your own locally developed benchmarks. Both types of benchmarks require some careful thought and planning for them to be an effective tool in determining the best system for your application.

Introduction.2.5. The Next Step

Quite aside from economics, computer performance is a fascinating and challenging subject. Computer architecture is interesting in its own right and a topic that any computer professional should be comfortable with. Getting the last bit of performance out of an important application can be a stimulating exercise, in addition to an economic necessity. There are probably a few people who simply enjoy matching wits with a clever computer architecture.

What do you need to get into the game?

  • A basic understanding of modern computer architecture. You don’t need an advanced degree in computer engineering, but you do need to understand the basic terminology.
  • A basic understanding of benchmarking, or performance measurement, so you can quantify your own successes and failures and use that information to improve the performance of your application.

This book is intended to be an easily understood introduction and overview of high performance computing. It is an interesting field, and one that will become more important as we make even greater demands on our most common personal computers. In the high performance computer field, there is always a tradeoff between the single CPU performance and the performance of a multiple processor system. Multiple processor systems are generally more expensive and difficult to program (unless you have this book).

Some people claim we eventually will have single CPUs so fast we won’t need to understand any type of advanced architectures that require some skill to program.

So far in this field of computing, even as performance of a single inexpensive microprocessor has increased over a thousandfold, there seems to be no less interest in lashing a thousand of these processors together to get a millionfold increase in power. The cheaper the building blocks of high performance computing become, the greater the benefit for using many processors. If at some point in the future, we have a single processor that is faster than any of the 512-processor scalable systems of today, think how much we could do when we connect 512 of those new processors together in a single system.

That’s what this book is all about. If you’re interested, read on.
