High Performance Parallelism Pearls Volume Two: Multicore and Many-core Programming Approaches, 1st Edition, by Jim Jeffers and James Reinders
Product details:
ISBN-10: 012803890X
ISBN-13: 9780128038901
Authors: Jim Jeffers, James Reinders
High Performance Parallelism Pearls Volume Two offers another set of examples that demonstrate how to leverage parallelism. As in Volume One, the techniques show how to program processors and coprocessors with the same programming model, illustrating the most effective ways to combine Intel Xeon Phi coprocessors with Intel Xeon and other multicore processors. The book collects successful programming efforts drawn from industries and domains such as biomedicine, genetics, finance, manufacturing, and imaging. Each chapter in this edited work explains the programming techniques used in detail and presents performance results on both Intel Xeon Phi coprocessors and multicore processors. Dozens of new examples and case studies serve as "success stories" that demonstrate not just the features of these Xeon-based systems, but how to exploit parallelism across heterogeneous systems.
High Performance Parallelism Pearls Volume Two, 1st Edition: Table of Contents
Chapter 1: Introduction
Abstract
Applications and techniques
SIMD and vectorization
OpenMP and nested parallelism
Latency optimizations
Python
Streams
Ray tracing
Tuning prefetching
MPI shared memory
Using every last core
OpenCL vs. OpenMP
Power analysis for nodes and clusters
The future of many-core
Downloads
Chapter 2: Numerical Weather Prediction Optimization
Abstract
Numerical weather prediction: Background and motivation
WSM6 in the NIM
Shared-memory parallelism and controlling horizontal vector length
Array alignment
Loop restructuring
Compile-time constants for loop and array bounds
Performance improvements
Summary
Chapter 3: WRF Goddard Microphysics Scheme Optimization
Abstract
Acknowledgments
The motivation and background
WRF Goddard microphysics scheme
Summary
Chapter 4: Pairwise DNA Sequence Alignment Optimization
Abstract
Pairwise sequence alignment
Parallelization on a single coprocessor
Parallelization across multiple coprocessors using MPI
Performance results
Summary
Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery
Abstract
Parallelism enables proteome-scale structural bioinformatics
Overview of eFindSite
Benchmarking dataset
Code profiling
Porting eFindSite for coprocessor offload
Parallel version for a multicore processor
Task-level scheduling for processor and coprocessor
Case study
Summary
Chapter 6: Amber PME Molecular Dynamics Optimization
Abstract
Theory of MD
Acceleration of neighbor list building using the coprocessor
Acceleration of direct space sum using the coprocessor
Additional optimizations in coprocessor code
Modification of load balance algorithm
Compiler optimization flags
Results
Conclusions
Chapter 7: Low-Latency Solutions for Financial Services Applications
Abstract
Introduction
The opportunity
Packet processing architecture
The symmetric communication interface
Optimizing packet processing on the coprocessor
Results
Conclusions
Chapter 8: Parallel Numerical Methods in Finance
Abstract
Overview
Introduction
Pricing equation for American option
Initial C/C++ implementation
Scalar optimization: Your best first step
SIMD parallelism—Vectorization
Thread parallelization
Scale from multicore to many-core
Summary
For more information
Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization
Abstract
The Wilson-Dslash kernel
First implementation and performance
Optimized code: QPhiX and QPhiX-Codegen
Code generation with QPhiX-Codegen
Performance results for QPhiX
The end of the road?
Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism in Practice
Abstract
Analyzing the CMB with Modal
Optimization and modernization
Introducing nested parallelism
Results
Summary
Chapter 11: Visual Search Optimization
Abstract
Image-matching application
Image acquisition and processing
Keypoint matching
Applications
A study of parallelism in the visual search application
Database (db) level parallelism
FLANN library parallelism
Experimental evaluation
Setup
Database threads scaling
FLANN threads scaling
KD-tree scaling with dbthreads
Summary
Chapter 12: Radio Frequency Ray Tracing
Abstract
Acknowledgments
Background
StingRay system architecture
Optimization examples
Summary
Chapter 13: Exploring Use of the Reserved Core
Abstract
Acknowledgments
The Uintah computational framework
Cross-compiling the UCF
Toward demystifying the reserved core
Experimental discussion
Summary
Chapter 14: High Performance Python Offloading
Abstract
Acknowledgments
Background
The pyMIC offload module
Example: singular value decomposition
GPAW
PyFR
Performance
Summary
Chapter 15: Fast Matrix Computations on Heterogeneous Streams
Abstract
The challenge of heterogeneous computing
Matrix multiply
The hStreams library and framework
Cholesky factorization
LU factorization
Continuing work on hStreams
Acknowledgments
Recap
Summary
Tiled hStreams matrix multiplier example source
Chapter 16: MPI-3 Shared Memory Programming Introduction
Abstract
Motivation
MPI’s interprocess shared memory extension
When to use MPI interprocess shared memory
1-D ring: from MPI messaging to shared memory
Modifying MPPTEST halo exchange to include MPI SHM
Evaluation environment and results
Summary
Chapter 17: Coarse-Grained OpenMP for Scalable Hybrid Parallelism
Abstract
Coarse-grained versus fine-grained parallelism
Flesh on the bones: A FORTRAN “stencil-test” example
Performance results with the stencil code
Parallelism in numerical weather prediction models
Summary
Chapter 18: Exploiting Multilevel Parallelism in Quantum Simulations
Abstract
Science: better approximate solutions
About the reference application
Parallelism in ES applications
Multicore and many-core architectures for quantum simulations
Setting up experiments
User code experiments
Summary: try multilevel parallelism in your applications
Chapter 19: OpenCL: There and Back Again
Abstract
Acknowledgments
The GPU-HEOM application
The Hexciton kernel
Optimizing the OpenCL Hexciton kernel
Performance portability in OpenCL
Porting the OpenCL kernel to OpenMP 4.0
Summary
Chapter 20: OpenMP Versus OpenCL: Difference in Performance?
Abstract
Five benchmarks
Experimental setup and time measurements
HotSpot benchmark optimization
Optimization steps for the other four benchmarks
Summary
Chapter 21: Prefetch Tuning Optimizations
Abstract
Acknowledgments
The importance of prefetching for performance
Prefetching on Intel Xeon Phi coprocessors
Throughput applications
Tuning prefetching
Results—Prefetch tuning examples on a coprocessor
Results—Tuning hardware prefetching on a processor
Summary
Chapter 22: SIMD Functions Via OpenMP
Abstract
SIMD vectorization overview
Directive guided vectorization
Targeting specific architectures
Vector functions in C++
Vector functions in Fortran
Performance results
Summary
Chapter 23: Vectorization Advice
Abstract
The importance of vectorization
About DL_MESO LBE
Intel vectorization advisor and the underlying technology
Analyzing the Lattice Boltzmann code
Summary
Chapter 24: Portable Explicit Vectorization Intrinsics
Abstract
Acknowledgments
Related work
Why vectorization?
Portable vectorization with OpenVec
Real-world example
Performance results
Developing toward the future
Summary
Chapter 25: Power Analysis for Applications and Data Centers
Abstract
Introduction to measuring and saving power
Application: Power measurement and analysis
Data center: Interpretation via waterfall power data charts