Education
Publications and Reports

PUMAV: Optimizing Parallel Code Performance Through Interactive Visualization
Eric Papenhause, M. Harper Langston, Benoit Meister, Richard Lethin, Klaus Mueller
2018 IEEE Computer Graphics and Applications [Feature Article]
Performance optimization for parallel, looporiented
programs compromises between parallelism and
locality. We present a visualization interface which
allows programmers to assist the compiler in
generating optimal code. It greatly improves the
user’s understanding of the transformations that took
place and aids in making additional transformations in
a visually intuitive way.


Topic Modeling for Analysis of Big Data Tensor Decompositions
Thomas Henretty, M. Harper Langston, Muthu Baskaran, James Ezick, Richard Lethin
SPIE Proceedings Volume 10652, Disruptive Technologies in Information Sciences; 1065208 (2018), doi: 10.1117/12.2306933
Tensor decompositions are a class of algorithms used for unsupervised pattern discovery. Structured, multidimensional datasets are encoded as tensors and decomposed into discrete, coherent patterns captured as weighted collections of highdimensional vectors known as components. Tensor decompositions have recently shown promising results when addressing problems related to data comprehension and anomaly discovery in cybersecurity and intelligence analysis. However, analysis of Big Data tensor decompositions is currently a critical bottleneck owing to the volume and variety of unlabeled patterns that are produced. We present an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and show example applications to representative cybersecurity and geospatial datasets.v

 Adaptive
Inhomogeneous PDE Solvers for Complex Geometries
H. Langston, L. Greengard, and D. Zorin
In Preparation, 2016


Memoryefficient Parallel Tensor Decompositions
Muthu Baskaran, Tom Henretty, Benoit Pradelle, M. Harper Langston, David BrunsSmith, James Ezick, Richard Lethin
2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. [Best Paper Award Winner]
Tensor decompositions are a powerful technique for enabling comprehensive and complete analysis of realworld data. Data analysis through tensor decompositions involves intensive computations over largescale irregular sparse data. Optimizing the execution of such data intensive computations is key to reducing the timetosolution (or response time) in realworld data analysis applications. As highperformance computing (HPC) systems are increasingly used for data analysis applications, it is becoming increasingly important to optimize sparse tensor computations and execute them efficiently on modern and advanced HPC systems. In addition to utilizing the large processing capability of HPC systems, it is crucial to improve memory performance (memory usage, communication, synchronization, memory reuse, and data locality) in HPC systems.
In this paper, we present multiple optimizations that are targeted towards faster and memoryefficient execution of largescale tensor analysis on HPC systems. We demonstrate that our techniques achieve reduction in memory usage and execution time of tensor decomposition methods when they are applied on multiple datasets of varied size and structure from different application domains. We achieve up to 11x reduction in memory usage and up to 7x improvement in performance. More importantly, we enable the application of large tensor decompositions on some important datasets on a multicore system that would not have been feasible without our optimization.


Discovering Deep Patterns In Large Scale Network Flows Using Tensor Decompositions
James Ezick, Muthu Baskaran, David BrunsSmith, Alan Commike, Thomas Henretty, M. Harper Langston, Jordi RosGiralt, Richard Lethin
FloCon 2017, San Diego, CA
We present an approach to a cyber security workflow based on ENSIGN, a highperformance implementation of tensor decomposition algorithms that enable the unsupervised discovery of subtle undercurrents and deep, crossdimensional correlations within multidimensional data. This new approach of starting from identified patterns in the data complements traditional workflows that focus on highlighting individual suspicious activities. This enhanced workflow assists in identifying attackers who craft their actions to
subvert signaturebased detection methods and automates much of the labor intensive forensic process of connecting isolated incidents into a coherent attack profile.

 Adaptive
Inhomogeneous PDE Solvers for Complex Geometries
H. Langston, L. Greengard, and D. Zorin
In Preparation, 2016


A sparse multidimensional FFT for real positive vectors
PierreDavid Letourneau, M. Harper Langston, Benoit Meister, Richard Lethin
In Submission, http://arxiv.org/abs/1604.06682, 2016
We present a sparse multidimensional FFT randomized algorithm (sMFFT) for positive real vectors.
The algorithm works in any fixed dimension, requires an
(almost)optimal number of samples (O(Rlog(NR))) and runs in O(Rlog(R)log(NR))
complexity (where N is the size of the vector and R the number of nonzeros).
It is stable to noise and exhibits an exponentially small probability of failure.

 Accelerated Lowrank Updates to Tensor Decompositions
Muthu Baskaran, M. Harper Langston, Tahina Ramananandro, David BrunsSmith,
Tom Henretty, James Ezick, Richard Lethin
IEEE Conference on High Performance Extreme Computing (HPEC '16), 2016
Tensor analysis (through tensor decompositions) is increasingly becoming popular as a
powerful technique for enabling comprehensive and complete analysis of realworld data.
In many critical modern applications, largescale tensor data arrives continuously (in streams)
or changes dynamically over time. Tensor decompositions over static snapshots of tensor data become
prohibitively expensive due to space and computational bottlenecks, and severely limit the use of
tensor analysis in applications that require quick response. Effective and rapid streaming
(or nonstationary) tensor decompositions are critical for enabling largescale realtime analysis.
We present new algorithms for streaming tensor decompositions that effectively use the lowrank structure
of data updates to dynamically and rapidly perform tensor decompositions of continuously evolving data.
Our contributions presented here are integral for enabling tensor decompositions to become a viable analysis
tool for largescale timecritical applications. Further, we present our newlyimplemented parallelized
versions of these algorithms, which will enable more effective deployment of these algorithms in realworld
applications. We present the effectiveness of our approach in terms of faster execution of streaming
tensor decompositions that directly translate to short response time during analysis.

 A Sparse MultiDimensional Fast Fourier Transform with Stability to Noise in the Context of Image Processing and Change Detection
PierreDavid Letourneau, M. Harper Langston, Richard Lethin
IEEE Conference on High Performance Extreme Computing (HPEC '16), 2016
We present the sparse multidimensional FFT (sMFFT) for positive real vectors with application to image processing. Our algorithm works in any fixed dimension, requires an (almost)  optimal number of samples
O(Rlog(N/R)) and runs in O(Rlog(N/R)) complexity (to first order) for N unknowns and R nonzeros. It is stable to noise and exhibits an exponentially small probability of failure. Numerical results show sMFFT’s large quantitative and qualitative strengths as compared to
l1minimization for Compressive Sensing as well as advantages in the context of image processing and change detection.


An Interactive Visual Tool for Code Optimization and Parallelization Based on the
Polyhedral Model
Eric Papenhausen, Klaus Mueller, M. Harper Langston, Benoit Meister, Richard
Lethin
Sixth International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2016),
Held in conjunction with ICPP2016, the 45rd International Conference on Parallel Processing,
August 1619, 2016, Philadelphia, PA
Writing high performance software requires the programmer to take advantage of multicore processing. This can be done through tools like OpenMP; which allow the programmer to mark parallel loops. Identifying parallelizable loops, however, is a nontrivial task. Furthermore, transformations can be applied to a loop nest to expose parallelism. Polyhedral compilation has become an increasingly popular technique for exposing parallelism in computationally intensive loop nests. These techniques can simultaneously optimize for a number of performance parameters (i.e. parallelism, locality, etc). This is typically done using a cost model designed to give good performance in the general case. For some problems, the compiler may miss optimization opportunities or even produce a transformation that leads to worse performance. In these cases, the user has little recourse; since there are few options for the user to affect the transformation decisions. In this paper we present PUMAV, a visualization interface that helps the user understand and affect the transformations made by an optimizing compiler based on the polyhedral model. This tool visualizes performance heuristics and runtime performance statistics to help the user identify missed optimization opportunities. Changes to the transformed code can be made by directly manipulating the visualizations. We show an example where performance is greatly improved over the polyhedral model alone by using our tool.


Optimization of the Domain Dslash Operator for Intel Xeon CPUs
Meifeng Lin, Eric Papenhausen, M. Harper Langston, Benoit Meister, Muthu
Baskaran, Chulwoo Jung, Taku Izubuchi
34th International Symposium on Lattice Field Theory, 2016
In Lattice QCD (LQCD) simulations, the most computation intensive part is the inversion of the fermion Dirac matrix, M. The recurring component of the matrix inversions is the application of the Dirac matrix on a fermion vector. For Wilson fermions, the Dirac matrix can be written as M = 1 − κD, up to a normalization factor, where κ is the hopping parameter, and D is the derivative part of the fermion matrix, the Dslash operator. The matrixvector multiplication in LQCD essentially reduces to the application of the Dslash operator on a fermion vector. The motivations for this work are to see if sourcetosource code generators can produce reasonably performant code if
only given a naive implementation of the Dslash operator as an input;
to investigate optimization strategies in terms of SIMD vectorization, OpenMP
multithreading and multinode scaling with MPI. To produce efficient Dslash code, optimizations in terms of data layout, SIMD, OpenMP scaling and internode communications have been studied.
By vectorizing and changing the memory access pattern, we obtained 34% peak singlecore performance in single precision. On single node, OpenMP scaling deteriorates after 16 threads. Multinode strong scaling is limited by the communication cost.


Polyhedral User Mapping and Assistant Visualizer
Tool for the RStream AutoParallelizing Compiler
Eric Papenhausen, Bing Wang, Harper Langston, Muthu Baskaran, Tom Henretty, Taku Izubuchi, Ann Johnson, Chulwoo Jung, Meifeng Lin, Benoit Meister, Klaus Mueller, Richard Lethin
IEEE 3rd annual Working Conference on Software Visualization (IEEE VISSOFT),Bremen, Germany, IEEE, 2015
Existing highlevel, sourcetosource compilers can
accept input programs in a highlevel language (e.g.,
C
) and
perform complex automatic parallelization and other mappings
using various optimizations. These optimizations often require
tradeoffs and can benefit from the user’s involvement in the
process. However, because of the inherent complexity, the barrier
to entry for new users of these highlevel optimizing compilers can
often be high. We propose visualization as an effective gateway
for nonexpert users to gain insight into the effects of parameter
choices and so aid them in the selection of levels best suited to
their specific optimization goals.
A popular optimization paradigm is polyhedral mapping
which achieves optimization by loop transformations. We have
augmented a commercial polyhedralmodel sourcetosource
compiler (RStream) with an interactive visual tool we call the
Polyhedral User Mapping and Assistant Visualizer (PUMAV).
PUMAV is tightly integrated with the RStream sourcetosource
compiler and allows users to explore the effects of difficult
mappings and express their goals to optimize tradeoffs. It
implements advanced multivariate visualization paradigms such
as parallel coordinates and correlation graphs and applies them
in the novel setting of compiler optimizations.
We believe that our tool allows programmers to better
understand complex program transformations and deviations
of mapping properties on well understood programs. This in
turn will achieve experience and performance portability across
programs architectures as well as expose new communities in the
computational sciences to the rich features of autoparallelizing
highlevel sourcetosource compilers.


Optimizing the domain wall fermion Dirac operator
using the RStream sourcetosource compiler
Meifeng Lin, Eric Papenhausen, M Harper Langston, Benoit Meister, Muthu Baskaran, Taku Izubuchi, Chulwoo Jung
33rd International Symposium on Lattice Field Theory, Kobe International Conference Center, Kobe, Japan, 14 18 July 2015
The application of the Dirac operator on a spinor field, the Dslash operation, is the most
computationintensive part of the lattice QCD simulations. It is often the key kernel to optimize
to achieve maximum performance on various platforms. Here we report on a project to optimize
the domain wall fermion Dirac operator in Columbia Physics System (CPS) using the RStream
sourcetosource compiler. Our initial target platform is the Intel PC clusters. We discuss the
optimization strategies involved before and after the automatic code generation with RStream
and present some preliminary benchmark results.


ReIntroduction of CommunicationAvoiding FMMAccelerated FFTs with GPU Acceleration
M. Harper Langston, Muthu Baskaran, Benoıt Meister, Nicolas Vasilache and Richard Lethin
IEEE Conference on High Performance Extreme Computing (HPEC '13)
As distributed memory systems grow larger, communication demands have increased. Unfortunately, while the costs of arithmetic operations continue to decrease rapidly, communication costs have not. As a result, there has been a growing interest in communicationavoiding algorithms for some of the classic problems in numerical computing, including communicationavoiding Fast Fourier Transforms (FFTs). A previouslydeveloped lowcommunication FFT, however, has remained largely out of the picture, partially due to its reliance on the Fast Multipole Method (FMM), an algorithm that typically aids in accelerating dense computations. We have begun an algorithmic investigation and reimplementation design for the FMMaccelerated FFT, which exploits the ability to tune precision of the result (due to the mathematical nature of the FMM) to reduce powerburning communication and computation, the potential benefit of which is to reduce the energy required for the fundamental transform of digital signal processing. We reintroduce this algorithm as well as discuss new innovations for separating the distinct portions of the FMM into a CPUdedicated process, relying on interprocessor communication for approximate interactions, and a GPUdedicated process for dense interactions with no communication.


A Tale of Three Runtimes
Nicolas Vasilache, Muthu Baskaran, Tom Henretty, Benoit Meister,
M. Harper Langston, Sanket Tavarageri, Richard Lethin
In Submission, 2013
This contribution discusses the automatic generation of eventdriven, tuplespace based programs for taskoriented execution models from a sequential C specification. We developed a hierarchical mapping solution using autoparallelizing compiler technology to target three different runtimes relying on eventdriven tasks (EDTs). Our solution benefits from the important observation that loop types encode short, transitive relations among EDTs that are compact and efficiently evaluated at runtime. In this context, permutable loops are of particular importance as they translate immediately into conservative pointtopoint synchronizations of distance 1. Our solution generates calls into a runtimeagnostic C++ layer, which we have retargeted to Intel's Concurrent Collections (CnC), ETI's SWARM, and the Open Community Runtime (OCR). Experience with other runtime systems motivates our introduction of support for hierarchical asyncfinishes in CnC. Experimental data is provided to show the benefit of automatically generated code for EDTbased runtimes as well as comparisons across runtimes.

 A
massively parallel adaptive fastmultipole method on
heterogeneous architectures
I. Lashuk, A. Chandramowlishwaran, H. Langston,
T.A. Nguyen, R. S. Sampath, A. Shringarpure,
R. W. Vuduc, L. Ying, D. Zorin, and G. Biros
Communications of the ACM, vol. 55, 5, 2012, pp. 101109
We describe a parallel fast multipole method (FMM) for
highly nonuniform distributions of particles. We employ both distributed
memory parallelism (via MPI) and shared memory parallelism (via OpenMP
and GPU acceleration) to rapidly evaluate twobody nonoscillatory potentials
in three dimensions on heterogeneous high performance computing architectures.
We have performed scalability tests with up to 30 billion particles on 196,608
cores on the AMD/CRAYbased Jaguar system at ORNL. On a GPUenabled system
(NSF's Keeneland at Georgia Tech/ORNL), we observed 30x speedup over a single
core CPU and 7x speedup over a multicore CPU implementation. By combining GPUs
with MPI, we achieve less than 10 ns/particle and six digits of accuracy for a
run with 48 million nonuniformly distributed particles on 192 GPUs.
 
A FreeSpace Adaptive FMMBased PDE Solver in
Three Dimensions
H. Langston, L. Greengard, and D. Zorin
Communications in Applied Mathematics and
Computational Science, vol. 6, no. 1, 2011, pp. 79–122
We present a kernelindependent, adaptive fast
multipole method (FMM) of arbi trary order
accuracy for solving elliptic PDEs in three
dimensions with radiation and periodic boundary
conditions. The algorithm requires only the
ability to evaluate the Green's function for the
governing equation and a representation of the
source distribution (the righthand side) that
can be evaluated at arbitrary points. The
performance is accelerated in three ways. First,
we construct a piecewise polynomial
approximation of the righthand side and compute
farfield expansions in the FMM from the
coefficients of this approximation. Second, we
precompute tables of quadratures to handle the
nearfield interactions on adaptive octree data
structures, keeping the total storage
requirements in check through the exploitation
of symmetries. Third, we employ sharedmemory
parallelization methods and loadbalancing
techniques to accelerate the major algorithmic
loops of the FMM. We present numerical examples
for the Laplace, modified Helmholtz and Stokes equations.
  A
massively parallel adaptive fastmultipole method on
heterogeneous architectures
I. Lashuk, A. Chandramowlishwaran, H. Langston,
T.A. Nguyen, R. S. Sampath, A. Shringarpure,
R. W. Vuduc, L. Ying, D. Zorin, and G. Biros
Proceedings of SC09 ACM/ICEE SCXY Conference Series,
pp. 111, 2009, Best Paper Finalist
We present new scalable algorithms and a new
implementation of our kernelindependent fast
multipole method (Ying et al. ACM/IEEE SC '03),
in which we employ both distributed memory
parallelism (via MPI) and shared
memory/streaming parallelism (via GPU
acceleration) to rapidly evaluate twobody
nonoscillatory potentials. On traditional
CPUonly systems, our implementation scales well
up to 30 billion unknowns on 65K cores
(AMD/CRAYbased Kraken system at NSF/NICS) for
highly nonuniform point distributions. We
achieve scalability to such extreme core counts
by adopting a new approach to scalable MPIbased
tree construction and partitioning, and a new
reduction algorithm for the evaluation
phase. Taken together, these components show
promise for ultrascalable FMM in the petascale
era and beyond.
  Cash:
Distributed Cooperative Buffer Caching
C. Decoro, H. Langston, J. Weinberger
Technical Report, 2004
Modern servers pay a heavy price in block access
time on diskbound workloads when the working set
is greater than the size of the local buffer
cache. We provide a mechanism for cooperating
servers to coordinate and share their local
buffer caches. The coordinated buffer cache can
handle working sets on the order of the
aggregate cache memory, greatly imptoving
performance on diskbound workloads. This
facility is provided with minimal communication
overhead, no penalty for local cache hits, and
without any explicit kernel support.
  A
New Parallel KernelIndependent Fast Multipole
Method
L. Ying, G. Biros, H. Langston, and
D. Zorin
Proceedings of SC03 ACM/ICEE SCXY Conference Series,
pp. 111, 2003, Best Student Paper, Gordon Bell
prize finalist
We present a new adaptive
fast multipole algorithm and its parallel
implementation. The algorithm is
kernelindependent in the sense that the
evaluation of pairwise interactions does not rely
on any analytic expansions, but only utilizes
kernel evaluations. The new method provides the
enabling technology for many important problems in
computational science and engineering. Examples
include viscous flows, fracture mechanics and
screened Coulombic interactions. Our MPIbased
parallel implementation logically separates the
computation and communication phases to avoid
synchronization in the upward and downward
computation passes, and thus allows us to fully
exploit computation and communication
overlapping. We measure isogranular and fixedsize
scalability for a variety of kernels on the
Pittsburgh Supercomputing Center's TCS1
Alphaserver on up to 3000 processors. We have
solved viscous flow problems with up to 2.1
billion unknowns and we have achieved 1.6 Tflops/s
peak performance and 1.13 Tflops/s sustained performance.

Research Projects and Codes
3D KernelIndependent FMMBased FreeSpace and Periodic Volume Solver
This volume code will be posted soon!

Brain Tumor and Heart Image Registration Experiments
Images, movies and results of work with Rahul
Sampath and George Biros. Being processed and will
be posted soon.

Kernel Independent 3D Fast Multipole Method
This is the original KIFMM3d code
of Lexing
Ying, for
which documentation is provided here.

Fluid
Dynamics Visualization
Scientific visualization performed with George
Biros and Lexing Ying,
using their embedded integral solver for the Stokes
equations. Full videos and pictures for some
results as appeared at SC03.

Line Integral
Convolution (LIC) code
Working with Denis Zorin, George Biros and Lexing Ying to
visualize 2d fluid dynamics simulations, specifically Navier Stokes
problems in both steady and unsteadystates.
Resulting images include 2d unsteady cavity problems
for different Reynold's numbers. Code is attached
as well as some movies formed from static images.

Vortex Methods
Code and images/movies for simulating fastmoving
fluids areound abjects in two and three
dimensions. Code currently being processed for
distribution and will be posted soon.

Fast Poisson Solver in 2D
Fast Fourier Transform based Poisson equation
solver in a regular two dimensional domain with
inhomogeneous Dirichlet boundary conditions. All
code in Matlab.

HotShell
A customizable Unix shell, designed to sit on top
of an existing shell with augmented commands in
Perl and Korn shell scripting languages.

FileDerived Motion Machine
A motion machine for animating an articulated
figure, in this case a human figure, derived with
simple cubes. Examples for specific textfile
inputs and JAVA code attached.

MotionCapture Experiments
Simple motion capture and image processing
experiments and projects.

