phone: E-mail for info
e-mail: Click here
CV: M. Harper Langston


PhD in Computer Science: NYU Courant Institute of Mathematical Sciences
Dissertation Title: An Adaptive Fast Volume Solver in Three Dimensions for Free-Space, Periodic and Dirichlet Boundary Conditions
Honors: McCracken Fellowship
Advisors Denis Zorin and Leslie Greengard.
MS in Scientific Computing: NYU Courant Institute of Mathematical Sciences
BA in Mathematics and Russian: Bowdoin College
Research Topic: Properties of Non-Unique Factorization in Quadratic Fields
Honors: Smyth Prize in Mathematics, Magna Cum Laude, James Bowdoin Scholor
Advisors: Wells Johnson and Adam Levy.

Publications and Reports

PUMA-V: Optimizing Parallel Code Performance Through Interactive Visualization
Eric Papenhause, M. Harper Langston, Benoit Meister, Richard Lethin, Klaus Mueller
2018 IEEE Computer Graphics and Applications [Feature Article]
Performance optimization for parallel, loop-oriented programs compromises between parallelism and locality. We present a visualization interface which allows programmers to assist the compiler in generating optimal code. It greatly improves the user’s understanding of the transformations that took place and aids in making additional transformations in a visually intuitive way.
Topic Modeling for Analysis of Big Data Tensor Decompositions
Thomas Henretty, M. Harper Langston, Muthu Baskaran, James Ezick, Richard Lethin
SPIE Proceedings Volume 10652, Disruptive Technologies in Information Sciences; 1065208 (2018), doi: 10.1117/12.2306933
Tensor decompositions are a class of algorithms used for unsupervised pattern discovery. Structured, multidimensional datasets are encoded as tensors and decomposed into discrete, coherent patterns captured as weighted collections of high-dimensional vectors known as components. Tensor decompositions have recently shown promising results when addressing problems related to data comprehension and anomaly discovery in cybersecurity and intelligence analysis. However, analysis of Big Data tensor decompositions is currently a critical bottleneck owing to the volume and variety of unlabeled patterns that are produced. We present an approach to automated component clustering and classification based on the Latent Dirichlet Allocation (LDA) topic modeling technique and show example applications to representative cybersecurity and geospatial datasets.v
Adaptive Inhomogeneous PDE Solvers for Complex Geometries
H. Langston, L. Greengard, and D. Zorin
In Preparation, 2016
Memory-efficient Parallel Tensor Decompositions
Muthu Baskaran, Tom Henretty, Benoit Pradelle, M. Harper Langston, David Bruns-Smith, James Ezick, Richard Lethin
2017 IEEE High Performance Extreme Computing Conference (HPEC '17), Waltham, MA, USA. [Best Paper Award Winner]
Tensor decompositions are a powerful technique for enabling comprehensive and complete analysis of real-world data. Data analysis through tensor decompositions involves intensive computations over large-scale irregular sparse data. Optimizing the execution of such data intensive computations is key to reducing the time-to-solution (or response time) in real-world data analysis applications. As high-performance computing (HPC) systems are increasingly used for data analysis applications, it is becoming increasingly important to optimize sparse tensor computations and execute them efficiently on modern and advanced HPC systems. In addition to utilizing the large processing capability of HPC systems, it is crucial to improve memory performance (memory usage, communication, synchronization, memory reuse, and data locality) in HPC systems.
In this paper, we present multiple optimizations that are targeted towards faster and memory-efficient execution of large-scale tensor analysis on HPC systems. We demonstrate that our techniques achieve reduction in memory usage and execution time of tensor decomposition methods when they are applied on multiple datasets of varied size and structure from different application domains. We achieve up to 11x reduction in memory usage and up to 7x improvement in performance. More importantly, we enable the application of large tensor decompositions on some important datasets on a multi-core system that would not have been feasible without our optimization.
Discovering Deep Patterns In Large Scale Network Flows Using Tensor Decompositions
James Ezick, Muthu Baskaran, David Bruns-Smith, Alan Commike, Thomas Henretty, M. Harper Langston, Jordi Ros-Giralt, Richard Lethin
FloCon 2017, San Diego, CA
We present an approach to a cyber security workflow based on ENSIGN, a high-performance implementation of tensor decomposition algorithms that enable the unsupervised discovery of subtle undercurrents and deep, cross-dimensional correlations within multi-dimensional data. This new approach of starting from identified patterns in the data complements traditional workflows that focus on highlighting individual suspicious activities. This enhanced workflow assists in identifying attackers who craft their actions to subvert signature-based detection methods and automates much of the labor intensive forensic process of connecting isolated incidents into a coherent attack profile.
Adaptive Inhomogeneous PDE Solvers for Complex Geometries
H. Langston, L. Greengard, and D. Zorin
In Preparation, 2016
A sparse multidimensional FFT for real positive vectors
Pierre-David Letourneau, M. Harper Langston, Benoit Meister, Richard Lethin
In Submission,, 2016
We present a sparse multidimensional FFT randomized algorithm (sMFFT) for positive real vectors. The algorithm works in any fixed dimension, requires an (almost)-optimal number of samples (O(Rlog(NR))) and runs in O(Rlog(R)log(NR)) complexity (where N is the size of the vector and R the number of nonzeros). It is stable to noise and exhibits an exponentially small probability of failure.
Accelerated Low-rank Updates to Tensor Decompositions
Muthu Baskaran, M. Harper Langston, Tahina Ramananandro, David Bruns-Smith, Tom Henretty, James Ezick, Richard Lethin
IEEE Conference on High Performance Extreme Computing (HPEC '16), 2016
Tensor analysis (through tensor decompositions) is increasingly becoming popular as a powerful technique for enabling comprehensive and complete analysis of real-world data. In many critical modern applications, large-scale tensor data arrives continuously (in streams) or changes dynamically over time. Tensor decompositions over static snapshots of tensor data become prohibitively expensive due to space and computational bottlenecks, and severely limit the use of tensor analysis in applications that require quick response. Effective and rapid streaming (or non-stationary) tensor decompositions are critical for enabling large-scale real-time analysis. We present new algorithms for streaming tensor decomposi-tions that effectively use the low-rank structure of data updates to dynamically and rapidly perform tensor decompositions of continuously evolving data. Our contributions presented here are integral for enabling tensor decompositions to become a viable analysis tool for large-scale time-critical applications. Further, we present our newly-implemented parallelized versions of these algorithms, which will enable more effective deployment of these algorithms in real-world applications. We present the effectiveness of our approach in terms of faster execution of streaming tensor decompositions that directly translate to short response time during analysis.
A Sparse Multi-Dimensional Fast Fourier Transform with Stability to Noise in the Context of Image Processing and Change Detection
Pierre-David Letourneau, M. Harper Langston, Richard Lethin
IEEE Conference on High Performance Extreme Computing (HPEC '16), 2016
We present the sparse multidimensional FFT (sMFFT) for positive real vectors with application to image processing. Our algorithm works in any fixed dimension, requires an (almost) - optimal number of samples O(Rlog(N/R)) and runs in O(Rlog(N/R)) complexity (to first order) for N unknowns and R nonzeros. It is stable to noise and exhibits an exponentially small probability of failure. Numerical results show sMFFT’s large quantitative and qualitative strengths as compared to l1-minimization for Compressive Sensing as well as advantages in the context of image processing and change detection.
An Interactive Visual Tool for Code Optimization and Parallelization Based on the Polyhedral Model
Eric Papenhausen, Klaus Mueller, M. Harper Langston, Benoit Meister, Richard Lethin
Sixth International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2016), Held in conjunction with ICPP-2016, the 45rd International Conference on Parallel Processing, August 16-19, 2016, Philadelphia, PA
Writing high performance software requires the programmer to take advantage of multi-core processing. This can be done through tools like OpenMP; which allow the programmer to mark parallel loops. Identifying parallelizable loops, however, is a non-trivial task. Furthermore, transformations can be applied to a loop nest to expose parallelism. Polyhedral compilation has become an increasingly popular technique for exposing parallelism in computationally intensive loop nests. These techniques can simultaneously optimize for a number of performance parameters (i.e. parallelism, locality, etc). This is typically done using a cost model designed to give good performance in the general case. For some problems, the compiler may miss optimization opportunities or even produce a transformation that leads to worse performance. In these cases, the user has little recourse; since there are few options for the user to affect the transformation decisions. In this paper we present PUMA-V, a visualization interface that helps the user understand and affect the transformations made by an optimizing compiler based on the polyhedral model. This tool visualizes performance heuristics and runtime performance statistics to help the user identify missed optimization opportunities. Changes to the transformed code can be made by directly manipulating the visualizations. We show an example where performance is greatly improved over the polyhedral model alone by using our tool.
Optimization of the Domain Dslash Operator for Intel Xeon CPUs
Meifeng Lin, Eric Papenhausen, M. Harper Langston, Benoit Meister, Muthu Baskaran, Chulwoo Jung, Taku Izubuchi
34th International Symposium on Lattice Field Theory, 2016
In Lattice QCD (LQCD) simulations, the most computation intensive part is the inversion of the fermion Dirac matrix, M. The recurring component of the matrix inversions is the application of the Dirac matrix on a fermion vector. For Wilson fermions, the Dirac matrix can be written as M = 1 − κD, up to a normalization factor, where κ is the hopping parameter, and D is the derivative part of the fermion matrix, the Dslash operator. The matrix-vector multiplication in LQCD essentially reduces to the application of the Dslash operator on a fermion vector. The motivations for this work are to see if source-to-source code generators can produce reasonably performant code if only given a naive implementation of the Dslash operator as an input; to investigate optimization strategies in terms of SIMD vectorization, OpenMP multithreading and multinode scaling with MPI. To produce efficient Dslash code, optimizations in terms of data layout, SIMD, OpenMP scaling and internode communications have been studied. By vectorizing and changing the memory access pattern, we obtained 34% peak single-core performance in single precision. On single node, OpenMP scaling deteriorates after 16 threads. Multinode strong scaling is limited by the communication cost.
Polyhedral User Mapping and Assistant Visualizer Tool for the R-Stream Auto-Parallelizing Compiler
Eric Papenhausen, Bing Wang, Harper Langston, Muthu Baskaran, Tom Henretty, Taku Izubuchi, Ann Johnson, Chulwoo Jung, Meifeng Lin, Benoit Meister, Klaus Mueller, Richard Lethin
IEEE 3rd annual Working Conference on Software Visualization (IEEE VISSOFT),Bremen, Germany, IEEE, 2015
Existing high-level, source-to-source compilers can accept input programs in a high-level language (e.g., C ) and perform complex automatic parallelization and other mappings using various optimizations. These optimizations often require trade-offs and can benefit from the user’s involvement in the process. However, because of the inherent complexity, the barrier to entry for new users of these high-level optimizing compilers can often be high. We propose visualization as an effective gateway for non-expert users to gain insight into the effects of parameter choices and so aid them in the selection of levels best suited to their specific optimization goals. A popular optimization paradigm is polyhedral mapping which achieves optimization by loop transformations. We have augmented a commercial polyhedral-model source-to-source compiler (R-Stream) with an interactive visual tool we call the Polyhedral User Mapping and Assistant Visualizer (PUMA-V). PUMA-V is tightly integrated with the R-Stream source-to-source compiler and allows users to explore the effects of difficult mappings and express their goals to optimize trade-offs. It implements advanced multivariate visualization paradigms such as parallel coordinates and correlation graphs and applies them in the novel setting of compiler optimizations. We believe that our tool allows programmers to better understand complex program transformations and deviations of mapping properties on well understood programs. This in turn will achieve experience and performance portability across programs architectures as well as expose new communities in the computational sciences to the rich features of auto-parallelizing high-level source-to-source compilers.
Optimizing the domain wall fermion Dirac operator using the R-Stream source-to-source compiler
Meifeng Lin, Eric Papenhausen, M Harper Langston, Benoit Meister, Muthu Baskaran, Taku Izubuchi, Chulwoo Jung
33rd International Symposium on Lattice Field Theory, Kobe International Conference Center, Kobe, Japan, 14 -18 July 2015
The application of the Dirac operator on a spinor field, the Dslash operation, is the most computation-intensive part of the lattice QCD simulations. It is often the key kernel to optimize to achieve maximum performance on various platforms. Here we report on a project to optimize the domain wall fermion Dirac operator in Columbia Physics System (CPS) using the R-Stream source-to-source compiler. Our initial target platform is the Intel PC clusters. We discuss the optimization strategies involved before and after the automatic code generation with R-Stream and present some preliminary benchmark results.
Re-Introduction of Communication-Avoiding FMM-Accelerated FFTs with GPU Acceleration
M. Harper Langston, Muthu Baskaran, Benoıt Meister, Nicolas Vasilache and Richard Lethin
IEEE Conference on High Performance Extreme Computing (HPEC '13)
As distributed memory systems grow larger, communication demands have increased. Unfortunately, while the costs of arithmetic operations continue to decrease rapidly, communication costs have not. As a result, there has been a growing interest in communication-avoiding algorithms for some of the classic problems in numerical computing, including communication-avoiding Fast Fourier Transforms (FFTs). A previously-developed low-communication FFT, however, has remained largely out of the picture, partially due to its reliance on the Fast Multipole Method (FMM), an algorithm that typically aids in accelerating dense computations. We have begun an algorithmic investigation and re-implementation design for the FMM-accelerated FFT, which exploits the ability to tune precision of the result (due to the mathematical nature of the FMM) to reduce power-burning communication and computation, the potential benefit of which is to reduce the energy required for the fundamental transform of digital signal processing. We reintroduce this algorithm as well as discuss new innovations for separating the distinct portions of the FMM into a CPU-dedicated process, relying on inter-processor communication for approximate interactions, and a GPU-dedicated process for dense interactions with no communication.
A Tale of Three Runtimes
Nicolas Vasilache, Muthu Baskaran, Tom Henretty, Benoit Meister, M. Harper Langston, Sanket Tavarageri, Richard Lethin
In Submission, 2013
This contribution discusses the automatic generation of event-driven, tuple-space based programs for task-oriented execution models from a sequential C specification. We developed a hierarchical mapping solution using auto-parallelizing compiler technology to target three different runtimes relying on event-driven tasks (EDTs). Our solution benefits from the important observation that loop types encode short, transitive relations among EDTs that are compact and efficiently evaluated at runtime. In this context, permutable loops are of particular importance as they translate immediately into conservative point-to-point synchronizations of distance 1. Our solution generates calls into a runtime-agnostic C++ layer, which we have retargeted to Intel's Concurrent Collections (CnC), ETI's SWARM, and the Open Community Runtime (OCR). Experience with other runtime systems motivates our introduction of support for hierarchical async-finishes in CnC. Experimental data is provided to show the benefit of automatically generated code for EDT-based runtimes as well as comparisons across runtimes.
A massively parallel adaptive fast-multipole method on heterogeneous architectures
I. Lashuk, A. Chandramowlishwaran, H. Langston, T.-A. Nguyen, R. S. Sampath, A. Shringarpure, R. W. Vuduc, L. Ying, D. Zorin, and G. Biros
Communications of the ACM, vol. 55, 5, 2012, pp. 101-109
We describe a parallel fast multipole method (FMM) for highly nonuniform distributions of particles. We employ both distributed memory parallelism (via MPI) and shared memory parallelism (via OpenMP and GPU acceleration) to rapidly evaluate two-body nonoscillatory potentials in three dimensions on heterogeneous high performance computing architectures. We have performed scalability tests with up to 30 billion particles on 196,608 cores on the AMD/CRAY-based Jaguar system at ORNL. On a GPU-enabled system (NSF's Keeneland at Georgia Tech/ORNL), we observed 30x speedup over a single core CPU and 7x speedup over a multicore CPU implementation. By combining GPUs with MPI, we achieve less than 10 ns/particle and six digits of accuracy for a run with 48 million nonuniformly distributed particles on 192 GPUs.
A Free-Space Adaptive FMM-Based PDE Solver in Three Dimensions
H. Langston, L. Greengard, and D. Zorin
Communications in Applied Mathematics and Computational Science, vol. 6, no. 1, 2011, pp. 79–122
We present a kernel-independent, adaptive fast multipole method (FMM) of arbi- trary order accuracy for solving elliptic PDEs in three dimensions with radiation and periodic boundary conditions. The algorithm requires only the ability to evaluate the Green's function for the governing equation and a representation of the source distribution (the right-hand side) that can be evaluated at arbitrary points. The performance is accelerated in three ways. First, we construct a piecewise polynomial approximation of the right-hand side and compute far-field expansions in the FMM from the coefficients of this approximation. Second, we precompute tables of quadratures to handle the near-field interactions on adaptive octree data structures, keeping the total storage requirements in check through the exploitation of symmetries. Third, we employ shared-memory parallelization methods and load-balancing techniques to accelerate the major algorithmic loops of the FMM. We present numerical examples for the Laplace, modified Helmholtz and Stokes equations.
A massively parallel adaptive fast-multipole method on heterogeneous architectures
I. Lashuk, A. Chandramowlishwaran, H. Langston, T.-A. Nguyen, R. S. Sampath, A. Shringarpure, R. W. Vuduc, L. Ying, D. Zorin, and G. Biros
Proceedings of SC09 ACM/ICEE SCXY Conference Series, pp. 1-11, 2009, Best Paper Finalist
We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAYbased Kraken system at NSF/NICS) for highly non-uniform point distributions. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
Cash: Distributed Cooperative Buffer Caching
C. Decoro, H. Langston, J. Weinberger
Technical Report, 2004
Modern servers pay a heavy price in block access time on diskbound workloads when the working set is greater than the size of the local buffer cache. We provide a mechanism for cooperating servers to coordinate and share their local buffer caches. The coordinated buffer cache can handle working sets on the order of the aggregate cache memory, greatly imptoving performance on diskbound workloads. This facility is provided with minimal communication overhead, no penalty for local cache hits, and without any explicit kernel support.
A New Parallel Kernel-Independent Fast Multipole Method
L. Ying, G. Biros, H. Langston, and D. Zorin
Proceedings of SC03 ACM/ICEE SCXY Conference Series, pp. 1-11, 2003, Best Student Paper, Gordon Bell prize finalist
We present a new adaptive fast multipole algorithm and its parallel implementation. The algorithm is kernel-independent in the sense that the evaluation of pairwise interactions does not rely on any analytic expansions, but only utilizes kernel evaluations. The new method provides the enabling technology for many important problems in computational science and engineering. Examples include viscous flows, fracture mechanics and screened Coulombic interactions. Our MPI-based parallel implementation logically separates the computation and communication phases to avoid synchronization in the upward and downward computation passes, and thus allows us to fully exploit computation and communication overlapping. We measure isogranular and fixed-size scalability for a variety of kernels on the Pittsburgh Supercomputing Center's TCS-1 Alphaserver on up to 3000 processors. We have solved viscous flow problems with up to 2.1 billion unknowns and we have achieved 1.6 Tflops/s peak performance and 1.13 Tflops/s sustained performance.

Research Projects and Codes

3D Kernel-Independent FMM-Based Free-Space and Periodic Volume Solver
This volume code will be posted soon!

Brain Tumor and Heart Image Registration Experiments
Images, movies and results of work with Rahul Sampath and George Biros. Being processed and will be posted soon.

Kernel Independent 3D Fast Multipole Method
This is the original KIFMM3d code of Lexing Ying, for which documentation is provided here.

Fluid Dynamics Visualization
Scientific visualization performed with George Biros and Lexing Ying, using their embedded integral solver for the Stokes equations. Full videos and pictures for some results as appeared at SC03.

Line Integral Convolution (LIC) code
Working with Denis Zorin, George Biros and Lexing Ying to visualize 2d fluid dynamics simulations, specifically Navier Stokes problems in both steady and unsteady-states. Resulting images include 2d unsteady cavity problems for different Reynold's numbers. Code is attached as well as some movies formed from static images.

Vortex Methods
Code and images/movies for simulating fast-moving fluids areound abjects in two and three dimensions. Code currently being processed for distribution and will be posted soon.

Fast Poisson Solver in 2D
Fast Fourier Transform based Poisson equation solver in a regular two dimensional domain with inhomogeneous Dirichlet boundary conditions. All code in Matlab.

A customizable Unix shell, designed to sit on top of an existing shell with augmented commands in Perl and Korn shell scripting languages.

File-Derived Motion Machine
A motion machine for animating an articulated figure, in this case a human figure, derived with simple cubes. Examples for specific text-file inputs and JAVA code attached.

Motion-Capture Experiments
Simple motion capture and image processing experiments and projects.