Location: Minnesota, United States
Qualifications: B.Sc, Ph.D, Computer Science
Work Authorization: USA, Canada, EU
Research interests: vectorization, math optimizations, compilers, computer architecture


2020 - current


Senior Compiler Optimization Expert

• NV CPU Compiler Group
• Fortran, C, C++

2011 - 2020

Cray, Inc.

Senior Compiler Engineer

• Cray Compiler Optimization Group
• Fortran, C, C++


2007 - 2011

Trinity College Dublin

Ph.D, Computer Science

• Supervisor: David Gregg
• Research funding by: IRCSET and Intel Ireland

2003 - 2007

Lake Superior State University

B.Sc, Computer Science; minor, Mathematics

• Summa Cum Laude (major), Magna Cum Laude (overall)
• Special distinction, top Computer Science graduate 2007


Program Generation for Intel AES New Instructions

Doctoral Thesis, Trinity College Dublin, Ireland

Defended: March 03, 2011

High-performance primitive libraries are used to replace parts of sub-optimal code with optimized implementations. These libraries often come in the form of highly-optimized assembly routines, which raises several issues. Small changes to assembly routines can require significant rewrites. New versions of microarchitectures will often require changes in the assembly to keep code both efficient and functional. Maintaining multiple versions of the same basic piece of assembly code is a costly software engineering problem. One approach to solving this problem is using a program generator.
        AES-NI is an instruction-set extension on Intel processors that implements a full round of AES encryption in a single instruction. Existing libraries use hand-tuned assembly language to overlap the execution of multiple AES instructions to extract maximum performance. In this dissertation, we argue that using a program generator is a suitable substitute for writing highly-optimized assembly routines that use AES-NI. We present a program generation system that seamlessly integrates high-level algorithmic choices with scheduling strategies that exploit instruction-level parallelism.
        This program generation system returns AES implementations that achieve near-optimal performance. We also show that the generator is dynamic enough to take exploratory approaches when optimizing code. As a result, this dissertation also contributes two novel encryption modifications. For CTR mode, we present a "mixed-mode" operation that combines traditional table lookup optimizations with AES-NI instructions. In cyclic modes, such as CBC, we show how manipulating the xor instructions can shorten the chain of dependent operations. These optimized implementations are found using an adapted simulated annealing algorithm. We show these implementations can achieve similar or superior cycles-per-byte times compared to the high-performance library versions provided by Intel. The end result is a program generation technique that could potentially be adapted to optimize other algorithms that rely on instruction-set extensions.

• Academic Supervisor: David Gregg
• Examiners: Jeremy Singer and John Waldron
pdf (via TCD Research Archive)

A Program Generator for Intel AES-NI Instructions

Raymond Manley and David Gregg

Indocrypt 2010, Hyderabad, India

December 2010

Recent Intel processors provide hardware instructions that implement a full AES round in a single instruction. Existing libraries use hand-tuned assembly language to overlap the execution of multiple AES instructions and extract maximum performance. We present a program generator that creates optimized AES code automatically from a simple, annotated C version of the code. We show how this generator can be used to rapidly create highly optimized versions of several AES modes. The generated code performs on par with, or up to 7% faster than, the hand-tuned assembly libraries from Intel.

• please contact for pdf copy

Code Generation for Hardware Accelerated AES

Raymond Manley, Paul Magrath and David Gregg

ASAP 2010, Rennes, France

July 2010

Data must be encrypted if it is to remain confidential when sent over computer networks. Encryption solves many problems involving invasion of privacy, identity theft, fraud, and data theft. However, for encryption to be widely used, it must be fast. The problem is so important that new Intel processors provide hardware support for encryption. These instructions implement key stages of the Advanced Encryption Standard (AES), allowing encryption to be completed more quickly and using less power. The AES algorithm consists of several rounds of encryption, each of which involves a relatively complicated computation. This new hardware support allows an entire round to be implemented with just a single instruction.
        An implementation of the AES algorithm using these instructions contains several code sections that can be fine-tuned for optimal performance. However, these optimizations are usually done by hand, which can be a lengthy, labour-intensive process. We present a system that can generate billions of variants of the AES encryption code to find the best solution for a particular microarchitecture. We apply both common loop optimizations and ones specific to AES. We evaluate the generated code on hardware with built-in AES support using both selective brute-force and guided searches. Our generator achieves significant speedups over a straightforward implementation of the code.

IEEE Digital Library
• please contact for pdf copy

Mapping Streaming Languages to General Purpose Processors through Vectorization

Raymond Manley and David Gregg

Languages and Compilers for Parallel Computing 2009, Newark, DE, USA

October 2009

Streaming languages were originally aimed at streaming architectures, but recent work has shown the stream programming model to be useful in exploiting parallelism on general purpose processors. Current research in mapping stream code onto GPPs deals with load balancing and generating threads based on hardware features. We look into addressing problems associated with stream data locality and stream data parallelism on GPPs. We suggest that automatically generating vectorized code for these streaming operations is a potential solution. We use the Brook stream language as our syntax base and augment it to generate vector intrinsics targeting the x86 architecture. This compiler uses both existing and new strategies to transform high-level streaming kernel code into vector instructions without requiring additional annotations. We compare our system's results to existing mapping strategies aimed at using stream code on GPPs. When evaluating performance, we see a wide range of speedups, from a few percent to over 2x, and discuss the level of effectiveness of using vector code over scalar equivalents in specific application domains.

• please contact for pdf copy

About me

I was born in the United States, grew up in Canada, and eventually found myself in Ireland for several years before settling in Minneapolis, Minnesota. My hobbies include both playing and listening to many musical genres, woodworking, gardening, cooking, and cycling. For more information, please contact me via email or LinkedIn.