GPU and Exascale Computing

prysm is designed, in large part, so that it can scale nearly without bound and perform computations that are simply beyond the reach of other physical optics programs. For example, the official model of the Low Order Wavefront Sensor (LOWFS) for the Roman Coronagraph Instrument, built with prysm, runs about 900,000x faster than its predecessor based on PROPER.

In this notebook, we will go through how to achieve such extreme speedups.

GPUs

prysm’s support for GPUs is implicit; the library uses a shim over numpy and scipy. If a different library with the same interface but GPUs as an execution unit exists (multiple do), it can be plugged into prysm and used without issue. You will likely have the most success using Cupy, which is higher quality than Pytorch, Tensorflow, and other ML-oriented libraries: it runs faster and with fewer interface headaches.

prysm itself does not import numpy in each file, but instead performs

from .mathops import np, fft, ndimage  # ... and so on

Within mathops, the shim discussed above is defined and the np variable is overwritten. At startup, prysm.mathops.np always wraps numpy, while ndimage, special, interpolate, and fft wrap their scipy counterparts.

To use a GPU as a backend, simply perform

import cupy as cp
from cupyx.scipy import (
    fft as cpfft,
    special as cpspecial,
    interpolate as cpinterpolate,
    ndimage as cpndimage
)

from prysm.mathops import np, ndimage, interpolate, special, fft

np._srcmodule = cp
ndimage._srcmodule = cpndimage
special._srcmodule = cpspecial
interpolate._srcmodule = cpinterpolate
fft._srcmodule = cpfft

From this point on, any computation within prysm will be done on the GPU.
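
Results computed this way live in GPU memory as cupy arrays. A minimal sketch of bringing one back to the host for plotting or saving; the array here is only a stand-in for whatever a prysm computation produces:

import cupy as cp

result = cp.zeros((1024, 1024))   # stands in for any array produced by a prysm computation
host_result = cp.asnumpy(result)  # copy back to host memory as an ordinary numpy array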

The majority of GPUs have much better performance with 32-bit floats than with 64-bit floats, so you may also wish to do

from prysm.conf import config

config.precision = 32

This makes prysm use 32-bit numbers for all computations. Note that if you create your own inputs with 64-bit floats, they will “infect” the model due to numpy’s promotion rules, and prysm’s 32-bit configuration will effectively be overridden.
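
As a quick illustration of that promotion behavior, using plain numpy:

import numpy

a = numpy.zeros(8, dtype=numpy.float32)  # what prysm produces when precision is 32
b = numpy.zeros(8)                       # float64 by default
print((a + b).dtype)                     # float64 -- the 64-bit input wins

Casting any arrays you create yourself to 32-bit before handing them to prysm avoids this.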

Multiple wavelengths can be run in parallel on the same GPU, if the GPU is “large” enough for this to make sense, or on multiple GPUs. The latter requires creating the model’s static data on each device (cp.cuda.runtime.setDevice) and passing the device number to the map function discussed below. The results must be collected on a single GPU or on the CPU at the end.
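
A rough sketch of the multi-GPU pattern follows. Only the cupy and concurrent.futures calls are real API; build_model and propagate_one are hypothetical stand-ins for creating the static data on a device and running one wavelength through the model:

from concurrent.futures import ThreadPoolExecutor

import cupy as cp

def run_one(args):
    wavelength, device_id = args
    cp.cuda.runtime.setDevice(device_id)  # bind this thread to one GPU
    model = build_model(wavelength)       # hypothetical: static data created on this device
    field = propagate_one(model)          # hypothetical: monochromatic propagation
    return cp.asnumpy(field)              # collect the result on the CPU

wavelengths = [0.55, 0.60, 0.65]          # example values, microns
devices = [0, 1, 0]                       # which GPU each wavelength runs on
with ThreadPoolExecutor(max_workers=len(wavelengths)) as pool:
    fields = list(pool.map(run_one, zip(wavelengths, devices)))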

Manycore Systems and CPU Parallelization

prysm makes no effort internally to multi-thread or parallelize computations, nor does it influence the decisions its dependencies make about this. For instance, numpy automatically parallelizes matrix multiplication when it is linked against Intel MKL.

To control CPU backends, typically you may set

export OMP_NUM_THREADS="N"

where N is an integer you have determined performs well. Alternatively, use the threadpoolctl library to do this for you. There is also scipy’s fft.set_workers(N) context manager.
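
For example, with threadpoolctl and scipy’s FFT worker pool (N=4 is only a placeholder, and run_model is a hypothetical function standing in for your prysm computation):

from threadpoolctl import threadpool_limits
from scipy import fft

with threadpool_limits(limits=4):  # cap BLAS/OpenMP threads while this block runs
    with fft.set_workers(4):       # allow scipy's FFTs to use 4 workers
        run_model()                # hypothetical: your prysm model here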

As a general comment, the algorithms of physical optics do not parallelize well “internally”: it is difficult to make a monochromatic problem run significantly faster by parallelizing its internal computations. It is effective, however, to parallelize across the wavelengths of a polychromatic problem. The concurrent.futures.ThreadPoolExecutor class and its map method are useful here; ProcessPoolExecutor may perform somewhat better. The pool should be created outside of any loops, as spinning it up and down is a significant cost that need not be paid more than once.
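
A sketch of that wavelength-parallel pattern, where propagate_one is a hypothetical function that runs the monochromatic model for a single wavelength:

from concurrent.futures import ThreadPoolExecutor

wavelengths = [0.50, 0.55, 0.60, 0.65]  # example values, microns

# create the pool once, outside any loop over time steps or configurations
pool = ThreadPoolExecutor(max_workers=len(wavelengths))

def run_polychromatic():
    # each call to the hypothetical propagate_one is one monochromatic model
    fields = list(pool.map(propagate_one, wavelengths))
    # spectral weighting and incoherent summation would happen here
    return fields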

These parallelization methods are largely ineffective on consumer desktop platforms due to limited memory bandwidth; they work better on workstation and server platforms with more memory channels. Using more than one thread per wavelength is often judicious to maximize performance, which is done by combining the two techniques described above.

Sense of Capacity

As a general sense of how fast prysm can be, a 7-plane multi-plane diffraction model at a total size of 1024x1024 samples runs in about 2 ms per wavelength on a Titan XP GPU. The same model runs in about 50 ms per plane on a dual Intel Xeon 6248R platform. The polychromatic model can be made to run at an aggregate time of 60 ms for 9 wavelengths on the GPU and 250 ms on the CPU, utilizing all available cores with optimal tuning via threadpoolctl and a ThreadPoolExecutor.