Using apprentice to construct and use a Gaussian Process
The Bayesian regression approach is a probabilistic approach to find the posterior of the coefficients of a function given the input-output data points. This approach provides a distribution over the coefficients that gets updated whenever new data points are observed.
The Gaussian Process (GP) approach, in contrast, places a distribution directly over latent functions, so that it describes a Gaussian distribution over all functions that are consistent with the observed data. A GP can be viewed as an infinite-dimensional (in coefficients) generalization of the multivariate Gaussian distribution, in which any finite collection of random variables (i.e., labels) is jointly Gaussian distributed.
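In symbols, the finite-collection property can be written as follows. This is standard GP notation rather than anything specific to apprentice; the mean function \(m\) and kernel \(k\) are generic placeholders.

\[
f \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot,\cdot)\big)
\quad\Longleftrightarrow\quad
\big(f(x_1), \dots, f(x_n)\big)^\top \sim \mathcal{N}(\mathbf{m}, K)
\ \text{ for every finite set } \{x_1, \dots, x_n\},
\]

with \(\mathbf{m}_i = m(x_i)\) and \(K_{ij} = k(x_i, x_j)\).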
A GP begins with a prior distribution specified on the function space and updates it with the observed data points, producing the posterior distribution over functions. From this distribution, we can obtain predictions at a point of interest from the joint distribution over the training labels and the label at the point of interest.
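The prior-to-posterior update can be sketched with a few lines of NumPy. This is a minimal textbook zero-mean GP with a squared exponential kernel and fixed hyperparameters, not apprentice's implementation; the names `sq_exp_kernel`, `gp_posterior`, `lengthscale`, `variance`, and `noise` are illustrative assumptions. Data follow a \(d \times N_p\) layout with one point per column.

```python
import numpy as np


def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between columns of A (d x n) and B (d x m)."""
    # Pairwise squared Euclidean distances between columns.
    sq_dist = (
        np.sum(A**2, axis=0)[:, None]
        + np.sum(B**2, axis=0)[None, :]
        - 2.0 * A.T @ B
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def gp_posterior(X, Y, X_new, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at the columns of X_new."""
    K = sq_exp_kernel(X, X) + noise * np.eye(X.shape[1])
    K_s = sq_exp_kernel(X, X_new)        # train-vs-new covariances
    K_ss = sq_exp_kernel(X_new, X_new)   # new-vs-new covariances
    L = np.linalg.cholesky(K)            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    mean = K_s.T @ alpha
    V = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(V**2, axis=0)
    return mean, np.maximum(var, 0.0)


# Toy 1-D data: d = 1, N_p = 5 observations of sin(x).
X = np.linspace(0.0, 4.0, 5).reshape(1, -1)
Y = np.sin(X[0])
mean, var = gp_posterior(X, Y, X)
```

At the training points the posterior mean reproduces `Y` up to the small noise term and the variance collapses toward zero; far from the data, the prediction reverts to the prior (mean 0, variance 1 in this sketch).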
In this tutorial, we describe how to set up the surrogate model construction problem, the options available for constructing the GP model, and how to store and use the GP model. More specifically, you will be shown how to:
Test the install
Set up the GP surrogate model construction problem
Construct the GP surrogate model
Use the GP surrogate model
Getting started
To install apprentice, execute the following commands:
git clone git@github.com:HEPonHPC/apprentice.git
cd apprentice/
pip install .
cd ..
Then, test the installation as described in the test installation documentation.
Construct Gaussian process surrogate model
There are multiple ways to construct a Gaussian process (GP) object.
From interpolation points
To construct a GP object using from_interpolation_points,
we need data of size \(d \times N_p\),
where \(d\) is the dimension and \(N_p\) is the number of data points.
Additionally, we need arguments that describe the strategy, kernel, and other
relevant model parameters:
GP = GaussianProcess.from_interpolation_points(
    X, Y,
    seed=<int>,
    kernel=<str>,
    max_restarts=<int>,
    keepout_percentage=<float>,
    mean_surrogate_model=<object of apprentice.SurrogateModel>,
    error_surrogate_model=<object of apprentice.SurrogateModel>,
    sample_size=<int>,
    stopping_bound=<float>,
    strategy=<str>
)
In this call,
X is a 2-D array of size \(d \times N_p\) containing the x data values to fit
Y is a 1-D array of size \(N_p\) containing the y data values to fit
kernel is the GP kernel to use. Allowed kernels include:
sqe: Squared exponential kernel
ratquad: Rational quadratic kernel
matern32: Matern 3/2 kernel
matern52: Matern 5/2 kernel
poly: Polynomial kernel
or: Hybrid OR kernel (all of the above kernels summed together)
max_restarts is the maximum number of restarts to use in the hyperparameter tuning problem
keepout_percentage is the percentage of the data held out for testing, so (100-keepout_percentage)% of the data is used for training the GP
mean_surrogate_model: surrogate model over the prior mean
error_surrogate_model: surrogate model over the prior heteroscedastic variance
sample_size: number of samples of training dataset at each data point
stopping_bound: stopping condition for heteroscedastic GP tuning
strategy is the GP construction strategy to use:
strategy = "1": Most Likely Heteroscedastic Gaussian Process (HeGP-ML)
strategy = "2": Heteroscedastic Gaussian Process using Stochastic Kriging (HeGP-SK)
strategy = "3": Homoscedastic Gaussian Process (HoGP)
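To make the kernel names above concrete, the following sketch writes out common textbook forms of each kernel in NumPy for inputs stored as \(d \times N_p\) arrays. It is an assumption that these match the general shape of the kernels behind the option names; the exact parameterization used by apprentice may differ, and the hyperparameter names (`ell`, `alpha`, `offset`, `degree`) and the helper `hybrid_sum` are hypothetical.

```python
import numpy as np


def pairwise_dist(A, B):
    """Euclidean distances between columns of A (d x n) and B (d x m)."""
    diff = A[:, :, None] - B[:, None, :]      # shape (d, n, m)
    return np.sqrt(np.sum(diff**2, axis=0))   # shape (n, m)


def sqe(A, B, ell=1.0):
    """Squared exponential kernel."""
    return np.exp(-0.5 * (pairwise_dist(A, B) / ell) ** 2)


def ratquad(A, B, ell=1.0, alpha=1.0):
    """Rational quadratic kernel."""
    r2 = (pairwise_dist(A, B) / ell) ** 2
    return (1.0 + r2 / (2.0 * alpha)) ** (-alpha)


def matern32(A, B, ell=1.0):
    """Matern 3/2 kernel."""
    s = np.sqrt(3.0) * pairwise_dist(A, B) / ell
    return (1.0 + s) * np.exp(-s)


def matern52(A, B, ell=1.0):
    """Matern 5/2 kernel."""
    s = np.sqrt(5.0) * pairwise_dist(A, B) / ell
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)


def poly(A, B, offset=1.0, degree=2):
    """Polynomial kernel."""
    return (A.T @ B + offset) ** degree


def hybrid_sum(A, B):
    """Sum of all kernels above, mirroring the description of the `or` option."""
    return sum(k(A, B) for k in (sqe, ratquad, matern32, matern52, poly))


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2, 6))   # d = 2, N_p = 6
K = hybrid_sum(X, X)                       # (6, 6) covariance matrix
```

Because each summand is a valid kernel, the summed matrix `K` is symmetric and positive semidefinite, which is what allows a sum of kernels to serve as a GP covariance.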
Once the GP is constructed using from_interpolation_points,
the coefficients and other metrics of the GP fit can be retrieved
as a dict and stored in a variable using:
GP_output = GP.as_dict
Additionally, to save the coefficients and other metrics of the GP fit
to a file at location <file location> use:
GP.save(<file location>)
From data structure
To reconstruct a GP object from a dict stored in the variable
GP_output, use from_data_structure:
GP_from_data_structure = GaussianProcess.from_data_structure(GP_output)
From file
To reconstruct a saved GP object from the file at location
<file location>, use from_file:
GP_from_file = GaussianProcess.from_file(<file location>)
Operations allowed on the GP object
The following operations can be performed on a constructed GP object GP.
y = GP(x): compute the GP at a single point
x, an array of size \(d\). y is a single value of type float
Y = GP.f_X(X): compute the GP at multiple points
X, an array of size \(d \times N_p\). Y is an array of size \(N_p\).
GP_output = GP.as_dict: get the hyperparameters and other metrics of the GP fit into a variable.
GP_output is a dict.
GP.save(<file location>): save the hyperparameters and other metrics of the GP fit into a file at location <file location>
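The input and output shapes of the two evaluation calls can be illustrated with a small stand-in object. `ToySurrogate` below is hypothetical and is not part of apprentice; it only mimics the call signatures listed above (a callable for a single point and an `f_X` method for many points) so that the column-per-point convention of the \(d \times N_p\) array is visible.

```python
import numpy as np


class ToySurrogate:
    """Hypothetical stand-in that mimics the call shapes listed above."""

    def __init__(self, weights):
        # One weight per input dimension.
        self.weights = np.asarray(weights, dtype=float)

    def __call__(self, x):
        # Single point: x has size d, the result is a plain float.
        return float(self.weights @ np.asarray(x, dtype=float))

    def f_X(self, X):
        # Many points: X has shape (d, N_p), one point per column.
        return self.weights @ np.asarray(X, dtype=float)


model = ToySurrogate([1.0, -2.0])   # d = 2
X = np.array([[0.0, 1.0, 2.0],
              [1.0, 1.0, 0.5]])     # d x N_p = 2 x 3
Y = model.f_X(X)                    # array of size N_p = 3
y0 = model(X[:, 0])                 # single value for the first column
```

With this layout, evaluating column `X[:, j]` as a single point gives the same value as entry `j` of the multi-point result.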
More information about the code is in the code documentation for GP. Additionally, the GP unit test script contains the construction and usage of the operations over the GP object.