Bayesian Approach to Chlorophyll Estimation from Satellite Remote Sensing
This post describes the use of Bayesian polynomial regression to estimate chlorophyll from remote sensing reflectance; the output is contrasted with that obtained via frequentist ordinary least squares regression.
Chlorophyll is estimated from ocean color remote sensing data using one of two main types of algorithms: semi-analytical or empirical. The latter is a polynomial model whose input is a ratio of bands and whose coefficients are obtained via ordinary least squares fitting. These models are usually more successful than their semi-analytical counterparts and, as a result, are at the forefront of the operational algorithmic arsenal used by the Ocean Biology Processing Group at NASA Goddard. Like all models, they suffer from a number of shortcomings. One of these is the absence of an algorithm uncertainty estimate, which makes it difficult to attach any kind of confidence to the final product, chlorophyll. Moreover, data preparation, model training, and model testing are poorly documented. This lack of standardized algorithm development makes it challenging to verify previous results or to reproduce the algorithm development procedure.

In this post, I demonstrate Bayesian regression by focusing on the OC4 algorithm. OC4 is part of the OCx family of empirical algorithms, which are generally expressed as follows: $$ \log_{10}\left(chlor_a\right) = a_0 + \sum_{i=1}^{4}a_i\left[\log_{10}\left(\frac{\max\left(Rrs\left(\lambda_{blue}\right)\right)}{Rrs\left(\lambda_{green}\right)}\right)\right]^i$$
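To make the OCx form concrete, here is a minimal sketch of how such a polynomial would be evaluated in code; the function name and arguments are illustrative (not the operational implementation), and the coefficients are assumed to be supplied in ascending order of degree (a0 through a4):

import numpy as np

def ocx_chl(rrs_blue_candidates, rrs_green, coeffs):
    """Evaluate an OCx-style polynomial. rrs_blue_candidates has shape
    (n_samples, n_blue_bands); rrs_green has shape (n_samples,)."""
    # maximum band ratio: brightest blue/blue-green band over the green band
    ratio = np.log10(rrs_blue_candidates.max(axis=1) / rrs_green)
    log_chl = sum(a * ratio**i for i, a in enumerate(coeffs))
    return 10**log_chl  # chlorophyll-a in mg m^-3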
The dataset used here is as close as possible to the one used to compute the coefficients of the OC4 polynomial, the latest version of which is OC4v6, obtained a few years ago using frequentist least squares regression. I'll first redo a least squares fit on the present dataset. I will then repeat the exercise with a Bayesian regression approach and explore the much richer information content of the inference results.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import os
from scipy.stats import norm
import matplotlib.pyplot as pl
import pymc3 as pm
import pandas as pd
import seaborn as sb
import pickle
from sklearn.metrics import mean_squared_error, r2_score
from theano import shared
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
%matplotlib inline
Some helper functions:
def FitPoly(X, y, order=4, lin=False):
    """
    Numpy polynomial regression. Returns coefficients (highest degree first).
    kwargs:
        lin: set to True if X and y are in linear space; they are then
             log10-transformed before fitting. If False, the data are
             assumed to be log-transformed already.
    """
    if lin:
        X = np.log10(X)
        y = np.log10(y)
    coeffs = np.polyfit(X, y, deg=order)
    return coeffs
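As a quick sanity check on this helper, here is a hypothetical example on synthetic, already log-scale data (the coefficient values are made up for illustration only):

xSynth = np.linspace(-1.5, 1.5, 200)
ySynth = 0.3 - 3.0 * xSynth + 2.7 * xSynth**2 + np.random.normal(0, 0.05, xSynth.size)
print(FitPoly(xSynth, ySynth, order=2, lin=False)[::-1])  # roughly [0.3, -3.0, 2.7]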
def GetPPC(data, mcChain, pmModel):
    """
    Runs posterior predictive checks (PPC).
    Returns 50% and 95% Highest Probability Density (HPD) intervals of predicted chlorophyll.
    Inputs:
        data: log of band ratio, pandas Series object
        mcChain: pymc3 trace object, result of mcmc sampling
        pmModel: pymc3 model object.
    Outputs:
        sig0: 50% HPD interval of predicted log chlorophyll
        sig1: 95% HPD interval of predicted log chlorophyll
    """
    # sample the posterior predictive, then compute 50% and 95% density intervals,
    # sorted by the band ratio for plotting
    ppcTrain = pm.sample_ppc(mcChain, samples=1000, model=pmModel)
    idx = np.argsort(data.values)
    sig0 = pm.hpd(ppcTrain['chlPred'], alpha=0.5)[idx]
    sig1 = pm.hpd(ppcTrain['chlPred'], alpha=0.05)[idx]
    return sig0, sig1
def PlotPPC(data, chlObs, sig0, sig1, bayesCoeffs, oc4V6Coeffs=[], myOLSCoeffs=[],
            titleTail='on Training Set'):
    """
    Plots posterior predictive checks, overlaid with 50% and 95% HPDs of predicted log chlorophyll.
    Optionally adds the OC4v6 estimate and the OLS estimate computed in this notebook.
    Inputs:
        data: log of band ratio, pandas Series object
        chlObs: log of in situ observed chl, pandas Series object
        sig0: 50% HPD interval of predicted log chlorophyll
        sig1: 95% HPD interval of predicted log chlorophyll
        bayesCoeffs: point estimates of the Bayesian regression coefficients (ascending order)
        oc4V6Coeffs: historical OC4v6 coefficients (ascending order)
        myOLSCoeffs: frequentist OLS coefficients (as returned by np.polyfit, highest degree first)
    Outputs:
        axis handle to resulting plot.
    """
    chlBayes = 0
    idx = np.argsort(data.values)
    ocxRatio_ord = data.values[idx]
    ocxRatioMock = np.linspace(data.min(), data.max(), 100)
    for i, coeff in enumerate(bayesCoeffs):
        chlBayes += coeff * ocxRatioMock**i
    pl.figure(figsize=(16, 10))
    pl.fill_between(ocxRatio_ord, sig0[:, 0], sig0[:, 1], color='k', alpha=1, label='50% HPD')
    pl.fill_between(ocxRatio_ord, sig1[:, 0], sig1[:, 1], color='darkgray', alpha=0.5, label='95% HPD')
    pl.plot(data, chlObs, 'ko', alpha=0.5, markeredgecolor='k', label='Observation')
    if len(oc4V6Coeffs) > 0:
        oc4V6Modl = 0
        for i, coef in enumerate(oc4V6Coeffs):
            oc4V6Modl += coef * ocxRatioMock**i
        pl.plot(ocxRatioMock, oc4V6Modl, 'r', linewidth=3, linestyle='--', label='OC4v6')
    if len(myOLSCoeffs) > 0:
        myChlModl = 0
        # np.polyfit returns coefficients highest degree first; reverse to ascending order
        for i, coeff in enumerate(myOLSCoeffs[::-1]):
            myChlModl += coeff * ocxRatioMock**i
        pl.plot(ocxRatioMock, myChlModl, 'm', linewidth=5, label='OLS')
    pl.plot(ocxRatioMock, chlBayes, 'b', linewidth=2, label='Bayesian OC4')
    pl.xlabel(r'$log_{10}\frac{max(Rrs\left(\lambda_{blue}\right))}{Rrs\left(\lambda_{green}\right)}$', fontsize=20)
    pl.ylabel(r'$log_{10}\left(chlor_a\right)$', fontsize=20)
    pl.legend(fontsize=14)
    pl.title('Posterior Predictive Check %s' % titleTail)
    return pl.gca()
dfNomadV2 = pd.read_pickle('./bayesianChl_DATA/NOMADV2SWFClean.pkl')
Next is to inspect the data. Note that, in contrast to the description found here, chlorophyll from fluorescence is included when HPLC chlorophyll is not available.
dfNomadV2.describe()
dfNomadV2.info()
Looking at the immediately relevant data graphically:
sb.pairplot(dfNomadV2, vars=['rrs443', 'rrs489', 'rrs510', 'rrs555','chl_all']);
There's strong collinearity, particularly among neighboring bands. In a multivariate regression that might be a problem. Here, though, we use a ratio of the max of three adjacent bands (blue to blue-green) to the next band (green), so this is not necessarily a problem. The data appear mostly lognormally distributed (cf. also Campbell et al., 1995). Next is to create a maxBlue column and a blue2green ratio column, followed by log-transforming the blue-to-green ratio and the chlorophyll data. These will go in a new DataFrame. I'll keep the id column as well in case I need to refer back to the original dataset.
dfNomadV2['maxBlue'] = dfNomadV2.loc[:, ['rrs443','rrs489','rrs510']].max(axis=1)
dfNomadV2['blue2green'] = dfNomadV2.maxBlue / dfNomadV2.rrs555
dfLogOCx = pd.DataFrame(columns=['id','mxBl2Gr', 'chl',])
dfLogOCx['id'] = dfNomadV2.id
dfLogOCx['mxBl2Gr'] = np.log10(dfNomadV2.blue2green.values)
dfLogOCx['chl'] = np.log10(dfNomadV2.chl_all.values)
And let's inspect the data again
dfLogOCx.head()
dfLogOCx.describe()
dfLogOCx.info()
Pair plotting just the data relevant to the upcoming regression exercise...
sb.set(font_scale=1.5)
g = sb.PairGrid(dfLogOCx, vars=['chl', 'mxBl2Gr'],size=5, diag_sharey=False);
g = g.map_upper(pl.scatter,alpha=0.5)
g = g.map_diag(sb.kdeplot, lw=3)
g = g.map_lower(sb.kdeplot, cmap="Purples_d");
#g.axes[0,0].set_xticks(np.arange(-0.5,1.5,0.5));
Now fitting the data with a 4th order polynomial regression and printing the coefficients:
myCoefs = FitPoly(dfLogOCx.mxBl2Gr.values,dfLogOCx.chl,lin=False)
print(myCoefs[::-1])
Let's first confirm the numbers make sense by plotting modeled chl against the field data:
mdlFreq= np.poly1d(myCoefs) # encapsulate the fit results into a polynomial object
chlMdlFreq = mdlFreq(dfLogOCx.mxBl2Gr) # evaluate polynomial with input data
Out of curiosity I'm also going to evaluate the use of OC4v6 coefficients:
$a_i$ = [0.3272, -2.9940, 2.7218, -1.2259, -0.5683]
OC4v6_coeffs = [0.3272, -2.9940, 2.7218, -1.2259, -0.5683]
mdlOC4v6 = np.poly1d(OC4v6_coeffs[::-1])
chlMdlOc4v6 = mdlOC4v6(dfLogOCx.mxBl2Gr)
Plotting both OLS fits yields...
f, axs = pl.subplots(ncols=2, figsize=(12, 6), sharey=True)
for i, (ax, res, name) in enumerate(zip(axs, [chlMdlFreq, chlMdlOc4v6],
                                        ['$4^{th}$ Order Poly. Reg.', 'OC4v6'])):
    ax.plot(res, dfLogOCx.chl, 'ko', alpha=0.5)
    ax.plot([-2, 2], [-2, 2], 'r--', linewidth=2, label='1:1')
    ax.set_xlabel(name, fontsize=16)
    ax.set_xlim((-2, 2))
    if i == 0:
        ax.set_ylabel(r'$ log_{10}$ (measured chl)', fontsize=18)
Metrics for my frequentist fit and for OC4v6:
print("mse, this fit: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlFreq))
print(r"r^2, this fit: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlFreq))
print("mse, OC4v6: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlOc4v6))
print(r"r^2, OC4v6: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlOc4v6))
Bayesian linear regression
For now, I am simply going to assume I'm still looking for a $4^{th}$ order polynomial; model comparison will be the subject of a subsequent post. First, I have to specify priors for the coefficients. I'll assume weakly informative Gaussian priors. Note that using Gaussian priors on the coefficients is the Bayesian counterpart of ridge regression, and is therefore inherently more robust to overfitting.
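To see the ridge connection explicitly: with Gaussian priors $a_i \sim \mathcal{N}(0, \tau^2)$ and a Gaussian likelihood with noise standard deviation $\sigma$, maximizing the log posterior is the same as minimizing a penalized sum of squares, $$ \hat{a} = \underset{a}{\arg\min}\; \sum_{j}\left(y_j - \sum_{i=0}^{4} a_i x_j^i\right)^2 + \frac{\sigma^2}{\tau^2}\sum_{i=0}^{4} a_i^2, $$ where $x_j$ is the log band ratio and $y_j$ the log chlorophyll. This is ridge regression with penalty $\lambda = \sigma^2/\tau^2$. With the broad priors used below (sd = 10), the penalty is weak, which is why the Bayesian point estimates end up close to the OLS solution.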
Before deciding on an appropriate distribution for the likelihood, I will check the normality of the data by way of a normality plot, as well as a comparison between the cumulative distribution function (CDF) of the data and that of a normal distribution with the same mean and standard deviation.
normData = norm.rvs(dfLogOCx.chl.mean(),dfLogOCx.chl.std(),
size=dfLogOCx.chl.size)
normDataSort = np.sort(normData)
logChlSort = np.sort(dfLogOCx.chl.values)
f, ax = pl.subplots(ncols=2, figsize=(13, 6))
lbl = r'$log_{10}(chl)$'
ax[0].plot(normDataSort, logChlSort, 'ko', alpha=0.5)
ax[0].set_xlabel('normal values', fontsize=16)
ax[0].set_ylabel('%s' % lbl, fontsize=16)
ax[0].set_title('normality plot')
ax[1].plot(logChlSort, np.linspace(0, 1, logChlSort.size), label='obs. data')
ax[1].plot(normDataSort, np.linspace(0, 1, normDataSort.size), 'r--',
           label='normal values')
ax[1].set_ylabel('CDF')
ax[1].legend(loc='best');
The above suggests that a normal distribution is a good candidate for the likelihood. Next is to set up the Bayesian model. Note that the input data is cast as a theano shared variable. This is not necessary to run the model below, but it makes it easy to run out-of-sample tests during validation (see further below).
ocxRatio = dfLogOCx.mxBl2Gr
chlObs = dfLogOCx.chl
ocxRatioShrd = shared(ocxRatio.values)
with pm.Model() as model_OCx_norm:
    # weakly informative gaussian priors on the polynomial coefficients
    a0 = pm.Normal('a0', mu=0, sd=10)
    a1 = pm.Normal('a1', mu=0, sd=10)
    a2 = pm.Normal('a2', mu=0, sd=10)
    a3 = pm.Normal('a3', mu=0, sd=10)
    a4 = pm.Normal('a4', mu=0, sd=10)
    # uniform prior on the observation noise
    epsilon = pm.Uniform('epsilon', lower=0, upper=10)
    # 4th order polynomial in the log band ratio
    mu = a0 + a1 * ocxRatioShrd + a2 * ocxRatioShrd**2 + a3 * ocxRatioShrd**3 + a4 * ocxRatioShrd**4
    # gaussian likelihood on log chlorophyll
    chlPred = pm.Normal('chlPred', mu=mu, sd=epsilon, observed=chlObs)
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    traceOCx_norm = pm.sample(10000, step=step, start=start)
Below I discard the first 1000 samples of the trace as burn-in. The marginals of the posteriors from the rest of the MCMC sampling are plotted below, alongside the trace for each coefficient, to verify that convergence was achieved. The OC4v6 coefficients are plotted for reference. The OLS fit performed above on this dataset would correspond roughly to the means of these mostly normal distributions.
chainOCx_norm = traceOCx_norm[1000:]
varnames=['a%d' %d for d in range(5)]
varnames.append('epsilon')
refvals = {'a%d' %d: rv for d,rv in zip(range(5), OC4v6_coeffs) }
pm.traceplot(traceOCx_norm,varnames=varnames,lines=refvals);
The plots above suggest good convergence. Another diagnostic is the autocorrelation plot of each trace, which confirms good mixing.
pm.autocorrplot(chainOCx_norm, varnames=varnames);
Marginals of the Posterior Distributions
Next is to examine posteriors. This can be done in a number of ways for better insight into the regression. Below is a tabulated summary followed by visualizations of the marginals. These include a forestplot that allows a direct comparison between the coefficient estimates, and a posterior plot that allows examination of the details of the distribution of each coefficient's marginal posterior.
pm.df_summary(chainOCx_norm, varnames=varnames)
pm.forestplot(chainOCx_norm,varnames=varnames);
pm.plot_posterior(chainOCx_norm, varnames=varnames, alpha=0.6);
The main takeaways from the above are the comparatively wide distributions of a3 and a4, implying larger uncertainty, and, more importantly, the fact that both coefficients' 95% highest probability density intervals straddle zero. I'll explore some more before concluding.
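A quick numeric confirmation is the posterior probability that each of these coefficients is positive, computed directly from the trace (a small sketch using the chain from above):

for name in ['a3', 'a4']:
    samples = chainOCx_norm[name]  # posterior samples for this coefficient
    print("P(%s > 0) = %.2f" % (name, (samples > 0).mean()))

Values far from 0 or 1 indicate that the sign of the coefficient is not well determined by the data.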
Covariation among model parameters
To get a sense of how much additional information each regression coefficient contributes to the larger picture, we can plot the pairwise relationships between coefficients.
df_trace = pm.trace_to_dataframe(traceOCx_norm,
varnames=['a%d' %d for d in range(5)])
g = sb.PairGrid(df_trace, diag_sharey=False)
g = g.map_diag(sb.kdeplot)
g = g.map_lower(pl.scatter, alpha=0.5)
There is a weak correlation among most parameter pairs, which is not all that unusual. However, a relatively strong correlation can be seen between a3 and a4. This, in addition to the previous observations about a3 and a4, indicates that the fourth order term adds little information to the prediction effort; a third order polynomial may ultimately generalize better. That will be the subject of a subsequent post on model comparison.
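The strength of that coupling can also be quantified directly from the samples; for instance, the correlation matrix of the posterior draws (reusing df_trace from above):

# pairwise Pearson correlations between the sampled coefficients
print(df_trace.corr().round(2))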
Posterior Predictive Checks
One concern with any model is validation. Posterior predictive checks allow us to do just that: sample from the posterior predictive distribution and see how well the samples compare with the data at hand. Like so...
sig0, sig1 = GetPPC(data=ocxRatio, mcChain=chainOCx_norm, pmModel=model_OCx_norm)
bayesCoeffs = [chainOCx_norm['a%d' %i].mean() for i in range(5)]
ax = PlotPPC(ocxRatio, chlObs, sig0, sig1, bayesCoeffs, oc4V6Coeffs=OC4v6_coeffs, myOLSCoeffs=myCoefs)
The 50% and 95% highest probability density (HPD) intervals show where we expect to see 50% and 95% of future data. The majority of the training data falls inside the 95% interval, but there are some outliers (to this model), particularly when chlorophyll concentration is above $1\ mg\ m^{-3}$. I'll investigate this point in another post. Let's also calculate the usual frequentist metrics for this model.
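First, though, a quick sketch that puts a number on "the majority of the training data": the empirical coverage of the 95% HPD band, i.e. the fraction of observations falling inside it (recall that GetPPC sorted the intervals by the band ratio, so the observations must be sorted the same way):

idx = np.argsort(ocxRatio.values)  # same ordering used inside GetPPC
chlSorted = chlObs.values[idx]
inside = (chlSorted >= sig1[:, 0]) & (chlSorted <= sig1[:, 1])
print("empirical coverage of the 95%% HPD: %.2f" % inside.mean())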
chlBayesReal = 0
for i in range(5):
    chlBayesReal += chainOCx_norm['a%d' % i].mean() * ocxRatio**i
print("mse, this fit: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlFreq))
print("r^2, this fit: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlFreq))
print("mse, OC4v6: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlOc4v6))
print("r^2, OC4v6: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlOc4v6))
print("mse, Bayes chl: %.2f" % mean_squared_error(dfLogOCx.chl.values, chlBayesReal))
print("r^2, Bayes chl: %.2f" % r2_score(dfLogOCx.chl.values, chlBayesReal))
So frequentist diagnostics show that the Bayesian model fit matches the least squares fit on the training data, as it should, since I used weakly informative (nearly flat) priors. This is not sufficient to validate a model, however, as the model has not yet been tested on new data.
Predicting Outcome from New Data
A machine learning practitioner who knows what (s)he's doing will want to apply the model to data that was not used in the fitting process, to assess the model's skill at real world prediction. The advantage in this particular case, however, is that in addition to predictions we get uncertainties, which are not readily available off the shelf with scikit-learn or other machine learning libraries. Even better, we can use the now familiar posterior predictive check approach to do just that.
As a reminder, I cast the input data as a theano shared variable to allow replacing the training data with the test data. The test data consists of SeaWiFS matchup data that originates, like NOMAD, from SeaBASS. However, the data in this set was gathered after NOMAD was put together (2010 onward), to avoid data leakage.
Another interesting aspect of this particular test data set is that because it's a matchup set it contains corresponding in situ and satellite measurements, which makes for an interesting performance evaluation of the model.
# load test set
dfTest = pd.read_pickle('./bayesianChl_DATA/dfSwfChloraMups2010p.pkl')
dfTest.info()
dfLogTest = pd.DataFrame(columns=['id','seawifs_mxBl2Gr','insitu_chlor_a', 'seawifs_chlor_a'])
dfLogTest.id = dfTest.id
dfLogTest.seawifs_mxBl2Gr= np.log10(dfTest.seawifs_mxBl2Gr)
dfLogTest.insitu_chlor_a = np.log10(dfTest.insitu_chlor_a)
dfLogTest.seawifs_chlor_a = np.log10(dfTest.seawifs_chlor_a)
# SUBSTITUTING TRAINING SET FOR TEST SET - CHANGING VALUES HERE WILL ALSO CHANGE VALUES IN THE MODEL
ocxRatioShrd.set_value(dfLogTest.seawifs_mxBl2Gr.values)
sig0, sig1 = GetPPC(dfLogTest.seawifs_mxBl2Gr,mcChain=chainOCx_norm, pmModel=model_OCx_norm)
bayesCoeffs = [chainOCx_norm['a%d' %i].mean() for i in range(5)]
ax = PlotPPC(dfLogTest.seawifs_mxBl2Gr, dfLogTest.insitu_chlor_a,
             sig0, sig1, bayesCoeffs, oc4V6Coeffs=OC4v6_coeffs, myOLSCoeffs=myCoefs,
             titleTail='on Test Set')
chlMdlFreq = mdlFreq(dfLogTest.seawifs_mxBl2Gr) # evaluate polynomial with input data
chlMdlOc4v6 = mdlOC4v6(dfLogTest.seawifs_mxBl2Gr)
chlBayesReal = 0
for i in range(5):
    chlBayesReal += chainOCx_norm['a%d' % i].mean() * dfLogTest.seawifs_mxBl2Gr**i
print("mse, this fit: %.2f " % mean_squared_error(dfLogTest.insitu_chlor_a.values, chlMdlFreq))
print("r^2, this fit: %.2f" % r2_score(dfLogTest.insitu_chlor_a.values, chlMdlFreq))
print("mse, OC4v6: %.2f " % mean_squared_error(dfLogTest.insitu_chlor_a.values, chlMdlOc4v6))
print("r^2, OC4v6: %.2f" % r2_score(dfLogTest.insitu_chlor_a.values, chlMdlOc4v6))
print("mse, Bayes chl: %.2f" % mean_squared_error(dfLogTest.insitu_chlor_a.values, chlBayesReal))
print("r^2, Bayes chl: %.2f" % r2_score(dfLogTest.insitu_chlor_a.values, chlBayesReal))
Again, the frequentist diagnostics are equivalent for the frequentist OLS and Bayesian regressions, with the usual drop in performance from the training data to the out-of-sample test data.
However, the point of this post was to demonstrate that the Bayesian approach uncovers a wealth of information about the components of the model and the fit itself, not the least of which are naturally obtained model uncertainties. These I have expressed as credible intervals around the coefficients and around the fit. That's it for this post, but that's not it for this dataset. The diagnostics displayed above suggest that this model may be overly complex, and thereby prone to overfitting. In a subsequent post, I'll test a simpler model and use it as an excuse to explore the mysterious topic of model comparison.
Happy hacking, probabilistic or otherwise!