Bayesian Approach to Chlorophyll Estimation from Satellite Remote Sensing
This post describes the use of Bayesian polynomial regression to estimate chlorophyll from remote sensing reflectance; the output is contrasted with that obtained via frequentist ordinary least squares regression.
Chlorophyll is estimated from ocean color remote sensing data using one of two main types of algorithms: semi-analytical or empirical. The latter is a polynomial model whose input is a ratio of bands and whose coefficients are obtained via ordinary least squares fitting. These models are usually more successful than their semi-analytical counterparts and, as a result, are at the forefront of the operational algorithmic arsenal used by the Ocean Biology Processing Group at NASA Goddard. Like all models, they suffer from a number of shortcomings. One of these is the absence of an algorithm uncertainty estimate, which makes it difficult to attach any kind of confidence to the final product, chlorophyll. Moreover, data preparation, model training, and model testing are poorly documented. This lack of standardized algorithm development makes it challenging to verify previous results or to reproduce the algorithm development procedure.

In this post, I demonstrate Bayesian regression by focusing on the OC4 algorithm. OC4 is part of the OCx family of empirical algorithms, which are generally expressed as follows: $$ \log_{10}\left(chlor_a\right) = a_0 + \sum_{i=1}^{4}a_i\left[\log_{10}\left(\frac{\max\left(Rrs\left(\lambda_{blue}\right)\right)}{Rrs\left(\lambda_{green}\right)}\right)\right]^i$$
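To make the OCx form concrete, here is a minimal sketch of how such a polynomial would be evaluated in code; the function name and arguments are illustrative (not the operational implementation), and the coefficients are assumed to be supplied in ascending order of degree (a0 through a4):

import numpy as np

def ocx_chl(rrs_blue_candidates, rrs_green, coeffs):
    """Evaluate an OCx-style polynomial. rrs_blue_candidates has shape
    (n_samples, n_blue_bands); rrs_green has shape (n_samples,)."""
    # maximum band ratio: brightest blue/blue-green band over the green band
    ratio = np.log10(rrs_blue_candidates.max(axis=1) / rrs_green)
    log_chl = sum(a * ratio**i for i, a in enumerate(coeffs))
    return 10**log_chl  # chlorophyll-a in mg m^-3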
The dataset used here is as close as possible to the one used to compute the coefficients of the OC4 polynomial, the latest version of which is OC4v6, obtained a few years ago using frequentist least squares regression. I'll first redo a least squares fit on the present dataset. I will then repeat the exercise with a Bayesian regression approach and explore the much richer information content of the inference results.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import os
from scipy.stats import norm
import matplotlib.pyplot as pl
import pymc3 as pm
import pandas as pd
import seaborn as sb
import pickle
from sklearn.metrics import mean_squared_error, r2_score
from theano import shared
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
%matplotlib inline
Some helper functions:
def FitPoly(X, y, order=4, lin=False):
    """
    Numpy polynomial regression. Returns coefficients (highest degree first).
    kwargs:
        lin: set to True if X and y are in linear space; they are then
             log10-transformed before fitting. If False, the data are
             assumed to be log-transformed already.
    """
    if lin:
        X = np.log10(X)
        y = np.log10(y)
    coeffs = np.polyfit(X, y, deg=order)
    return coeffs
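As a quick sanity check on this helper, here is a hypothetical example on synthetic, already log-scale data (the coefficient values are made up for illustration only):

xSynth = np.linspace(-1.5, 1.5, 200)
ySynth = 0.3 - 3.0 * xSynth + 2.7 * xSynth**2 + np.random.normal(0, 0.05, xSynth.size)
print(FitPoly(xSynth, ySynth, order=2, lin=False)[::-1])  # roughly [0.3, -3.0, 2.7]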
def GetPPC(data, mcChain, pmModel):
    """
    Runs posterior predictive checks (PPC).
    Returns 50% and 95% Highest Probability Density (HPD) intervals of predicted chlorophyll.
    Inputs:
        data: log of band ratio, pandas Series object
        mcChain: pymc3 trace object, result of mcmc sampling
        pmModel: pymc3 model object.
    Outputs:
        sig0: 50% HPD interval of predicted log chlorophyll
        sig1: 95% HPD interval of predicted log chlorophyll
    """
    # sample the posterior predictive, then compute 50% and 95% density intervals,
    # sorted by the band ratio for plotting
    ppcTrain = pm.sample_ppc(mcChain, samples=1000, model=pmModel)
    idx = np.argsort(data.values)
    sig0 = pm.hpd(ppcTrain['chlPred'], alpha=0.5)[idx]
    sig1 = pm.hpd(ppcTrain['chlPred'], alpha=0.05)[idx]
    return sig0, sig1
def PlotPPC(data, chlObs, sig0, sig1, bayesCoeffs, oc4V6Coeffs=[], myOLSCoeffs=[],
            titleTail='on Training Set'):
    """
    Plots posterior predictive checks, overlaid with 50% and 95% HPDs of predicted log chlorophyll.
    Optionally adds the OC4v6 estimate and the OLS estimate computed in this notebook.
    Inputs:
        data: log of band ratio, pandas Series object
        chlObs: log of in situ observed chl, pandas Series object
        sig0: 50% HPD interval of predicted log chlorophyll
        sig1: 95% HPD interval of predicted log chlorophyll
        bayesCoeffs: point estimates of the Bayesian regression coefficients (ascending order)
        oc4V6Coeffs: historical OC4v6 coefficients (ascending order)
        myOLSCoeffs: frequentist OLS coefficients (as returned by np.polyfit, highest degree first)
    Outputs:
        axis handle to resulting plot.
    """
    chlBayes = 0
    idx = np.argsort(data.values)
    ocxRatio_ord = data.values[idx]
    ocxRatioMock = np.linspace(data.min(), data.max(), 100)
    for i, coeff in enumerate(bayesCoeffs):
        chlBayes += coeff * ocxRatioMock**i
    pl.figure(figsize=(16, 10))
    pl.fill_between(ocxRatio_ord, sig0[:, 0], sig0[:, 1], color='k', alpha=1, label='50% HPD')
    pl.fill_between(ocxRatio_ord, sig1[:, 0], sig1[:, 1], color='darkgray', alpha=0.5, label='95% HPD')
    pl.plot(data, chlObs, 'ko', alpha=0.5, markeredgecolor='k', label='Observation')
    if len(oc4V6Coeffs) > 0:
        oc4V6Modl = 0
        for i, coef in enumerate(oc4V6Coeffs):
            oc4V6Modl += coef * ocxRatioMock**i
        pl.plot(ocxRatioMock, oc4V6Modl, 'r', linewidth=3, linestyle='--', label='OC4v6')
    if len(myOLSCoeffs) > 0:
        myChlModl = 0
        # np.polyfit returns coefficients highest degree first; reverse to ascending order
        for i, coeff in enumerate(myOLSCoeffs[::-1]):
            myChlModl += coeff * ocxRatioMock**i
        pl.plot(ocxRatioMock, myChlModl, 'm', linewidth=5, label='OLS')
    pl.plot(ocxRatioMock, chlBayes, 'b', linewidth=2, label='Bayesian OC4')
    pl.xlabel(r'$log_{10}\frac{max(Rrs\left(\lambda_{blue}\right))}{Rrs\left(\lambda_{green}\right)}$', fontsize=20)
    pl.ylabel(r'$log_{10}\left(chlor_a\right)$', fontsize=20)
    pl.legend(fontsize=14)
    pl.title('Posterior Predictive Check %s' % titleTail)
    return pl.gca()
dfNomadV2 = pd.read_pickle('./bayesianChl_DATA/NOMADV2SWFClean.pkl')
Next is to inspect the data. Note that, in contrast to the description found here, chlorophyll from fluorescence is included when HPLC chlorophyll is not available.
dfNomadV2.describe()
dfNomadV2.info()
Looking at the immediately relevant data graphically:
sb.pairplot(dfNomadV2, vars=['rrs443', 'rrs489', 'rrs510', 'rrs555','chl_all']);
There's strong collinearity, particularly among neighboring bands. In a multivariate regression that might be a problem. Here, though, we use a ratio of the max of three adjacent bands (blue to blue-green) to the next band (green), so this is not necessarily a problem. The data appear mostly lognormally distributed (cf. also Campbell et al., 1995). Next is to create a maxBlue column and a blue2green ratio column, followed by log-transforming the blue-to-green ratio and the chlorophyll data. These will go in a new DataFrame. I'll keep the id column as well in case I need to refer back to the original dataset.
dfNomadV2['maxBlue'] = dfNomadV2.loc[:, ['rrs443','rrs489','rrs510']].max(axis=1)
dfNomadV2['blue2green'] = dfNomadV2.maxBlue / dfNomadV2.rrs555
dfLogOCx = pd.DataFrame(columns=['id','mxBl2Gr', 'chl',])
dfLogOCx['id'] = dfNomadV2.id
dfLogOCx['mxBl2Gr'] = np.log10(dfNomadV2.blue2green.values)
dfLogOCx['chl'] = np.log10(dfNomadV2.chl_all.values)
And let's inspect the data again
dfLogOCx.head()
dfLogOCx.describe()
dfLogOCx.info()
Pair plotting just the data relevant to the upcoming regression exercise...
sb.set(font_scale=1.5)
g = sb.PairGrid(dfLogOCx, vars=['chl', 'mxBl2Gr'],size=5, diag_sharey=False);
g = g.map_upper(pl.scatter,alpha=0.5)
g = g.map_diag(sb.kdeplot, lw=3)
g = g.map_lower(sb.kdeplot, cmap="Purples_d");
#g.axes[0,0].set_xticks(np.arange(-0.5,1.5,0.5));
Now fitting the data with a 4th order polynomial regression and printing the coefficients:
myCoefs = FitPoly(dfLogOCx.mxBl2Gr.values,dfLogOCx.chl,lin=False)
print(myCoefs[::-1])
Let's first confirm the numbers make sense by plotting modeled chl against the field data:
mdlFreq= np.poly1d(myCoefs) # encapsulate the fit results into a polynomial object
chlMdlFreq = mdlFreq(dfLogOCx.mxBl2Gr) # evaluate polynomial with input data
Out of curiosity I'm also going to evaluate the use of OC4v6 coefficients:
$a_i$ = [0.3272, -2.9940, 2.7218, -1.2259, -0.5683]
OC4v6_coeffs = [0.3272, -2.9940, 2.7218, -1.2259, -0.5683]
mdlOC4v6 = np.poly1d(OC4v6_coeffs[::-1])
chlMdlOc4v6 = mdlOC4v6(dfLogOCx.mxBl2Gr)
Plotting both OLS fits yields...
f, axs = pl.subplots(ncols=2, figsize=(12, 6), sharey=True)
for i, (ax, res, name) in enumerate(zip(axs, [chlMdlFreq, chlMdlOc4v6],
                                        ['$4^{th}$ Order Poly. Reg.', 'OC4v6'])):
    ax.plot(res, dfLogOCx.chl, 'ko', alpha=0.5)
    ax.plot([-2, 2], [-2, 2], 'r--', linewidth=2, label='1:1')
    ax.set_xlabel(name, fontsize=16)
    ax.set_xlim((-2, 2))
    if i == 0:
        ax.set_ylabel(r'$ log_{10}$ (measured chl)', fontsize=18)
Metrics for my frequentist fit and for OC4v6:
print("mse, this fit: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlFreq))
print(r"r^2, this fit: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlFreq))
print("mse, OC4v6: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlOc4v6))
print(r"r^2, OC4v6: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlOc4v6))
Bayesian linear regression
For now, I am simply going to assume I'm still looking for a $4^{th}$ order polynomial; model comparison will be the subject of a subsequent post. First, I have to specify priors for the coefficients. I'll assume weakly informative Gaussian priors. Note that using Gaussian priors on the coefficients is the Bayesian counterpart of ridge regression, and is therefore inherently more robust to overfitting.
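To see the ridge connection explicitly: with Gaussian priors $a_i \sim \mathcal{N}(0, \tau^2)$ and a Gaussian likelihood with noise standard deviation $\sigma$, maximizing the log posterior is the same as minimizing a penalized sum of squares, $$ \hat{a} = \underset{a}{\arg\min}\; \sum_{j}\left(y_j - \sum_{i=0}^{4} a_i x_j^i\right)^2 + \frac{\sigma^2}{\tau^2}\sum_{i=0}^{4} a_i^2, $$ where $x_j$ is the log band ratio and $y_j$ the log chlorophyll. This is ridge regression with penalty $\lambda = \sigma^2/\tau^2$. With the broad priors used below (sd = 10), the penalty is weak, which is why the Bayesian point estimates end up close to the OLS solution.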
Before deciding on an appropriate distribution for the likelihood, I will check the normality of the data by way of a normality plot, as well as a comparison between the cumulative distribution function (CDF) of the data and that of a normal distribution with the same mean and standard deviation.
normData = norm.rvs(dfLogOCx.chl.mean(),dfLogOCx.chl.std(),
size=dfLogOCx.chl.size)
normDataSort = np.sort(normData)
logChlSort = np.sort(dfLogOCx.chl.values)
f, ax = pl.subplots(ncols=2, figsize=(13, 6))
lbl = r'$log_{10}(chl)$'
ax[0].plot(normDataSort, logChlSort, 'ko', alpha=0.5)
ax[0].set_xlabel('normal values', fontsize=16)
ax[0].set_ylabel('%s' % lbl, fontsize=16)
ax[0].set_title('normality plot')
ax[1].plot(logChlSort, np.linspace(0, 1, logChlSort.size), label='obs. data')
ax[1].plot(normDataSort, np.linspace(0, 1, normDataSort.size), 'r--',
           label='normal values')
ax[1].set_ylabel('CDF')
ax[1].legend(loc='best');
The above suggests that a normal distribution is a good candidate for the likelihood. Next is to set up the Bayesian model. Note that the input data is cast as a theano shared variable. This is not necessary to run the model below, but it makes it easy to run out-of-sample tests during validation (see further below).
ocxRatio = dfLogOCx.mxBl2Gr
chlObs = dfLogOCx.chl
ocxRatioShrd = shared(ocxRatio.values)
with pm.Model() as model_OCx_norm:
    # weakly informative gaussian priors on the polynomial coefficients
    a0 = pm.Normal('a0', mu=0, sd=10)
    a1 = pm.Normal('a1', mu=0, sd=10)
    a2 = pm.Normal('a2', mu=0, sd=10)
    a3 = pm.Normal('a3', mu=0, sd=10)
    a4 = pm.Normal('a4', mu=0, sd=10)
    # uniform prior on the observation noise
    epsilon = pm.Uniform('epsilon', lower=0, upper=10)
    # 4th order polynomial in the log band ratio
    mu = a0 + a1 * ocxRatioShrd + a2 * ocxRatioShrd**2 + a3 * ocxRatioShrd**3 + a4 * ocxRatioShrd**4
    # gaussian likelihood on log chlorophyll
    chlPred = pm.Normal('chlPred', mu=mu, sd=epsilon, observed=chlObs)
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    traceOCx_norm = pm.sample(10000, step=step, start=start)
Below I discard the first 1000 samples of the trace as burn-in. The marginals of the posteriors from the rest of the MCMC sampling are plotted below, alongside the trace for each coefficient, to verify that convergence was achieved. The OC4v6 coefficients are plotted for reference. The OLS fit performed above on this dataset would correspond roughly to the means of these mostly normal distributions.
chainOCx_norm = traceOCx_norm[1000:]
varnames=['a%d' %d for d in range(5)]
varnames.append('epsilon')
refvals = {'a%d' %d: rv for d,rv in zip(range(5), OC4v6_coeffs) }
pm.traceplot(traceOCx_norm,varnames=varnames,lines=refvals);
The plots above suggest good convergence. Another diagnostic is the autocorrelation plot of each trace, which confirms good mixing.
pm.autocorrplot(chainOCx_norm, varnames=varnames);
Marginals of the Posterior Distributions
Next is to examine posteriors. This can be done in a number of ways for better insight into the regression. Below is a tabulated summary followed by visualizations of the marginals. These include a forestplot that allows a direct comparison between the coefficient estimates, and a posterior plot that allows examination of the details of the distribution of each coefficient's marginal posterior.
pm.df_summary(chainOCx_norm, varnames=varnames)
pm.forestplot(chainOCx_norm,varnames=varnames);
pm.plot_posterior(chainOCx_norm, varnames=varnames, alpha=0.6);
The main takeaways from the above are the comparatively wide distributions of a3 and a4, implying larger uncertainty, and, more importantly, the fact that both coefficients' 95% highest probability density intervals straddle zero. I'll explore some more before concluding.
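A quick numeric confirmation is the posterior probability that each of these coefficients is positive, computed directly from the trace (a small sketch using the chain from above):

for name in ['a3', 'a4']:
    samples = chainOCx_norm[name]  # posterior samples for this coefficient
    print("P(%s > 0) = %.2f" % (name, (samples > 0).mean()))

Values far from 0 or 1 indicate that the sign of the coefficient is not well determined by the data.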
Covariation among model parameters
To get a sense of how much additional information each regression coefficient contributes to the larger picture, we can plot the pairwise relationships between coefficients.
df_trace = pm.trace_to_dataframe(traceOCx_norm,
varnames=['a%d' %d for d in range(5)])
g = sb.PairGrid(df_trace, diag_sharey=False)
g = g.map_diag(sb.kdeplot)
g = g.map_lower(pl.scatter, alpha=0.5)
There is a weak correlation among most parameter pairs, which is not all that unusual. However, a relatively strong correlation can be seen between a3 and a4. This, in addition to the previous observations about a3 and a4, indicates that the fourth order term adds little information to the prediction effort; a third order polynomial may ultimately generalize better. That will be the subject of a subsequent post on model comparison.
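The strength of that coupling can also be quantified directly from the samples; for instance, the correlation matrix of the posterior draws (reusing df_trace from above):

# pairwise Pearson correlations between the sampled coefficients
print(df_trace.corr().round(2))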
Posterior Predictive Checks
One concern with any model is validation. Posterior predictive checks allow us to do just that: sample from the posterior predictive distribution and see how well the samples compare with the data at hand. Like so...
sig0, sig1 = GetPPC(data=ocxRatio, mcChain=chainOCx_norm, pmModel=model_OCx_norm)
bayesCoeffs = [chainOCx_norm['a%d' %i].mean() for i in range(5)]
ax = PlotPPC(ocxRatio, chlObs, sig0, sig1, bayesCoeffs, oc4V6Coeffs=OC4v6_coeffs, myOLSCoeffs=myCoefs)
The 50% and 95% highest probability density (HPD) intervals show where we expect to see 50% and 95% of future data. The majority of the training data falls inside the 95% interval, but there are some outliers (to this model), particularly when chlorophyll concentration is above $1\ mg\ m^{-3}$. I'll investigate this point in another post. Let's also calculate the usual frequentist metrics for this model.
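First, though, a quick sketch that puts a number on "the majority of the training data": the empirical coverage of the 95% HPD band, i.e. the fraction of observations falling inside it (recall that GetPPC sorted the intervals by the band ratio, so the observations must be sorted the same way):

idx = np.argsort(ocxRatio.values)  # same ordering used inside GetPPC
chlSorted = chlObs.values[idx]
inside = (chlSorted >= sig1[:, 0]) & (chlSorted <= sig1[:, 1])
print("empirical coverage of the 95%% HPD: %.2f" % inside.mean())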
chlBayesReal = 0
for i in range(5):
    chlBayesReal += chainOCx_norm['a%d' % i].mean() * ocxRatio**i
print("mse, this fit: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlFreq))
print("r^2, this fit: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlFreq))
print("mse, OC4v6: %.2f " % mean_squared_error(dfLogOCx.chl.values, chlMdlOc4v6))
print("r^2, OC4v6: %.2f" % r2_score(dfLogOCx.chl.values, chlMdlOc4v6))
print("mse, Bayes chl: %.2f" % mean_squared_error(dfLogOCx.chl.values, chlBayesReal))
print("r^2, Bayes chl: %.2f" % r2_score(dfLogOCx.chl.values, chlBayesReal))
So frequentist diagnostics show that the Bayesian model fit matches the least squares fit on the training data, as it should, since I used weakly informative (nearly flat) priors. This is not sufficient to validate a model, however, as the model has not yet been tested on new data.
Predicting Outcome from New Data
A machine learning practitioner who knows what (s)he's doing will want to apply the model to data that was not used in the fitting process, to assess the model's skill at real world prediction. The advantage in this particular case, however, is that in addition to predictions we get uncertainties, which are not readily available off the shelf with scikit-learn or other machine learning libraries. Even better, we can use the now familiar posterior predictive check approach to do just that.
As a reminder, I cast the input data as a theano shared variable to allow replacing the training data with the test data. The test data consists of SeaWiFS matchup data that originates, like NOMAD, from SeaBASS. However, the data in this set was gathered after NOMAD was put together (2010 onward), to avoid data leakage.
Another interesting aspect of this particular test data set is that because it's a matchup set it contains corresponding in situ and satellite measurements, which makes for an interesting performance evaluation of the model.
# load test set
dfTest = pd.read_pickle('./bayesianChl_DATA/dfSwfChloraMups2010p.pkl')
dfTest.info()
dfLogTest = pd.DataFrame(columns=['id','seawifs_mxBl2Gr','insitu_chlor_a', 'seawifs_chlor_a'])
dfLogTest.id = dfTest.id
dfLogTest.seawifs_mxBl2Gr= np.log10(dfTest.seawifs_mxBl2Gr)
dfLogTest.insitu_chlor_a = np.log10(dfTest.insitu_chlor_a)
dfLogTest.seawifs_chlor_a = np.log10(dfTest.seawifs_chlor_a)
# SUBSTITUTING TRAINING SET FOR TEST SET - CHANGING VALUES HERE WILL ALSO CHANGE VALUES IN THE MODEL
ocxRatioShrd.set_value(dfLogTest.seawifs_mxBl2Gr.values)
sig0, sig1 = GetPPC(dfLogTest.seawifs_mxBl2Gr,mcChain=chainOCx_norm, pmModel=model_OCx_norm)
bayesCoeffs = [chainOCx_norm['a%d' %i].mean() for i in range(5)]
ax = PlotPPC(dfLogTest.seawifs_mxBl2Gr, dfLogTest.insitu_chlor_a,
             sig0, sig1, bayesCoeffs, oc4V6Coeffs=OC4v6_coeffs, myOLSCoeffs=myCoefs,
             titleTail='on Test Set')
chlMdlFreq = mdlFreq(dfLogTest.seawifs_mxBl2Gr) # evaluate polynomial with input data
chlMdlOc4v6 = mdlOC4v6(dfLogTest.seawifs_mxBl2Gr)
chlBayesReal = 0
for i in range(5):
    chlBayesReal += chainOCx_norm['a%d' % i].mean() * dfLogTest.seawifs_mxBl2Gr**i
print("mse, this fit: %.2f " % mean_squared_error(dfLogTest.insitu_chlor_a.values, chlMdlFreq))
print("r^2, this fit: %.2f" % r2_score(dfLogTest.insitu_chlor_a.values, chlMdlFreq))
print("mse, OC4v6: %.2f " % mean_squared_error(dfLogTest.insitu_chlor_a.values, chlMdlOc4v6))
print("r^2, OC4v6: %.2f" % r2_score(dfLogTest.insitu_chlor_a.values, chlMdlOc4v6))
print("mse, Bayes chl: %.2f" % mean_squared_error(dfLogTest.insitu_chlor_a.values, chlBayesReal))
print("r^2, Bayes chl: %.2f" % r2_score(dfLogTest.insitu_chlor_a.values, chlBayesReal))
Again, the frequentist diagnostics are equivalent for the frequentist OLS and Bayesian regressions, with the usual drop in performance from the training data to the out-of-sample test data.
However, the point of this post was to demonstrate that the Bayesian approach uncovers a wealth of information about the components of the model and the fit itself, not the least of which are naturally obtained model uncertainties. These I have expressed as credible intervals around the coefficients and around the fit. That's it for this post, but that's not it for this dataset. The diagnostics displayed above suggest that this model may be overly complex, and thereby prone to overfitting. In a subsequent post, I'll test a simpler model and use it as an excuse to explore the mysterious topic of model comparison.
Happy hacking, probabilistic or otherwise!