
Getting the NASA bio-Optical Marine Algorithm Dataset (NOMAD) into a Pandas DataFrame

In this post, I'm going to briefly describe how I download the NASA bio-Optical Marine Algorithm Dataset (NOMAD), a dataset created for algorithm development, extract the data I need, and store it all neatly in a Pandas DataFrame. Here I use the latest version, NOMAD v.2, created in 2008. First things first; imports!

In [1]:
import requests
import pandas as pd
import re
import numpy as np
import pickle
from IPython.core.display import display, HTML
In [2]:
display(HTML("<style>.container {width:90% !important;}</style>"))

The data can be accessed through a URL that I'll store in a string below.

In [3]:
NOMADV2url='https://seabass.gsfc.nasa.gov/wiki/NOMAD/nomad_seabass_v2.a_2008200.txt'

Next, I'll write a couple of functions. The first gets the data from the URL. The second parses the text returned by the first and puts it in a Pandas DataFrame. This second function makes more sense after inspecting the content of the page at the URL above.
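For reference, the file follows the SeaBASS text layout: a header of slash-prefixed lines, including a /fields= line naming the columns and an /end_header line marking where the data begin, followed by one comma-separated row per observation. Heavily abridged (only a handful of the 200+ fields shown), it looks roughly like this:

/fields=lat,lon,id,oisst,etopo2,chl,chl_a,...
/units=...
/end_header
38.4279,-76.61,1565,3.7,0.0,38.19,-999,...
38.368,-76.5,1566,3.7,0.0,35.01,-999,...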

In [4]:
def GetNomad(url=NOMADV2url):
    """Download the NOMAD text file and return it as a list of lines"""
    resp = requests.get(url)  # use the url argument, not the module-level constant
    content = resp.text.splitlines()
    resp.close()
    return content

def ParseTextFile(textFile, topickle=False, convert2DateTime=False, **kwargs):
    """
    * textFile: list of text lines, as returned by GetNomad
    * topickle: pickle resulting DataFrame if True
    * convert2DateTime: join date/time columns and convert entries to datetime objects
    * kwargs:
        pkl_fname: pickle file name to save DataFrame by, if topickle=True
    """
    # Pre-compute some regex
    columns = re.compile('^/fields=(.+)') # to get field/column names
    units = re.compile('^/units=(.+)') # to get units -- optional
    endHeader = re.compile('^/end_header') # to know when to start storing data
    # Set some milestones
    noFields = True
    getData = False
    # loop through the text data
    for line in textFile:
        if noFields:
            fieldStr = columns.findall(line)
            if len(fieldStr)>0:
                noFields = False
                fieldList = fieldStr[0].split(',')
                dataDict = dict.fromkeys(fieldList)
                continue # nothing left to do with this line, keep looping
        if not getData:
            if endHeader.match(line):
                # end of header reached, start acquiring data
                getData = True 
        else:
            dataList = line.split(',')
            for field,datum in zip(fieldList, dataList):
                if not dataDict[field]:
                    dataDict[field] = []
                dataDict[field].append(datum)
    df = pd.DataFrame(dataDict, columns=fieldList)
    if convert2DateTime:
        datetimelabels=['year', 'month', 'day', 'hour', 'minute', 'second']
        df['Datetime']= pd.to_datetime(df[datetimelabels],
                                       format='%Y-%m-%dT%H:%M:%S')
        df.drop(datetimelabels, axis=1, inplace=True)
    if topickle:
        fname=kwargs.pop('pkl_fname', 'dfNomad2.pkl')
        df.to_pickle(fname)
    return df
In [6]:
df = ParseTextFile(GetNomad(), topickle=True, convert2DateTime=True,
                  pkl_fname='./bayesianChl_DATA/dfNomadRaw.pkl')
In [7]:
df.head()
Out[7]:
lat lon id oisst etopo2 chl chl_a kd405 kd411 kd443 ... diato lut zea chl_b beta-car alpha-car alpha-beta-car flag cruise Datetime
0 38.4279 -76.61 1565 3.7 0.0 38.19 -999 -999 3.9455 3.1457 ... -999 -999 -999 -999 -999 -999 -999 20691 ace0301 2003-04-15 15:15:00
1 38.368 -76.5 1566 3.7 0.0 35.01 -999 -999 2.5637 2.0529 ... -999 -999 -999 -999 -999 -999 -999 20675 ace0301 2003-04-15 16:50:00
2 38.3074 -76.44 1567 3.7 1 26.91 -999 -999 2.1533 1.7531 ... -999 -999 -999 -999 -999 -999 -999 20691 ace0301 2003-04-15 17:50:00
3 38.6367 -76.32 1568 3.7 3 47.96 -999 -999 2.69 2.2985 ... -999 -999 -999 -999 -999 -999 -999 20675 ace0301 2003-04-17 18:15:00
4 38.3047 -76.44 1559 22.03 1 23.55 -999 -999 3.095 2.3966 ... -999 -999 -999 -999 -999 -999 -999 20691 ace0302 2003-07-21 18:27:00

5 rows × 212 columns

This DataFrame is quite large and unwieldy, with 212 columns. But Pandas makes it easy to extract the necessary data for a particular project. For my current project, which I'll go over in a subsequent post, I need field data relevant to the SeaWiFS sensor, in particular optical data at wavelengths 412, 443, 490, 510, 555, and 670 nm. First, let's look at the available bands as they appear in the spectral surface irradiance column labels, which start with 'es'.

In [8]:
bandregex = re.compile('es([0-9]+)')
bands = bandregex.findall(''.join(df.columns))
print(bands)
['405', '411', '443', '455', '465', '489', '510', '520', '530', '550', '555', '560', '565', '570', '590', '619', '625', '665', '670', '683']
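Just as a quick aside, here's a little sketch (using the bands list and NumPy imported above) that picks the closest available band to each nominal SeaWiFS wavelength; it should line up with the hard-coded list in the next cell.

targets = [412, 443, 490, 510, 555, 670]  # nominal SeaWiFS band centers
available = np.array(bands, dtype=int)
closest = [str(available[np.abs(available - t).argmin()]) for t in targets]
print(closest)  # expected: ['411', '443', '489', '510', '555', '670']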

Now I can extract data at the bands closest to what I need. In the process, I'm going to use the water-leaving radiance and spectral surface irradiance to compute remote sensing reflectance, rrs. I will store this new data in a new DataFrame, dfSwf.
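In other words, for each band $\lambda$, the next cell computes

$$rrs(\lambda) = \frac{L_w(\lambda)}{E_s(\lambda)},$$

where $L_w$ is the water-leaving radiance (the 'lw' columns) and $E_s$ the spectral surface irradiance (the 'es' columns).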

In [9]:
swfBands = ['411','443','489','510','555','670']
dfSwf = pd.DataFrame(columns=['rrs%s' % b for b in swfBands])
for b in swfBands:
    dfSwf.loc[:,'rrs%s'%b] = df.loc[:,'lw%s' % b].astype('f8') / df.loc[:,'es%s' % b].astype('f8')
In [10]:
dfSwf.head()
Out[10]:
rrs411 rrs443 rrs489 rrs510 rrs555 rrs670
0 0.001204 0.001686 0.003293 0.004036 0.007479 0.003465
1 0.001062 0.001384 0.002173 0.002499 0.004152 0.001695
2 0.000971 0.001185 0.001843 0.002288 0.004246 0.001612
3 0.001472 0.001741 0.002877 0.003664 0.006982 0.003234
4 0.000905 0.001022 0.001506 0.001903 0.002801 0.001791

For the projects I'm currently working on, I'll need to select a few more features from the initial dataset.

In [11]:
dfSwf['id'] = df.id.astype('i4') # in case I need to relate this data to the original
dfSwf['datetime'] = df.Datetime
dfSwf['hplc_chl'] = df.chl_a.astype('f8')
dfSwf['fluo_chl'] = df.chl.astype('f8')
dfSwf['lat'] = df.lat.astype('f8')
dfSwf['lon'] = df.lon.astype('f8')
dfSwf['depth'] = df.etopo2.astype('f8')
dfSwf['sst'] = df.oisst.astype('f8')
for band in swfBands:
    addprods=['a','ad','ag','ap','bb']
    for prod in addprods:
        dfSwf['%s%s' % (prod,band)] = df['%s%s' % (prod, band)].astype('f8')
dfSwf.replace(-999, np.nan, inplace=True) # -999 flags missing values; convert to NaN

Tallying the features I've gathered...

In [12]:
print(dfSwf.columns)
Index(['rrs411', 'rrs443', 'rrs489', 'rrs510', 'rrs555', 'rrs670', 'id',
       'datetime', 'hplc_chl', 'fluo_chl', 'lat', 'lon', 'depth', 'sst',
       'a411', 'ad411', 'ag411', 'ap411', 'bb411', 'a443', 'ad443', 'ag443',
       'ap443', 'bb443', 'a489', 'ad489', 'ag489', 'ap489', 'bb489', 'a510',
       'ad510', 'ag510', 'ap510', 'bb510', 'a555', 'ad555', 'ag555', 'ap555',
       'bb555', 'a670', 'ad670', 'ag670', 'ap670', 'bb670'],
      dtype='object')

That seems like a good dataset to start with. I'll pickle this DataFrame just in case.

In [14]:
dfSwf.to_pickle('./bayesianChl_DATA/dfNomadSWF.pkl')
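If I need it again later, say in another notebook, the pickled frame comes back with a one-liner:

dfSwf = pd.read_pickle('./bayesianChl_DATA/dfNomadSWF.pkl')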

The first project I'll tackle is a recasting of the OCx empirical band ratio algorithms within a Bayesian framework. For that, I can further cull the dataset following the "Data Source" section of the paper I'm using for comparison, Hu et al., 2012. That study draws from this same dataset, applying the following criteria:

  • only HPLC chlorophyll
  • chl > 0 where rrs > 0
  • depth > 30
  • lat $\in\left[-60,60\right]$

Applying these criteria should result in a dataset reduced to 136 observations.
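The cells below apply these filters one at a time. Just as a sketch (and assuming the criteria map onto the hplc_chl, rrs, depth, and lat columns the way I read them), the whole thing could also be rolled into a single boolean mask:

rrsCols = [col for col in dfSwf.columns if col.startswith('rrs')]
mask = (dfSwf.hplc_chl.notnull() & (dfSwf.hplc_chl > 0) &
        (dfSwf[rrsCols] > 0).all(axis=1) &
        (dfSwf.depth > 30) &
        dfSwf.lat.between(-60, 60))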

In [13]:
rrsCols = [col for col in dfSwf.columns if 'rrs' in col]
iwantcols=rrsCols + ['id', 'depth','hplc_chl','sst','lat','lon']
dfSwfHu = dfSwf[iwantcols].copy()
del dfSwf, df
In [14]:
dfSwfHu.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Data columns (total 12 columns):
rrs411      4459 non-null float64
rrs443      4459 non-null float64
rrs489      4459 non-null float64
rrs510      4459 non-null float64
rrs555      4459 non-null float64
rrs670      4459 non-null float64
id          4459 non-null int32
depth       4459 non-null float64
hplc_chl    1381 non-null float64
sst         4459 non-null float64
lat         4459 non-null float64
lon         4459 non-null float64
dtypes: float64(11), int32(1)
memory usage: 400.7 KB

Apparently the only null entries are in the hplc_chl column. Dropping the nulls in that column takes care of the first of the criteria listed above.

In [15]:
dfSwfHu.dropna(inplace=True)
In [16]:
dfSwfHu.describe()
Out[16]:
rrs411 rrs443 rrs489 rrs510 rrs555 rrs670 id depth hplc_chl sst lat lon
count 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000
mean 0.107953 0.005607 0.005228 0.120503 0.103273 0.585099 5175.859522 1936.900072 2.285293 19.159754 11.752954 -53.511340
std 0.302128 0.003567 0.002890 0.319626 0.298920 0.492136 2161.341423 1998.475771 5.752391 7.629313 32.350239 60.334355
min -0.000288 0.000190 0.000422 0.000304 0.000218 0.000000 644.000000 0.000000 0.017000 -1.460000 -64.418600 -177.004000
25% 0.002788 0.002884 0.003346 0.003098 0.001663 0.001184 2853.000000 39.000000 0.145000 15.090000 -10.776700 -88.669400
50% 0.005102 0.004763 0.004880 0.003804 0.002305 1.000000 6181.000000 753.000000 0.538000 20.080000 29.842400 -63.852000
75% 0.009592 0.007798 0.006380 0.005700 0.005715 1.000000 6796.000000 3992.000000 1.694000 25.450000 34.298000 -21.500800
max 1.000000 0.027601 0.025900 1.000000 1.000000 1.000000 7765.000000 6041.000000 70.213300 30.760000 54.000300 173.920000

According to the summary table above, I don't need to worry about zero chl as per the criteria. However, it appears several reflectances have spurious 1.0000 values. Since these were never mentioned in the paper, I'll first cull the dataset according to the depth and lat criteria and see if that takes care of cleaning up those values as well. This should land me at 136 observations.

In [17]:
dfSwfHu=dfSwfHu.loc[((dfSwfHu.depth>30) &\
                     (dfSwfHu.lat>=-60) & (dfSwfHu.lat<=60)),:]
In [18]:
dfSwfHu.describe()
Out[18]:
rrs411 rrs443 rrs489 rrs510 rrs555 rrs670 id depth hplc_chl sst lat lon
count 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000
mean 0.152406 0.005741 0.004797 0.161636 0.145059 0.705510 5347.700207 2552.714730 1.094060 19.800280 11.748191 -42.971192
std 0.352490 0.003378 0.002059 0.364331 0.349634 0.455923 2059.463026 1963.955594 3.129181 6.061552 28.743822 67.950684
min -0.000212 0.000190 0.000422 0.000304 0.000218 0.000000 644.000000 31.000000 0.017000 2.020000 -59.756300 -177.004000
25% 0.003000 0.002900 0.003256 0.002952 0.001592 0.000598 4260.750000 338.500000 0.114000 15.545000 -12.503000 -117.247750
50% 0.006208 0.005254 0.004719 0.003620 0.001988 1.000000 6198.500000 3066.500000 0.265000 19.410000 23.902400 -39.868200
75% 0.011122 0.008080 0.006190 0.004477 0.003064 1.000000 6662.500000 4312.000000 0.936500 25.360000 34.252500 -17.495375
max 1.000000 0.016246 0.019676 1.000000 1.000000 1.000000 7760.000000 6041.000000 53.002700 30.180000 54.000300 173.920000

Nope. We're down to 964 observations. So much for reproducibility via publication. Getting rid of the spurious rrs values (only the bands whose maxima hit 1.0 in the table above need filtering)...

In [19]:
dfSwfHu = dfSwfHu.loc[((dfSwfHu.rrs411<1.0) & (dfSwfHu.rrs510<1.0)&\
                               (dfSwfHu.rrs555<1.0) & (dfSwfHu.rrs670<1.0)),:]
In [20]:
dfSwfHu.describe()
Out[20]:
rrs411 rrs443 rrs489 rrs510 rrs555 rrs670 id depth hplc_chl sst lat lon
count 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000
mean 0.005336 0.004684 0.004071 0.003217 0.002535 0.000594 6240.088235 2155.500000 1.942732 21.763382 12.399556 -72.949479
std 0.005361 0.003904 0.002099 0.001401 0.001920 0.001094 1922.935381 2018.518092 6.550881 6.950208 25.752077 52.987492
min 0.000051 0.000190 0.000422 0.000497 0.000639 0.000000 2640.000000 31.000000 0.017000 5.260000 -35.164400 -170.045000
25% 0.001404 0.001766 0.002409 0.002365 0.001568 0.000094 5903.750000 64.000000 0.145750 16.380000 -1.261000 -90.375800
50% 0.002839 0.002850 0.003435 0.003235 0.001857 0.000175 7226.500000 2809.500000 0.451500 25.625000 11.413400 -73.367600
75% 0.007855 0.007016 0.005809 0.003892 0.002625 0.000503 7314.000000 4305.750000 1.130750 27.290000 37.357600 -56.020225
max 0.022010 0.016246 0.009500 0.009600 0.012200 0.007900 7747.000000 5526.000000 53.002700 30.180000 43.619200 170.000000

136 observations. Success! Once again, I'll pickle this DataFrame.

In [21]:
dfSwfHu.to_pickle('/accounts/ekarakoy/DATA/NOMAD/dfSwfHuOcxCI_2012.pkl')

That's it. Until next time, Happy Hacking!
