{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this post, I'll briefly describe how I download the [NASA bio-Optical Marine Algorithm Dataset, or NOMAD](https://seabass.gsfc.nasa.gov/wiki/NOMAD), which was created for algorithm development, extract the data I need, and store it all neatly in a Pandas DataFrame. Here I use the latest dataset, NOMAD v.2, created in 2008.\n",
    "<!-- Teaser_End -->\n",
    "First things first; imports!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "import pandas as pd\n",
    "import re\n",
    "import numpy as np\n",
    "import pickle\n",
    "from IPython.display import display, HTML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container {width:90% !important;}</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(HTML(\"<style>.container {width:90% !important;}</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data can be accessed through a URL that I'll store in a string below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "NOMADV2url='https://seabass.gsfc.nasa.gov/wiki/NOMAD/nomad_seabass_v2.a_2008200.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, I'll write a couple of functions. The first gets the data from the URL. The second parses the text returned by the first and puts it in a Pandas DataFrame. The second function makes more sense after inspecting the content of the page at the URL above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def GetNomad(url=NOMADV2url):\n",
    "    \"\"\"Download and return data as text\"\"\"\n",
    "    resp = requests.get(NOMADV2url)\n",
    "    content = resp.text.splitlines()    \n",
    "    resp.close()\n",
    "    return content\n",
    "\n",
    "def ParseTextFile(textFile, topickle=False, convert2DateTime=False, **kwargs):\n",
    "    \"\"\"\n",
    "    * topickle: pickle resulting DataFrame if True\n",
    "    * convert2DateTime: join date/time columns and convert entries to datetime objects\n",
    "    * kwargs:\n",
    "        pkl_fname: pickle file name to save DataFrame by, if topickle=True\n",
    "    \"\"\"\n",
    "    # Pre-compute some regex\n",
    "    columns = re.compile('^/fields=(.+)') # to get field/column names\n",
    "    units = re.compile('^/units=(.+)') # to get units -- optional\n",
    "    endHeader = re.compile('^/end_header') # to know when to start storing data\n",
    "    # Set some milestones\n",
    "    noFields = True\n",
    "    getData = False\n",
    "    # loop through the text data\n",
    "    for line in textFile:\n",
    "        if noFields:\n",
    "            fieldStr = columns.findall(line)\n",
    "            if len(fieldStr)>0:\n",
    "                noFields = False\n",
    "                fieldList = fieldStr[0].split(',')\n",
    "                dataDict = dict.fromkeys(fieldList)\n",
    "                continue # nothing left to do with this line, keep looping\n",
    "        if not getData:\n",
    "            if endHeader.match(line):\n",
    "                # end of header reached, start acquiring data\n",
    "                getData = True \n",
    "        else:\n",
    "            dataList = line.split(',')\n",
    "            for field,datum in zip(fieldList, dataList):\n",
    "                if not dataDict[field]:\n",
    "                    dataDict[field] = []\n",
    "                dataDict[field].append(datum)\n",
    "    df = pd.DataFrame(dataDict, columns=fieldList)\n",
    "    if convert2DateTime:\n",
    "        datetimelabels=['year', 'month', 'day', 'hour', 'minute', 'second']\n",
    "        df['Datetime']= pd.to_datetime(df[datetimelabels],\n",
    "                                       format='%Y-%m-%dT%H:%M:%S')\n",
    "        df.drop(datetimelabels, axis=1, inplace=True)\n",
    "    if topickle:\n",
    "        fname=kwargs.pop('pkl_fname', 'dfNomad2.pkl')\n",
    "        df.to_pickle(fname)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = ParseTextFile(GetNomad(), topickle=True, convert2DateTime=True,\n",
    "                  pkl_fname='./bayesianChl_DATA/dfNomadRaw.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>id</th>\n",
       "      <th>oisst</th>\n",
       "      <th>etopo2</th>\n",
       "      <th>chl</th>\n",
       "      <th>chl_a</th>\n",
       "      <th>kd405</th>\n",
       "      <th>kd411</th>\n",
       "      <th>kd443</th>\n",
       "      <th>...</th>\n",
       "      <th>diato</th>\n",
       "      <th>lut</th>\n",
       "      <th>zea</th>\n",
       "      <th>chl_b</th>\n",
       "      <th>beta-car</th>\n",
       "      <th>alpha-car</th>\n",
       "      <th>alpha-beta-car</th>\n",
       "      <th>flag</th>\n",
       "      <th>cruise</th>\n",
       "      <th>Datetime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>38.4279</td>\n",
       "      <td>-76.61</td>\n",
       "      <td>1565</td>\n",
       "      <td>3.7</td>\n",
       "      <td>0.0</td>\n",
       "      <td>38.19</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>3.9455</td>\n",
       "      <td>3.1457</td>\n",
       "      <td>...</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>20691</td>\n",
       "      <td>ace0301</td>\n",
       "      <td>2003-04-15 15:15:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>38.368</td>\n",
       "      <td>-76.5</td>\n",
       "      <td>1566</td>\n",
       "      <td>3.7</td>\n",
       "      <td>0.0</td>\n",
       "      <td>35.01</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>2.5637</td>\n",
       "      <td>2.0529</td>\n",
       "      <td>...</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>20675</td>\n",
       "      <td>ace0301</td>\n",
       "      <td>2003-04-15 16:50:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38.3074</td>\n",
       "      <td>-76.44</td>\n",
       "      <td>1567</td>\n",
       "      <td>3.7</td>\n",
       "      <td>1</td>\n",
       "      <td>26.91</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>2.1533</td>\n",
       "      <td>1.7531</td>\n",
       "      <td>...</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>20691</td>\n",
       "      <td>ace0301</td>\n",
       "      <td>2003-04-15 17:50:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>38.6367</td>\n",
       "      <td>-76.32</td>\n",
       "      <td>1568</td>\n",
       "      <td>3.7</td>\n",
       "      <td>3</td>\n",
       "      <td>47.96</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>2.69</td>\n",
       "      <td>2.2985</td>\n",
       "      <td>...</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>20675</td>\n",
       "      <td>ace0301</td>\n",
       "      <td>2003-04-17 18:15:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>38.3047</td>\n",
       "      <td>-76.44</td>\n",
       "      <td>1559</td>\n",
       "      <td>22.03</td>\n",
       "      <td>1</td>\n",
       "      <td>23.55</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>3.095</td>\n",
       "      <td>2.3966</td>\n",
       "      <td>...</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>-999</td>\n",
       "      <td>20691</td>\n",
       "      <td>ace0302</td>\n",
       "      <td>2003-07-21 18:27:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 212 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       lat     lon    id  oisst etopo2    chl chl_a kd405   kd411   kd443  \\\n",
       "0  38.4279  -76.61  1565    3.7    0.0  38.19  -999  -999  3.9455  3.1457   \n",
       "1   38.368   -76.5  1566    3.7    0.0  35.01  -999  -999  2.5637  2.0529   \n",
       "2  38.3074  -76.44  1567    3.7      1  26.91  -999  -999  2.1533  1.7531   \n",
       "3  38.6367  -76.32  1568    3.7      3  47.96  -999  -999    2.69  2.2985   \n",
       "4  38.3047  -76.44  1559  22.03      1  23.55  -999  -999   3.095  2.3966   \n",
       "\n",
       "          ...         diato   lut   zea chl_b beta-car alpha-car  \\\n",
       "0         ...          -999  -999  -999  -999     -999      -999   \n",
       "1         ...          -999  -999  -999  -999     -999      -999   \n",
       "2         ...          -999  -999  -999  -999     -999      -999   \n",
       "3         ...          -999  -999  -999  -999     -999      -999   \n",
       "4         ...          -999  -999  -999  -999     -999      -999   \n",
       "\n",
       "  alpha-beta-car   flag   cruise            Datetime  \n",
       "0           -999  20691  ace0301 2003-04-15 15:15:00  \n",
       "1           -999  20675  ace0301 2003-04-15 16:50:00  \n",
       "2           -999  20691  ace0301 2003-04-15 17:50:00  \n",
       "3           -999  20675  ace0301 2003-04-17 18:15:00  \n",
       "4           -999  20691  ace0302 2003-07-21 18:27:00  \n",
       "\n",
       "[5 rows x 212 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This DataFrame is quite large and unwieldy, with 212 columns. But Pandas makes it easy to extract the necessary data for a particular project. For my current project, which I'll go over in a subsequent post, I need field data relevant to the [SeaWiFS sensor](https://en.wikipedia.org/wiki/SeaWiFS), in particular optical data at wavelengths 412, 443, 490, 510, 555, and 670 nm. First, let's look at the available bands as they appear in the spectral surface irradiance column labels, which start with 'es'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['405', '411', '443', '455', '465', '489', '510', '520', '530', '550', '555', '560', '565', '570', '590', '619', '625', '665', '670', '683']\n"
     ]
    }
   ],
   "source": [
    "bandregex = re.compile('es([0-9]+)')\n",
    "bands = bandregex.findall(''.join(df.columns))\n",
    "print(bands)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now I can extract data for the bands closest to what I need. In the process, I'll divide water-leaving radiance (lw) by spectral surface irradiance (es) to compute remote sensing reflectance, rrs. I'll store the result in a new DataFrame, dfSwf."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "swfBands = ['411','443','489','510','555','670']\n",
    "dfSwf = pd.DataFrame(columns=['rrs%s' % b for b in swfBands])\n",
    "for b in swfBands:\n",
    "    dfSwf.loc[:,'rrs%s'%b] = df.loc[:,'lw%s' % b].astype('f8') / df.loc[:,'es%s' % b].astype('f8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rrs411</th>\n",
       "      <th>rrs443</th>\n",
       "      <th>rrs489</th>\n",
       "      <th>rrs510</th>\n",
       "      <th>rrs555</th>\n",
       "      <th>rrs670</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.001204</td>\n",
       "      <td>0.001686</td>\n",
       "      <td>0.003293</td>\n",
       "      <td>0.004036</td>\n",
       "      <td>0.007479</td>\n",
       "      <td>0.003465</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.001062</td>\n",
       "      <td>0.001384</td>\n",
       "      <td>0.002173</td>\n",
       "      <td>0.002499</td>\n",
       "      <td>0.004152</td>\n",
       "      <td>0.001695</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000971</td>\n",
       "      <td>0.001185</td>\n",
       "      <td>0.001843</td>\n",
       "      <td>0.002288</td>\n",
       "      <td>0.004246</td>\n",
       "      <td>0.001612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.001472</td>\n",
       "      <td>0.001741</td>\n",
       "      <td>0.002877</td>\n",
       "      <td>0.003664</td>\n",
       "      <td>0.006982</td>\n",
       "      <td>0.003234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000905</td>\n",
       "      <td>0.001022</td>\n",
       "      <td>0.001506</td>\n",
       "      <td>0.001903</td>\n",
       "      <td>0.002801</td>\n",
       "      <td>0.001791</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     rrs411    rrs443    rrs489    rrs510    rrs555    rrs670\n",
       "0  0.001204  0.001686  0.003293  0.004036  0.007479  0.003465\n",
       "1  0.001062  0.001384  0.002173  0.002499  0.004152  0.001695\n",
       "2  0.000971  0.001185  0.001843  0.002288  0.004246  0.001612\n",
       "3  0.001472  0.001741  0.002877  0.003664  0.006982  0.003234\n",
       "4  0.000905  0.001022  0.001506  0.001903  0.002801  0.001791"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfSwf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the projects I'm currently working on, I'll need to select a few more features from the initial dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dfSwf['id'] = df.id.astype('i4') # in case I need to relate this data to the original\n",
    "dfSwf['datetime'] = df.Datetime\n",
    "dfSwf['hplc_chl'] = df.chl_a.astype('f8')\n",
    "dfSwf['fluo_chl'] = df.chl.astype('f8')\n",
    "dfSwf['lat'] = df.lat.astype('f8')\n",
    "dfSwf['lon'] = df.lon.astype('f8')\n",
    "dfSwf['depth'] = df.etopo2.astype('f8')\n",
    "dfSwf['sst'] = df.oisst.astype('f8')\n",
    "for band in swfBands:\n",
    "    addprods=['a','ad','ag','ap','bb']\n",
    "    for prod in addprods:\n",
    "        dfSwf['%s%s' % (prod,band)] = df['%s%s' % (prod, band)].astype('f8')\n",
    "dfSwf.replace(-999,np.nan, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tallying the features I've gathered..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['rrs411', 'rrs443', 'rrs489', 'rrs510', 'rrs555', 'rrs670', 'id',\n",
      "       'datetime', 'hplc_chl', 'fluo_chl', 'lat', 'lon', 'depth', 'sst',\n",
      "       'a411', 'ad411', 'ag411', 'ap411', 'bb411', 'a443', 'ad443', 'ag443',\n",
      "       'ap443', 'bb443', 'a489', 'ad489', 'ag489', 'ap489', 'bb489', 'a510',\n",
      "       'ad510', 'ag510', 'ap510', 'bb510', 'a555', 'ad555', 'ag555', 'ap555',\n",
      "       'bb555', 'a670', 'ad670', 'ag670', 'ap670', 'bb670'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print(dfSwf.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That seems like a good dataset to start with. I'll pickle this DataFrame just in case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfSwf.to_pickle('./bayesianChl_DATA/dfNomadSWF.pkl')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first project I'll tackle is a recasting of the OCx empirical band ratio algorithms within a Bayesian framework. For that, I can further cull the dataset following the \"Data Source\" section of the paper I'm using for comparison, [Hu *et al.*, 2012](http://onlinelibrary.wiley.com/doi/10.1029/2011JC007395/pdf). That study draws from this same dataset and applies the following criteria:\n",
    "* only hplc chlorophyll\n",
    "* chl>0 where rrs>0\n",
    "* depth>30\n",
    "* lat $\\in\\left[-60,60\\right]$\n",
    "\n",
    "Applying these criteria should result in a dataset reduced to ***136*** observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "rrsCols = [col for col in dfSwf.columns if 'rrs' in col]\n",
    "iwantcols=rrsCols + ['id', 'depth','hplc_chl','sst','lat','lon']\n",
    "dfSwfHu = dfSwf[iwantcols].copy()\n",
    "del dfSwf, df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 4459 entries, 0 to 4458\n",
      "Data columns (total 12 columns):\n",
      "rrs411      4459 non-null float64\n",
      "rrs443      4459 non-null float64\n",
      "rrs489      4459 non-null float64\n",
      "rrs510      4459 non-null float64\n",
      "rrs555      4459 non-null float64\n",
      "rrs670      4459 non-null float64\n",
      "id          4459 non-null int32\n",
      "depth       4459 non-null float64\n",
      "hplc_chl    1381 non-null float64\n",
      "sst         4459 non-null float64\n",
      "lat         4459 non-null float64\n",
      "lon         4459 non-null float64\n",
      "dtypes: float64(11), int32(1)\n",
      "memory usage: 400.7 KB\n"
     ]
    }
   ],
   "source": [
    "dfSwfHu.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apparently the only null entries are in the hplc_chl column. Dropping the nulls in that column takes care of the first of the criteria listed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dfSwfHu.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rrs411</th>\n",
       "      <th>rrs443</th>\n",
       "      <th>rrs489</th>\n",
       "      <th>rrs510</th>\n",
       "      <th>rrs555</th>\n",
       "      <th>rrs670</th>\n",
       "      <th>id</th>\n",
       "      <th>depth</th>\n",
       "      <th>hplc_chl</th>\n",
       "      <th>sst</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "      <td>1381.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.107953</td>\n",
       "      <td>0.005607</td>\n",
       "      <td>0.005228</td>\n",
       "      <td>0.120503</td>\n",
       "      <td>0.103273</td>\n",
       "      <td>0.585099</td>\n",
       "      <td>5175.859522</td>\n",
       "      <td>1936.900072</td>\n",
       "      <td>2.285293</td>\n",
       "      <td>19.159754</td>\n",
       "      <td>11.752954</td>\n",
       "      <td>-53.511340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.302128</td>\n",
       "      <td>0.003567</td>\n",
       "      <td>0.002890</td>\n",
       "      <td>0.319626</td>\n",
       "      <td>0.298920</td>\n",
       "      <td>0.492136</td>\n",
       "      <td>2161.341423</td>\n",
       "      <td>1998.475771</td>\n",
       "      <td>5.752391</td>\n",
       "      <td>7.629313</td>\n",
       "      <td>32.350239</td>\n",
       "      <td>60.334355</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-0.000288</td>\n",
       "      <td>0.000190</td>\n",
       "      <td>0.000422</td>\n",
       "      <td>0.000304</td>\n",
       "      <td>0.000218</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>644.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.017000</td>\n",
       "      <td>-1.460000</td>\n",
       "      <td>-64.418600</td>\n",
       "      <td>-177.004000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.002788</td>\n",
       "      <td>0.002884</td>\n",
       "      <td>0.003346</td>\n",
       "      <td>0.003098</td>\n",
       "      <td>0.001663</td>\n",
       "      <td>0.001184</td>\n",
       "      <td>2853.000000</td>\n",
       "      <td>39.000000</td>\n",
       "      <td>0.145000</td>\n",
       "      <td>15.090000</td>\n",
       "      <td>-10.776700</td>\n",
       "      <td>-88.669400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.005102</td>\n",
       "      <td>0.004763</td>\n",
       "      <td>0.004880</td>\n",
       "      <td>0.003804</td>\n",
       "      <td>0.002305</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>6181.000000</td>\n",
       "      <td>753.000000</td>\n",
       "      <td>0.538000</td>\n",
       "      <td>20.080000</td>\n",
       "      <td>29.842400</td>\n",
       "      <td>-63.852000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.009592</td>\n",
       "      <td>0.007798</td>\n",
       "      <td>0.006380</td>\n",
       "      <td>0.005700</td>\n",
       "      <td>0.005715</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>6796.000000</td>\n",
       "      <td>3992.000000</td>\n",
       "      <td>1.694000</td>\n",
       "      <td>25.450000</td>\n",
       "      <td>34.298000</td>\n",
       "      <td>-21.500800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.027601</td>\n",
       "      <td>0.025900</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>7765.000000</td>\n",
       "      <td>6041.000000</td>\n",
       "      <td>70.213300</td>\n",
       "      <td>30.760000</td>\n",
       "      <td>54.000300</td>\n",
       "      <td>173.920000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            rrs411       rrs443       rrs489       rrs510       rrs555  \\\n",
       "count  1381.000000  1381.000000  1381.000000  1381.000000  1381.000000   \n",
       "mean      0.107953     0.005607     0.005228     0.120503     0.103273   \n",
       "std       0.302128     0.003567     0.002890     0.319626     0.298920   \n",
       "min      -0.000288     0.000190     0.000422     0.000304     0.000218   \n",
       "25%       0.002788     0.002884     0.003346     0.003098     0.001663   \n",
       "50%       0.005102     0.004763     0.004880     0.003804     0.002305   \n",
       "75%       0.009592     0.007798     0.006380     0.005700     0.005715   \n",
       "max       1.000000     0.027601     0.025900     1.000000     1.000000   \n",
       "\n",
       "            rrs670           id        depth     hplc_chl          sst  \\\n",
       "count  1381.000000  1381.000000  1381.000000  1381.000000  1381.000000   \n",
       "mean      0.585099  5175.859522  1936.900072     2.285293    19.159754   \n",
       "std       0.492136  2161.341423  1998.475771     5.752391     7.629313   \n",
       "min       0.000000   644.000000     0.000000     0.017000    -1.460000   \n",
       "25%       0.001184  2853.000000    39.000000     0.145000    15.090000   \n",
       "50%       1.000000  6181.000000   753.000000     0.538000    20.080000   \n",
       "75%       1.000000  6796.000000  3992.000000     1.694000    25.450000   \n",
       "max       1.000000  7765.000000  6041.000000    70.213300    30.760000   \n",
       "\n",
       "               lat          lon  \n",
       "count  1381.000000  1381.000000  \n",
       "mean     11.752954   -53.511340  \n",
       "std      32.350239    60.334355  \n",
       "min     -64.418600  -177.004000  \n",
       "25%     -10.776700   -88.669400  \n",
       "50%      29.842400   -63.852000  \n",
       "75%      34.298000   -21.500800  \n",
       "max      54.000300   173.920000  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfSwfHu.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to the summary table above, I don't need to worry about zero chl values. However, several reflectance columns appear to contain spurious 1.0000 values. Since these were never mentioned in the paper, I'll first cull the dataset according to the depth and lat criteria and see whether that cleans up those values as well. This should leave me with 136 observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dfSwfHu = dfSwfHu.loc[(dfSwfHu.depth > 30) &\n",
    "                      (dfSwfHu.lat >= -60) & (dfSwfHu.lat <= 60), :]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rrs411</th>\n",
       "      <th>rrs443</th>\n",
       "      <th>rrs489</th>\n",
       "      <th>rrs510</th>\n",
       "      <th>rrs555</th>\n",
       "      <th>rrs670</th>\n",
       "      <th>id</th>\n",
       "      <th>depth</th>\n",
       "      <th>hplc_chl</th>\n",
       "      <th>sst</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "      <td>964.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.152406</td>\n",
       "      <td>0.005741</td>\n",
       "      <td>0.004797</td>\n",
       "      <td>0.161636</td>\n",
       "      <td>0.145059</td>\n",
       "      <td>0.705510</td>\n",
       "      <td>5347.700207</td>\n",
       "      <td>2552.714730</td>\n",
       "      <td>1.094060</td>\n",
       "      <td>19.800280</td>\n",
       "      <td>11.748191</td>\n",
       "      <td>-42.971192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.352490</td>\n",
       "      <td>0.003378</td>\n",
       "      <td>0.002059</td>\n",
       "      <td>0.364331</td>\n",
       "      <td>0.349634</td>\n",
       "      <td>0.455923</td>\n",
       "      <td>2059.463026</td>\n",
       "      <td>1963.955594</td>\n",
       "      <td>3.129181</td>\n",
       "      <td>6.061552</td>\n",
       "      <td>28.743822</td>\n",
       "      <td>67.950684</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-0.000212</td>\n",
       "      <td>0.000190</td>\n",
       "      <td>0.000422</td>\n",
       "      <td>0.000304</td>\n",
       "      <td>0.000218</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>644.000000</td>\n",
       "      <td>31.000000</td>\n",
       "      <td>0.017000</td>\n",
       "      <td>2.020000</td>\n",
       "      <td>-59.756300</td>\n",
       "      <td>-177.004000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.003000</td>\n",
       "      <td>0.002900</td>\n",
       "      <td>0.003256</td>\n",
       "      <td>0.002952</td>\n",
       "      <td>0.001592</td>\n",
       "      <td>0.000598</td>\n",
       "      <td>4260.750000</td>\n",
       "      <td>338.500000</td>\n",
       "      <td>0.114000</td>\n",
       "      <td>15.545000</td>\n",
       "      <td>-12.503000</td>\n",
       "      <td>-117.247750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.006208</td>\n",
       "      <td>0.005254</td>\n",
       "      <td>0.004719</td>\n",
       "      <td>0.003620</td>\n",
       "      <td>0.001988</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>6198.500000</td>\n",
       "      <td>3066.500000</td>\n",
       "      <td>0.265000</td>\n",
       "      <td>19.410000</td>\n",
       "      <td>23.902400</td>\n",
       "      <td>-39.868200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.011122</td>\n",
       "      <td>0.008080</td>\n",
       "      <td>0.006190</td>\n",
       "      <td>0.004477</td>\n",
       "      <td>0.003064</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>6662.500000</td>\n",
       "      <td>4312.000000</td>\n",
       "      <td>0.936500</td>\n",
       "      <td>25.360000</td>\n",
       "      <td>34.252500</td>\n",
       "      <td>-17.495375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.016246</td>\n",
       "      <td>0.019676</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>7760.000000</td>\n",
       "      <td>6041.000000</td>\n",
       "      <td>53.002700</td>\n",
       "      <td>30.180000</td>\n",
       "      <td>54.000300</td>\n",
       "      <td>173.920000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           rrs411      rrs443      rrs489      rrs510      rrs555      rrs670  \\\n",
       "count  964.000000  964.000000  964.000000  964.000000  964.000000  964.000000   \n",
       "mean     0.152406    0.005741    0.004797    0.161636    0.145059    0.705510   \n",
       "std      0.352490    0.003378    0.002059    0.364331    0.349634    0.455923   \n",
       "min     -0.000212    0.000190    0.000422    0.000304    0.000218    0.000000   \n",
       "25%      0.003000    0.002900    0.003256    0.002952    0.001592    0.000598   \n",
       "50%      0.006208    0.005254    0.004719    0.003620    0.001988    1.000000   \n",
       "75%      0.011122    0.008080    0.006190    0.004477    0.003064    1.000000   \n",
       "max      1.000000    0.016246    0.019676    1.000000    1.000000    1.000000   \n",
       "\n",
       "                id        depth    hplc_chl         sst         lat  \\\n",
       "count   964.000000   964.000000  964.000000  964.000000  964.000000   \n",
       "mean   5347.700207  2552.714730    1.094060   19.800280   11.748191   \n",
       "std    2059.463026  1963.955594    3.129181    6.061552   28.743822   \n",
       "min     644.000000    31.000000    0.017000    2.020000  -59.756300   \n",
       "25%    4260.750000   338.500000    0.114000   15.545000  -12.503000   \n",
       "50%    6198.500000  3066.500000    0.265000   19.410000   23.902400   \n",
       "75%    6662.500000  4312.000000    0.936500   25.360000   34.252500   \n",
       "max    7760.000000  6041.000000   53.002700   30.180000   54.000300   \n",
       "\n",
       "              lon  \n",
       "count  964.000000  \n",
       "mean   -42.971192  \n",
       "std     67.950684  \n",
       "min   -177.004000  \n",
       "25%   -117.247750  \n",
       "50%    -39.868200  \n",
       "75%    -17.495375  \n",
       "max    173.920000  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfSwfHu.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nope. We're down to 964 observations. So much for reproducibility via publication. Getting rid of spurions rrs values..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dfSwfHu = dfSwfHu.loc[((dfSwfHu.rrs411<1.0) & (dfSwfHu.rrs510<1.0)&\\\n",
    "                               (dfSwfHu.rrs555<1.0) & (dfSwfHu.rrs670<1.0)),:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rrs411</th>\n",
       "      <th>rrs443</th>\n",
       "      <th>rrs489</th>\n",
       "      <th>rrs510</th>\n",
       "      <th>rrs555</th>\n",
       "      <th>rrs670</th>\n",
       "      <th>id</th>\n",
       "      <th>depth</th>\n",
       "      <th>hplc_chl</th>\n",
       "      <th>sst</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "      <td>136.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.005336</td>\n",
       "      <td>0.004684</td>\n",
       "      <td>0.004071</td>\n",
       "      <td>0.003217</td>\n",
       "      <td>0.002535</td>\n",
       "      <td>0.000594</td>\n",
       "      <td>6240.088235</td>\n",
       "      <td>2155.500000</td>\n",
       "      <td>1.942732</td>\n",
       "      <td>21.763382</td>\n",
       "      <td>12.399556</td>\n",
       "      <td>-72.949479</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.005361</td>\n",
       "      <td>0.003904</td>\n",
       "      <td>0.002099</td>\n",
       "      <td>0.001401</td>\n",
       "      <td>0.001920</td>\n",
       "      <td>0.001094</td>\n",
       "      <td>1922.935381</td>\n",
       "      <td>2018.518092</td>\n",
       "      <td>6.550881</td>\n",
       "      <td>6.950208</td>\n",
       "      <td>25.752077</td>\n",
       "      <td>52.987492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000051</td>\n",
       "      <td>0.000190</td>\n",
       "      <td>0.000422</td>\n",
       "      <td>0.000497</td>\n",
       "      <td>0.000639</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2640.000000</td>\n",
       "      <td>31.000000</td>\n",
       "      <td>0.017000</td>\n",
       "      <td>5.260000</td>\n",
       "      <td>-35.164400</td>\n",
       "      <td>-170.045000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.001404</td>\n",
       "      <td>0.001766</td>\n",
       "      <td>0.002409</td>\n",
       "      <td>0.002365</td>\n",
       "      <td>0.001568</td>\n",
       "      <td>0.000094</td>\n",
       "      <td>5903.750000</td>\n",
       "      <td>64.000000</td>\n",
       "      <td>0.145750</td>\n",
       "      <td>16.380000</td>\n",
       "      <td>-1.261000</td>\n",
       "      <td>-90.375800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.002839</td>\n",
       "      <td>0.002850</td>\n",
       "      <td>0.003435</td>\n",
       "      <td>0.003235</td>\n",
       "      <td>0.001857</td>\n",
       "      <td>0.000175</td>\n",
       "      <td>7226.500000</td>\n",
       "      <td>2809.500000</td>\n",
       "      <td>0.451500</td>\n",
       "      <td>25.625000</td>\n",
       "      <td>11.413400</td>\n",
       "      <td>-73.367600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.007855</td>\n",
       "      <td>0.007016</td>\n",
       "      <td>0.005809</td>\n",
       "      <td>0.003892</td>\n",
       "      <td>0.002625</td>\n",
       "      <td>0.000503</td>\n",
       "      <td>7314.000000</td>\n",
       "      <td>4305.750000</td>\n",
       "      <td>1.130750</td>\n",
       "      <td>27.290000</td>\n",
       "      <td>37.357600</td>\n",
       "      <td>-56.020225</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.022010</td>\n",
       "      <td>0.016246</td>\n",
       "      <td>0.009500</td>\n",
       "      <td>0.009600</td>\n",
       "      <td>0.012200</td>\n",
       "      <td>0.007900</td>\n",
       "      <td>7747.000000</td>\n",
       "      <td>5526.000000</td>\n",
       "      <td>53.002700</td>\n",
       "      <td>30.180000</td>\n",
       "      <td>43.619200</td>\n",
       "      <td>170.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           rrs411      rrs443      rrs489      rrs510      rrs555      rrs670  \\\n",
       "count  136.000000  136.000000  136.000000  136.000000  136.000000  136.000000   \n",
       "mean     0.005336    0.004684    0.004071    0.003217    0.002535    0.000594   \n",
       "std      0.005361    0.003904    0.002099    0.001401    0.001920    0.001094   \n",
       "min      0.000051    0.000190    0.000422    0.000497    0.000639    0.000000   \n",
       "25%      0.001404    0.001766    0.002409    0.002365    0.001568    0.000094   \n",
       "50%      0.002839    0.002850    0.003435    0.003235    0.001857    0.000175   \n",
       "75%      0.007855    0.007016    0.005809    0.003892    0.002625    0.000503   \n",
       "max      0.022010    0.016246    0.009500    0.009600    0.012200    0.007900   \n",
       "\n",
       "                id        depth    hplc_chl         sst         lat  \\\n",
       "count   136.000000   136.000000  136.000000  136.000000  136.000000   \n",
       "mean   6240.088235  2155.500000    1.942732   21.763382   12.399556   \n",
       "std    1922.935381  2018.518092    6.550881    6.950208   25.752077   \n",
       "min    2640.000000    31.000000    0.017000    5.260000  -35.164400   \n",
       "25%    5903.750000    64.000000    0.145750   16.380000   -1.261000   \n",
       "50%    7226.500000  2809.500000    0.451500   25.625000   11.413400   \n",
       "75%    7314.000000  4305.750000    1.130750   27.290000   37.357600   \n",
       "max    7747.000000  5526.000000   53.002700   30.180000   43.619200   \n",
       "\n",
       "              lon  \n",
       "count  136.000000  \n",
       "mean   -72.949479  \n",
       "std     52.987492  \n",
       "min   -170.045000  \n",
       "25%    -90.375800  \n",
       "50%    -73.367600  \n",
       "75%    -56.020225  \n",
       "max    170.000000  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfSwfHu.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "136 values. Success! Once again, I'll pickle this DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dfSwfHu.to_pickle('/accounts/ekarakoy/DATA/NOMAD/dfSwfHuOcxCI_2012.pkl')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it. Until next time, *Happy Hacking!*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "nikola": {
   "category": "",
   "date": "2017-03-15 14:10:25 UTC-04:00",
   "description": "",
   "link": "",
   "slug": "getting-nomadata-into-a-pandas-dataframe",
   "tags": "ocean color, pandas, chlorophyll",
   "title": "Getting the NASA bio-Optical Marine Algorithm Dataset (NOMAD) into a Pandas DataFrame",
   "type": "text"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}