{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In this post, I'm going to briefly describe how I download the [NASA bio-Optical Marine Algorithm Dataset or NOMAD](https://seabass.gsfc.nasa.gov/wiki/NOMAD), created for algorithm development, extract the data I need, and store it all neatly in a Pandas DataFrame. Here I use the latest dataset, NOMAD v.2, created in 2008.\n", "\n", "First things first; imports!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "import pandas as pd\n", "import re\n", "import numpy as np\n", "import pickle\n", "from IPython.core.display import display, HTML" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(HTML(\"\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data can be accessed through a URL that I'll store in a string below." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "NOMADV2url='https://seabass.gsfc.nasa.gov/wiki/NOMAD/nomad_seabass_v2.a_2008200.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, I'll write a couple of functions. The first gets the data from the URL. The second parses the text returned by the first and puts it in a Pandas DataFrame. This second function makes more sense after inspecting the content of the page at the URL above." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def GetNomad(url=NOMADV2url):\n", "    \"\"\"Download and return data as a list of text lines\"\"\"\n", "    resp = requests.get(url)  # use the url argument, not the module-level default\n", "    content = resp.text.splitlines()\n", "    resp.close()\n", "    return content\n", "\n", "def ParseTextFile(textFile, topickle=False, convert2DateTime=False, **kwargs):\n", "    \"\"\"\n", "    * topickle: pickle resulting DataFrame if True\n", "    * convert2DateTime: join date/time columns and convert entries to datetime objects\n", "    * kwargs:\n", "        pkl_fname: pickle file name to save DataFrame by, if topickle=True\n", "    \"\"\"\n", "    # Pre-compute some regex\n", "    columns = re.compile('^/fields=(.+)') # to get field/column names\n", "    units = re.compile('^/units=(.+)') # to get units -- optional\n", "    endHeader = re.compile('^/end_header') # to know when to start storing data\n", "    # Set some milestones\n", "    noFields = True\n", "    getData = False\n", "    # loop through the text data\n", "    for line in textFile:\n", "        if noFields:\n", "            fieldStr = columns.findall(line)\n", "            if len(fieldStr)>0:\n", "                noFields = False\n", "                fieldList = fieldStr[0].split(',')\n", "                dataDict = dict.fromkeys(fieldList)\n", "                continue # nothing left to do with this line, keep looping\n", "        if not getData:\n", "            if endHeader.match(line):\n", "                # end of header reached, start acquiring data\n", "                getData = True\n", "        else:\n", "            dataList = line.split(',')\n", "            for field,datum in zip(fieldList, dataList):\n", "                if not dataDict[field]:\n", "                    dataDict[field] = []\n", "                dataDict[field].append(datum)\n", "    df = pd.DataFrame(dataDict, columns=fieldList)\n", "    if convert2DateTime:\n", "        datetimelabels=['year', 'month', 'day', 'hour', 'minute', 'second']\n", "        df['Datetime']= pd.to_datetime(df[datetimelabels],\n", "                                       format='%Y-%m-%dT%H:%M:%S')\n", "        df.drop(datetimelabels, axis=1, inplace=True)\n", "    if topickle:\n", "        fname=kwargs.pop('pkl_fname', 'dfNomad2.pkl')\n", 
"        df.to_pickle(fname)\n", "    return df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df = ParseTextFile(GetNomad(), topickle=True, convert2DateTime=True,\n", "                   pkl_fname='./bayesianChl_DATA/dfNomadRaw.pkl')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " lat lon id oisst etopo2 chl chl_a kd405 kd411 kd443 \\\n", "0 38.4279 -76.61 1565 3.7 0.0 38.19 -999 -999 3.9455 3.1457 \n", "1 38.368 -76.5 1566 3.7 0.0 35.01 -999 -999 2.5637 2.0529 \n", "2 38.3074 -76.44 1567 3.7 1 26.91 -999 -999 2.1533 1.7531 \n", "3 38.6367 -76.32 1568 3.7 3 47.96 -999 -999 2.69 2.2985 \n", "4 38.3047 -76.44 1559 22.03 1 23.55 -999 -999 3.095 2.3966 \n", "\n", " ... diato lut zea chl_b beta-car alpha-car \\\n", "0 ... -999 -999 -999 -999 -999 -999 \n", "1 ... -999 -999 -999 -999 -999 -999 \n", "2 ... -999 -999 -999 -999 -999 -999 \n", "3 ... -999 -999 -999 -999 -999 -999 \n", "4 ... -999 -999 -999 -999 -999 -999 \n", "\n", " alpha-beta-car flag cruise Datetime \n", "0 -999 20691 ace0301 2003-04-15 15:15:00 \n", "1 -999 20675 ace0301 2003-04-15 16:50:00 \n", "2 -999 20691 ace0301 2003-04-15 17:50:00 \n", "3 -999 20675 ace0301 2003-04-17 18:15:00 \n", "4 -999 20691 ace0302 2003-07-21 18:27:00 \n", "\n", "[5 rows x 212 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This DataFrame is quite large and unwieldy, with 212 columns. But Pandas makes it easy to extract the necessary data for a particular project. For my current project, which I'll go over in a subsequent post, I need field data relevant to the [SeaWiFS sensor](https://en.wikipedia.org/wiki/SeaWiFS), in particular optical data at wavelengths 412, 443, 490, 510, 555, and 670 nm. First, let's look at the available bands as they appear in the spectral surface irradiance column labels, which start with 'es'." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['405', '411', '443', '455', '465', '489', '510', '520', '530', '550', '555', '560', '565', '570', '590', '619', '625', '665', '670', '683']\n" ] } ], "source": [ "bandregex = re.compile('es([0-9]+)')\n", "bands = bandregex.findall(''.join(df.columns))\n", "print(bands)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I can extract data with bands that are the closest to what I need. In the process I'm going to use water leaving radiance and spectral surface irradiance to compute remote sensing reflectance, rrs. I will store this new data in a new DataFrame, dfSwf." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "swfBands = ['411','443','489','510','555','670']\n", "dfSwf = pd.DataFrame(columns=['rrs%s' % b for b in swfBands])\n", "for b in swfBands:\n", " dfSwf.loc[:,'rrs%s'%b] = df.loc[:,'lw%s' % b].astype('f8') / df.loc[:,'es%s' % b].astype('f8')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rrs411 rrs443 rrs489 rrs510 rrs555 rrs670\n", "0 0.001204 0.001686 0.003293 0.004036 0.007479 0.003465\n", "1 0.001062 0.001384 0.002173 0.002499 0.004152 0.001695\n", "2 0.000971 0.001185 0.001843 0.002288 0.004246 0.001612\n", "3 0.001472 0.001741 0.002877 0.003664 0.006982 0.003234\n", "4 0.000905 0.001022 0.001506 0.001903 0.002801 0.001791" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfSwf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the projects I'm currently working on, I'll need to select a few more features from the initial dataset. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dfSwf['id'] = df.id.astype('i4') # in case I need to relate this data to the original\n", "dfSwf['datetime'] = df.Datetime\n", "dfSwf['hplc_chl'] = df.chl_a.astype('f8')\n", "dfSwf['fluo_chl'] = df.chl.astype('f8')\n", "dfSwf['lat'] = df.lat.astype('f8')\n", "dfSwf['lon'] = df.lon.astype('f8')\n", "dfSwf['depth'] = df.etopo2.astype('f8')\n", "dfSwf['sst'] = df.oisst.astype('f8')\n", "for band in swfBands:\n", "    addprods=['a','ad','ag','ap','bb']\n", "    for prod in addprods:\n", "        dfSwf['%s%s' % (prod,band)] = df['%s%s' % (prod, band)].astype('f8')\n", "dfSwf.replace(-999,np.nan, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tallying the features I've gathered..." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['rrs411', 'rrs443', 'rrs489', 'rrs510', 'rrs555', 'rrs670', 'id',\n", " 'datetime', 'hplc_chl', 'fluo_chl', 'lat', 'lon', 'depth', 'sst',\n", " 'a411', 'ad411', 'ag411', 'ap411', 'bb411', 'a443', 'ad443', 'ag443',\n", " 'ap443', 'bb443', 'a489', 'ad489', 'ag489', 'ap489', 'bb489', 'a510',\n", " 'ad510', 'ag510', 'ap510', 'bb510', 'a555', 'ad555', 'ag555', 'ap555',\n", " 'bb555', 'a670', 'ad670', 'ag670', 'ap670', 'bb670'],\n", " dtype='object')\n" ] } ], "source": [ "print(dfSwf.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That seems like a good dataset to start with. I'll pickle this DataFrame just in case." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "dfSwf.to_pickle('./bayesianChl_DATA/dfNomadSWF.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first project I'll tackle is a recasting of the OCx empirical band ratio algorithms within a Bayesian framework. For that, I can further cull the dataset following the \"Data Source\" section of the paper I am using for comparison, [Hu *et al.*, 2012](http://onlinelibrary.wiley.com/doi/10.1029/2011JC007395/pdf). This study draws from the same dataset, applying the following criteria:\n", "* only hplc chlorophyll\n", "* chl>0 where rrs>0\n", "* depth>30\n", "* lat $\\in\\left[-60,60\\right]$\n", "\n", "Applying these criteria should result in a dataset reduced to ***136*** observations." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "rrsCols = [col for col in dfSwf.columns if 'rrs' in col]\n", "iwantcols=rrsCols + ['id', 'depth','hplc_chl','sst','lat','lon']\n", "dfSwfHu = dfSwf[iwantcols].copy()\n", "del dfSwf, df" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 4459 entries, 0 to 4458\n", "Data columns (total 12 columns):\n", "rrs411 4459 non-null float64\n", "rrs443 4459 non-null float64\n", "rrs489 4459 non-null float64\n", "rrs510 4459 non-null float64\n", "rrs555 4459 non-null float64\n", "rrs670 4459 non-null float64\n", "id 4459 non-null int32\n", "depth 4459 non-null float64\n", "hplc_chl 1381 non-null float64\n", "sst 4459 non-null float64\n", "lat 4459 non-null float64\n", "lon 4459 non-null float64\n", "dtypes: float64(11), int32(1)\n", "memory usage: 400.7 KB\n" ] } ], "source": [ "dfSwfHu.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently the only null entries are in the hplc_chl column. Dropping the nulls in that column takes care of the first of the criteria listed above." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dfSwfHu.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rrs411 rrs443 rrs489 rrs510 rrs555 \\\n", "count 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 \n", "mean 0.107953 0.005607 0.005228 0.120503 0.103273 \n", "std 0.302128 0.003567 0.002890 0.319626 0.298920 \n", "min -0.000288 0.000190 0.000422 0.000304 0.000218 \n", "25% 0.002788 0.002884 0.003346 0.003098 0.001663 \n", "50% 0.005102 0.004763 0.004880 0.003804 0.002305 \n", "75% 0.009592 0.007798 0.006380 0.005700 0.005715 \n", "max 1.000000 0.027601 0.025900 1.000000 1.000000 \n", "\n", " rrs670 id depth hplc_chl sst \\\n", "count 1381.000000 1381.000000 1381.000000 1381.000000 1381.000000 \n", "mean 0.585099 5175.859522 1936.900072 2.285293 19.159754 \n", "std 0.492136 2161.341423 1998.475771 5.752391 7.629313 \n", "min 0.000000 644.000000 0.000000 0.017000 -1.460000 \n", "25% 0.001184 2853.000000 39.000000 0.145000 15.090000 \n", "50% 1.000000 6181.000000 753.000000 0.538000 20.080000 \n", "75% 1.000000 6796.000000 3992.000000 1.694000 25.450000 \n", "max 1.000000 7765.000000 6041.000000 70.213300 30.760000 \n", "\n", " lat lon \n", "count 1381.000000 1381.000000 \n", "mean 11.752954 -53.511340 \n", "std 32.350239 60.334355 \n", "min -64.418600 -177.004000 \n", "25% -10.776700 -88.669400 \n", "50% 29.842400 -63.852000 \n", "75% 34.298000 -21.500800 \n", "max 54.000300 173.920000 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfSwfHu.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the summary table above, I don't need to worry about 0 chl as per the criteria above. However, it appears several reflectances have spurious 1.0000 values. Since these were never mentioned in the paper, I'll first cull the dataset according to depth and lat criteria, see if that takes care of cleaning those values as well. 
This should land me with 136 observations" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dfSwfHu=dfSwfHu.loc[((dfSwfHu.depth>30) &\\\n", " (dfSwfHu.lat>=-60) & (dfSwfHu.lat<=60)),:]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rrs411 rrs443 rrs489 rrs510 rrs555 rrs670 \\\n", "count 964.000000 964.000000 964.000000 964.000000 964.000000 964.000000 \n", "mean 0.152406 0.005741 0.004797 0.161636 0.145059 0.705510 \n", "std 0.352490 0.003378 0.002059 0.364331 0.349634 0.455923 \n", "min -0.000212 0.000190 0.000422 0.000304 0.000218 0.000000 \n", "25% 0.003000 0.002900 0.003256 0.002952 0.001592 0.000598 \n", "50% 0.006208 0.005254 0.004719 0.003620 0.001988 1.000000 \n", "75% 0.011122 0.008080 0.006190 0.004477 0.003064 1.000000 \n", "max 1.000000 0.016246 0.019676 1.000000 1.000000 1.000000 \n", "\n", " id depth hplc_chl sst lat \\\n", "count 964.000000 964.000000 964.000000 964.000000 964.000000 \n", "mean 5347.700207 2552.714730 1.094060 19.800280 11.748191 \n", "std 2059.463026 1963.955594 3.129181 6.061552 28.743822 \n", "min 644.000000 31.000000 0.017000 2.020000 -59.756300 \n", "25% 4260.750000 338.500000 0.114000 15.545000 -12.503000 \n", "50% 6198.500000 3066.500000 0.265000 19.410000 23.902400 \n", "75% 6662.500000 4312.000000 0.936500 25.360000 34.252500 \n", "max 7760.000000 6041.000000 53.002700 30.180000 54.000300 \n", "\n", " lon \n", "count 964.000000 \n", "mean -42.971192 \n", "std 67.950684 \n", "min -177.004000 \n", "25% -117.247750 \n", "50% -39.868200 \n", "75% -17.495375 \n", "max 173.920000 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfSwfHu.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nope. We're down to 964 observations. So much for reproducibility via publication. Next, I'll get rid of the spurious rrs values. An rrs of exactly 1 most likely means that both lw and es held the -999 missing-value flag when I computed their ratio, so these rows carry no usable reflectance." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dfSwfHu = dfSwfHu.loc[((dfSwfHu.rrs411<1.0) & (dfSwfHu.rrs510<1.0)&\\\n", " (dfSwfHu.rrs555<1.0) & (dfSwfHu.rrs670<1.0)),:]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " rrs411 rrs443 rrs489 rrs510 rrs555 rrs670 \\\n", "count 136.000000 136.000000 136.000000 136.000000 136.000000 136.000000 \n", "mean 0.005336 0.004684 0.004071 0.003217 0.002535 0.000594 \n", "std 0.005361 0.003904 0.002099 0.001401 0.001920 0.001094 \n", "min 0.000051 0.000190 0.000422 0.000497 0.000639 0.000000 \n", "25% 0.001404 0.001766 0.002409 0.002365 0.001568 0.000094 \n", "50% 0.002839 0.002850 0.003435 0.003235 0.001857 0.000175 \n", "75% 0.007855 0.007016 0.005809 0.003892 0.002625 0.000503 \n", "max 0.022010 0.016246 0.009500 0.009600 0.012200 0.007900 \n", "\n", " id depth hplc_chl sst lat \\\n", "count 136.000000 136.000000 136.000000 136.000000 136.000000 \n", "mean 6240.088235 2155.500000 1.942732 21.763382 12.399556 \n", "std 1922.935381 2018.518092 6.550881 6.950208 25.752077 \n", "min 2640.000000 31.000000 0.017000 5.260000 -35.164400 \n", "25% 5903.750000 64.000000 0.145750 16.380000 -1.261000 \n", "50% 7226.500000 2809.500000 0.451500 25.625000 11.413400 \n", "75% 7314.000000 4305.750000 1.130750 27.290000 37.357600 \n", "max 7747.000000 5526.000000 53.002700 30.180000 43.619200 \n", "\n", " lon \n", "count 136.000000 \n", "mean -72.949479 \n", "std 52.987492 \n", "min -170.045000 \n", "25% -90.375800 \n", "50% -73.367600 \n", "75% -56.020225 \n", "max 170.000000 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfSwfHu.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "136 values. Success! Once again, I'll pickle this DataFrame." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dfSwfHu.to_pickle('/accounts/ekarakoy/DATA/NOMAD/dfSwfHuOcxCI_2012.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it. 
Until next time, *Happy Hacking!*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" }, "nikola": { "category": "", "date": "2017-03-15 14:10:25 UTC-04:00", "description": "", "link": "", "slug": "getting-nomadata-into-a-pandas-dataframe", "tags": "ocean color, pandas, chlorophyll", "title": "Getting the NASA bio-Optical Marine Algorithm Dataset (NOMAD) into a Pandas DataFrame", "type": "text" } }, "nbformat": 4, "nbformat_minor": 2 }