Correlation & Causation - Part 1

"Correlation is not causation," goes the saying, cautioning against spurious correlation. Causally related phenomena, however, do correlate. How, then, can one differentiate spurious correlation from causal correlation when confronted with a modeling problem? The following recommendations, together with simple simulations like the one that is the subject of this post, go a long way.

  1. Use causal DAGs - Directed Acyclic Graphs - to represent the current understanding of how the various pieces of data might be causally influencing each other (more on DAGs below).
  2. Be aware of at least the four basic confounding mechanisms that lead to spurious correlation: the Collider, the Pipe, the Fork, and the Descendant.
  3. Be aware that spurious correlation through confounding can lead to better model performance that is ultimately misleading and that can lead further inquiry astray.
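The Fork is the simplest of these mechanisms to simulate. Below is a minimal toy sketch (my own example, not the post's actual simulation): a common cause Z drives both X and Y, so X and Y correlate even though neither causes the other, and conditioning on Z makes the correlation vanish.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

z = rng.normal(size=n)             # the common cause (the Fork)
x = 2.0 * z + rng.normal(size=n)   # X <- Z
y = -1.5 * z + rng.normal(size=n)  # Y <- Z

# Marginally, X and Y correlate strongly...
r_marginal = np.corrcoef(x, y)[0, 1]

# ...but conditioning on Z (here, residualizing both on Z) removes it.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
r_conditional = np.corrcoef(x_resid, y_resid)[0, 1]

print(f"marginal r = {r_marginal:.2f}, conditional r = {r_conditional:.2f}")
```

A model regressing Y on X here would fit well, yet any intervention on X would leave Y unchanged — exactly the misleading performance point 3 warns about.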

Read more…

Automatically Checking Changes in my Citizenship Application Status

I am currently in the process of acquiring my US citizenship. It's a long process without a set schedule. I found myself frequently visiting the USCIS status page to see if any update had been posted. After a few iterations I grew tired of this, and rather than waste time going to the website, entering my case number, navigating to the status page, and checking for changes, I decided to automate these steps with a Python script.

In this post I detail the steps of the script, which does the status check for me and shoots me an e-mail when it detects a change. I then show how I made it into a cron job so that it would run automatically on a schedule of my choosing.

Here I leverage Selenium, BeautifulSoup, smtplib, and python-crontab.

Note that building my automaton requires scouting the pages it needs to navigate in order to identify the specific elements it needs to interact with; in this case, only a handful.
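As an illustration of the change-detection step only, here is a minimal sketch (the function and file names are my own, not the script's): hash the scraped status text and compare it against the hash saved from the previous run. Fetching the page with Selenium/BeautifulSoup and sending the e-mail with smtplib would wrap around this.

```python
import hashlib
from pathlib import Path

def status_changed(status_text: str, cache_file: Path) -> bool:
    """Return True when status_text differs from the previous run's snapshot."""
    new_hash = hashlib.sha256(status_text.encode("utf-8")).hexdigest()
    old_hash = cache_file.read_text() if cache_file.exists() else ""
    cache_file.write_text(new_hash)  # remember for the next run
    return new_hash != old_hash

cache = Path("uscis_status.hash")
if status_changed("Case Was Received", cache):
    print("status changed -- this is where the e-mail would be sent")
```

Storing only a hash rather than the full page text keeps the cache trivial and makes the comparison insensitive to how long the status message is.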

Read more…

Relating Principal Components To The Data Whence They Came

When preparing a dataset for machine learning, a common step is to reduce dimensions using principal component analysis (PCA). This is often necessary when multi-collinearity is an issue, as is frequently the case in multi- and hyper-spectral satellite data, where individual channels add little information to that carried by their neighbors.

Here I use data from the Hyperspectral Imager for the Coastal Ocean (HICO), with several thousand observations.
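As a sketch of the dimensionality-reduction step (using synthetic multi-collinear "spectra", not the actual HICO data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_obs, n_channels = 2000, 100

# Simulate highly collinear channels: a few latent shapes mixed together.
latent = rng.normal(size=(n_obs, 3))
mixing = rng.normal(size=(3, n_channels))
spectra = latent @ mixing + 0.01 * rng.normal(size=(n_obs, n_channels))

pca = PCA(n_components=3).fit(spectra)
reduced = pca.transform(spectra)

print(reduced.shape)                        # 100 channels down to 3 components
print(pca.explained_variance_ratio_.sum())  # nearly all the variance survives
```

Each row of `pca.components_` is a loading vector over the original channels — which is precisely how the components relate back to the data whence they came.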

Read more…

Faster comparison of two Numpy arrays

This short post details an easy way to speed up array comparison. My problem was that I had to compare two 3D arrays, A and B. These arrays differ only in their first dimension, and have equal $2^{nd}$ and $3^{rd}$ dimensions. I needed to find out whether any of the 2D arrays contained along the first dimension of, say, A were also present along the first dimension of B. The first approach one might try is based on nested for-loops, but this quickly becomes unwieldy with even moderately sized arrays. The faster alternative is to hash the 2D slices of one of the arrays and store the hash table; I prefer to do that with the smaller array for space efficiency. The next step is to go along the first dimension of the other array, hash each slice, and check for membership. This ends up requiring only serial for-loops. Note here that the Python version is 3.6, which implements hashing differently than 2.x. Note also the use of f-strings, a nifty new feature of Python 3.6.
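The approach can be sketched as follows (my reconstruction, not the post's exact code). Here the raw bytes of each 2D slice serve as the hashable keys; this requires both arrays to share dtype and slice shape.

```python
import numpy as np

def shared_slices(a: np.ndarray, b: np.ndarray) -> list:
    """Indices i for which the 2D slice a[i] also occurs along b's first axis."""
    # Build the hash table from one array's slices (ideally the smaller one).
    table = {b[j].tobytes() for j in range(b.shape[0])}
    # One serial pass over the other array -- no nested loops.
    return [i for i in range(a.shape[0]) if a[i].tobytes() in table]

a = np.arange(24).reshape(4, 2, 3)
b = np.vstack([a[1:2], np.zeros((1, 2, 3), dtype=a.dtype)])
print(f"matching slices of A: {shared_slices(a, b)}")
```

Membership tests against a set are O(1) on average, so the whole comparison is linear in the number of slices instead of quadratic.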

Read more…

Bayesian Approach to Chlorophyll Estimation from Satellite Remote Sensing

This post describes the use of Bayesian polynomial regression to estimate chlorophyll from remote sensing reflectance; the output is contrasted with that obtained via frequentist ordinary least squares regression.

Chlorophyll is estimated from ocean color remote sensing data using one of two main types of algorithms: semi-analytical or empirical. The latter is a polynomial model, where the input is a ratio of bands and the coefficients are obtained via ordinary least squares fitting. These models are usually more successful than their semi-analytical counterparts and, as a result, are at the forefront of the operational algorithmic arsenal used by the Ocean Biology Processing Group at NASA Goddard.
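As a minimal illustration of the contrast, here is a conjugate Gaussian sketch on synthetic data (known noise variance, Gaussian prior on the coefficients; the post's actual model may well differ). With a weak prior and enough data, the posterior mean essentially recovers the OLS fit, while also providing a posterior covariance for the coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)       # stand-in for a log band ratio
true_w = np.array([0.3, -1.2, 0.5])   # quadratic coefficients, low to high order
y = np.polyval(true_w[::-1], x) + 0.1 * rng.normal(size=x.size)

X = np.vander(x, N=3, increasing=True)  # design matrix [1, x, x^2]

# Frequentist baseline: ordinary least squares.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Bayesian: prior w ~ N(0, tau^2 I), known noise variance sigma^2.
tau2, sigma2 = 10.0, 0.1 ** 2
S_inv = X.T @ X / sigma2 + np.eye(3) / tau2
w_post = np.linalg.solve(S_inv, X.T @ y / sigma2)  # posterior mean
w_cov = np.linalg.inv(S_inv)                       # posterior covariance

print(np.round(w_ols, 2), np.round(w_post, 2))  # nearly identical here
```

The practical difference shows up with small or noisy datasets, where the prior shrinks the coefficients and the posterior covariance quantifies the remaining uncertainty.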

Read more…

XARRAY & GEOVIEWS: A New Perspective on Oceanographic Data - Part I

With this post I explore an alternative to ol' numpy: xarray. Numpy is still running under the hood, but this very handy library applies the pandas concept of labeled dimensions to the large N-dimensional arrays prevalent in scientific computing. The result is easy manipulation of dimensions without having to guess or remember what they correspond to. Moreover, xarray plays nicely with two other relatively new libraries:

  • dask, which enables out-of-core computation, so that memory availability becomes much less of an issue with large data sets;
  • GeoViews, a library sitting on top of the visualization package HoloViews, with an emphasis on geophysical data. HoloViews eases the burden of data visualization with an unusual approach that does away with step-by-step graphical coding and lets the user concentrate on the data's organization instead. This results in a substantial reduction in code written, which makes data analysis much cleaner and less bug-prone. It's also my first good-bye to the aging (10+ years) matplotlib library. It'll still be handy now and then, but it's time to try new things.
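The labeled-dimensions idea in a nutshell (a synthetic example, not the post's oceanographic dataset):

```python
import numpy as np
import xarray as xr

sst = xr.DataArray(
    20 + np.random.default_rng(0).normal(size=(3, 4, 5)),
    dims=("time", "lat", "lon"),
    coords={
        "time": [0, 1, 2],
        "lat": np.linspace(-10, 10, 4),
        "lon": np.linspace(100, 140, 5),
    },
    name="sst",
)

# Operate by name instead of remembering axis positions:
mean_over_time = sst.mean(dim="time")   # no more axis=0 guesswork
tropics = sst.sel(lat=slice(-5, 5))     # label-based slicing

print(mean_over_time.dims)
print(tropics.sizes["lat"])
```

Compare `sst.mean(dim="time")` with the bare-numpy `arr.mean(axis=0)`: the former still reads correctly after the array is transposed or a dimension is added.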

Read more…

Titanic Diaries - Part I: Data Dusting

In most data science tutorials I have seen, much of the data clean-up is done in what seems to me a casual manner, treated as an annoying obstacle on the way to the sexy Machine Learning bit. I was curious to see what difference, if any, a somewhat more careful approach to data clean-up would make to my Kaggle ranking. My first step was to simply follow Datacamp's tutorial and submit my test set labels as a benchmark. The second step was to use a more elaborate data clean-up process and see whether taking the extra time actually moves my ranking up, or maybe down.
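One flavor of "more careful" clean-up, sketched below on a toy frame (the column names follow the Kaggle Titanic data; the imputation choices are my own illustration, not the post's): impute `Age` per passenger class rather than with a single global value, and fill `Embarked` with its mode.

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 3],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [38.0, None, 26.0, None],
    "Embarked": ["C", "S", None, "S"],
})

# Per-class median is more defensible than a blanket fillna:
df["Age"] = df["Age"].fillna(df.groupby("Pclass")["Age"].transform("median"))
# Categorical gap filled with the most frequent value:
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df.isna().sum().sum())  # no missing values remain
```

Whether such care actually moves the leaderboard needle is, of course, the question the post sets out to answer.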

Read more…
