{ "cells": [ { "cell_type": "markdown", "id": "f0b71ac1", "metadata": {}, "source": [ "# Example 4 - Outlier removal" ] }, { "cell_type": "markdown", "id": "b294d955", "metadata": {}, "source": [ "Here we present how the `CoreFieldModel`-routine `reject_outliers` can be used to get a list of records that deviate from the rest, and thus are considered outliers. The method is based on a Naive Bayes-classifier. The biggest contrast to existing outlier rejection schemes is that there is no fixed threshold, but records are rejected based on a probabilistic scheme, that depends on the records surrounding the outliers. We first create a model:" ] }, { "cell_type": "code", "execution_count": null, "id": "75223ebe", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from paleokalmag.corefieldmodel import CoreFieldModel\n", "\n", "myModel = CoreFieldModel(\n", " lmax=20,\n", " gamma=-35,\n", " R=2800,\n", " alpha_dip=13.8,\n", " tau_dip=250,\n", " alpha_wodip=39.4,\n", " tau_wodip=393,\n", " rho=3.8,\n", ")" ] }, { "cell_type": "markdown", "id": "b2128690", "metadata": {}, "source": [ "Next, we load an example dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "f70f6d01", "metadata": {}, "outputs": [], "source": [ "from paleokalmag import ChunkedData\n", "\n", "path = './outliers_example.dat'\n", "cdat = ChunkedData(path, delta_t=10)" ] }, { "cell_type": "markdown", "id": "0f961304", "metadata": {}, "source": [ "The list of outliers can then be obtained, by running `reject_outliers` on the data:" ] }, { "cell_type": "code", "execution_count": null, "id": "4b8967e9", "metadata": {}, "outputs": [], "source": [ "to_reject = myModel.reject_outliers(cdat, quiet=True)" ] }, { "cell_type": "markdown", "id": "39abc038", "metadata": {}, "source": [ "## Analyzing the outliers\n", "To see how outliers are rejected, we take a closer look at the distribution of outliers and data. Let's start by getting the data in a more accessible form:" ] }, { "cell_type": "code", "execution_count": null, "id": "ecb6b7ff", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from paleokalmag import Data\n", "data = Data(path).data" ] }, { "cell_type": "markdown", "id": "e6b4abe5", "metadata": {}, "source": [ "Next we extract the outliers from the data and replace them by `NaN`s in the original `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "id": "1909556f", "metadata": {}, "outputs": [], "source": [ "outliers = data.copy()\n", "outliers[['D', 'I', 'F']] = np.nan\n", "for tp, uid, fid in to_reject[:, :3]:\n", " idx = data.query(\n", " f'UID == {uid} and FID == {fid}'\n", " ).index\n", " # prevent checking the dataframe columns for keys that don't\n", " # exist (if FID does agree, everything should be there)\n", " if len(idx):\n", " outliers.loc[idx, [tp]] = data.loc[idx, [tp]]\n", " data.loc[idx, [tp]] = np.nan\n", "\n", "outliers.dropna(subset=['D', 'I', 'F'], how='all', inplace=True)\n", "data.dropna(subset=['D', 'I', 'F'], how='all', inplace=True)" ] }, { "cell_type": "markdown", "id": "d257af66", "metadata": {}, "source": [ "We can have a look at the spacial distribution now:" ] }, { "cell_type": "code", "execution_count": null, "id": "b1ba1cdd", "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "import cartopy.crs as ccrs\n", "plt.rcParams['font.size'] = '14'\n", "\n", "\n", "fig = plt.figure(figsize=(9, 5))\n", "proj = ccrs.Mollweide()\n", "\n", "ax = fig.add_subplot(111, projection=proj);\n", "x, y, _ = proj.transform_points(\n", " ccrs.Geodetic(),\n", " data['lon'].to_numpy(),\n", " data['lat'].to_numpy(),\n", ").T\n", "\n", "ax.scatter(x, y, color='grey', alpha=0.3, zorder=0, label='Data');\n", "\n", "xo, yo, _ = proj.transform_points(\n", " ccrs.Geodetic(),\n", " outliers['lon'].to_numpy(),\n", " outliers['lat'].to_numpy(),\n", ").T\n", "\n", "ax.scatter(xo, yo, color='C3', alpha=0.3, zorder=0, label='Outliers');\n", "\n", "ax.legend();\n", "ax.set_global();\n", "ax.coastlines(zorder=1);" ] }, { "cell_type": "markdown", "id": "e14c979f", "metadata": {}, "source": [ "To illustrate the behaviour of the Naive Bayes-classifier, we consider records in Iceland. To not overload the notebook, we outsource the plotting and import it here:" ] }, { "cell_type": "code", "execution_count": null, "id": "ed9ccdcf", "metadata": {}, "outputs": [], "source": [ "from outlier_plotting import plotLoc\n", "\n", "fig = plt.figure(figsize=(9, 5))\n", "\n", "axs = np.empty(3, dtype='object')\n", "axs[0] = fig.add_subplot(211)\n", "axs[1] = fig.add_subplot(223)\n", "axs[2] = fig.add_subplot(224)\n", "\n", "myModel.name = 'Model'\n", "plotLoc((64.7, -20.5), axs, myModel, data, outliers, R=250)\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "id": "2330187d", "metadata": {}, "source": [ "We can get an idea of how the outlier identification works by comparing the records at -1200 in the declination (D) and inclination (I) panel:\n", "\n", "* In the declination panel, three records are clustered while one lies further away. This one (red) is considered an outlier.\n", "* In the inclination panel, the records are spread out more evenly. Thus even though the model is closer to the top two records, the one most deviating from the model is not rejected, as the data do not indicate a deviation." ] }, { "cell_type": "markdown", "id": "7dd8afd3", "metadata": {}, "source": [ "## Removing the outliers\n", "To remove the outliers, we can pass the rejection list to `ChunkedData` or `Data`, to get an instance with the outliers removed:" ] }, { "cell_type": "code", "execution_count": null, "id": "9b742c4d", "metadata": {}, "outputs": [], "source": [ "cdat = ChunkedData(path, delta_t=10, rejection_lists=to_reject)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }