Example 4 - Outlier removal

Here we show how the CoreFieldModel routine reject_outliers can be used to obtain a list of records that deviate from the rest and are therefore considered outliers. The method is based on a Naive Bayes classifier. In contrast to existing outlier rejection schemes, there is no fixed threshold: records are rejected by a probabilistic scheme that depends on the records surrounding each candidate outlier. We first create a model:

[1]:
from paleokalmag.corefieldmodel import CoreFieldModel

myModel = CoreFieldModel(
    lmax=20,
    gamma=-35,
    R=2800,
    alpha_dip=13.8,
    tau_dip=250,
    alpha_wodip=39.4,
    tau_wodip=393,
    rho=3.8,
)

Next, we load an example dataset:

[2]:
from paleokalmag import ChunkedData

path = './outliers_example.dat'
cdat = ChunkedData(path, delta_t=10)
/builds/sec23/korte/paleokalmag/paleokalmag/data_handling.py:160: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.




The list of outliers can then be obtained by running reject_outliers on the data:

[3]:
to_reject = myModel.reject_outliers(cdat, quiet=True)
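
A quick way to summarise such a rejection list is to count the rejected entries per component. The sketch below does not call paleokalmag. It uses a small hypothetical stand-in array, to_reject_demo, and assumes only that the first three columns hold the component type (D, I or F), the UID and the FID, which is how the list is unpacked later in this example.

```python
import numpy as np

# Hypothetical stand-in for the array returned by reject_outliers.
# Assumption: the first three columns are (type, UID, FID).
to_reject_demo = np.array(
    [
        ['D', 12, 3],
        ['I', 12, 3],
        ['D', 40, 7],
        ['F', 51, 1],
    ],
    dtype=object,
)

# Count how many entries were rejected for each component type
types, counts = np.unique(
    to_reject_demo[:, 0].astype(str),
    return_counts=True,
)
per_type = dict(zip(types.tolist(), counts.tolist()))
print(per_type)
```

With the real to_reject array, the same call on its first column would show whether declinations, inclinations or intensities dominate the rejections.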

Analyzing the outliers

To see how outliers are rejected, we take a closer look at the distribution of outliers and data. Let’s start by getting the data in a more accessible form:

[4]:
import numpy as np
from paleokalmag import Data
data = Data(path).data


Next, we extract the outliers from the data and replace them with NaNs in the original DataFrame:

[5]:
outliers = data.copy()
outliers[['D', 'I', 'F']] = np.nan
for tp, uid, fid in to_reject[:, :3]:
    idx = data.query(
        f'UID == {uid} and FID == {fid}'
    ).index
    # prevent checking the dataframe columns for keys that don't
    # exist (if FID does agree, everything should be there)
    if len(idx):
        outliers.loc[idx, [tp]] = data.loc[idx, [tp]]
        data.loc[idx, [tp]] = np.nan

outliers.dropna(subset=['D', 'I', 'F'], how='all', inplace=True)
data.dropna(subset=['D', 'I', 'F'], how='all', inplace=True)
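
To see what this split does, it can help to run the same logic on a toy table. The sketch below is self-contained: data_demo and to_reject_demo are made-up stand-ins, and boolean masks replace the query call, which is a stylistic choice rather than what the notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up records: three sites, each with up to three components
data_demo = pd.DataFrame({
    'UID': [1, 2, 3],
    'FID': [0, 0, 0],
    'D': [5.0, 7.0, np.nan],
    'I': [60.0, 62.0, 58.0],
    'F': [np.nan, 50.0, 48.0],
})
# Made-up rejection list: (type, UID, FID)
to_reject_demo = [('D', 2, 0), ('I', 3, 0)]

# Start from an all-NaN copy and move the flagged values into it
outliers_demo = data_demo.copy()
outliers_demo[['D', 'I', 'F']] = np.nan
for tp, uid, fid in to_reject_demo:
    mask = (data_demo['UID'] == uid) & (data_demo['FID'] == fid)
    outliers_demo.loc[mask, tp] = data_demo.loc[mask, tp]
    data_demo.loc[mask, tp] = np.nan

# Drop rows that have no component left in either table
outliers_demo = outliers_demo.dropna(subset=['D', 'I', 'F'], how='all')
data_demo = data_demo.dropna(subset=['D', 'I', 'F'], how='all')

print(outliers_demo)
print(data_demo)
```

Only the flagged component of a record moves to the outlier table. The remaining components of the same record stay in the data.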

We can now look at the spatial distribution:

[6]:
from matplotlib import pyplot as plt
import cartopy.crs as ccrs
plt.rcParams['font.size'] = '14'


fig = plt.figure(figsize=(9, 5))
proj = ccrs.Mollweide()

ax = fig.add_subplot(111, projection=proj);
x, y, _ = proj.transform_points(
    ccrs.Geodetic(),
    data['lon'].to_numpy(),
    data['lat'].to_numpy(),
).T

ax.scatter(x, y, color='grey', alpha=0.3, zorder=0, label='Data');

xo, yo, _ = proj.transform_points(
    ccrs.Geodetic(),
    outliers['lon'].to_numpy(),
    outliers['lat'].to_numpy(),
).T

ax.scatter(xo, yo, color='C3', alpha=0.3, zorder=0, label='Outliers');

ax.legend();
ax.set_global();
ax.coastlines(zorder=1);
_images/example_4_12_0.png

To illustrate the behaviour of the Naive Bayes classifier, we consider records from Iceland. To keep the notebook short, the plotting code lives in a separate module and is imported here:

[7]:
from outlier_plotting import plotLoc

fig = plt.figure(figsize=(9, 5))

axs = np.empty(3, dtype='object')
axs[0] = fig.add_subplot(211)
axs[1] = fig.add_subplot(223)
axs[2] = fig.add_subplot(224)

myModel.name = 'Model'
plotLoc((64.7, -20.5), axs, myModel, data, outliers, R=250)
fig.tight_layout()
_images/example_4_14_0.png

We can get an idea of how the outlier identification works by comparing the records at -1200 in the declination (D) and inclination (I) panels:

  • In the declination panel, three records are clustered while one lies further away. This one (red) is considered an outlier.

  • In the inclination panel, the records are spread out more evenly. Thus, even though the model is closer to the top two records, the record that deviates most from the model is not rejected, because the data themselves do not indicate that it deviates from the others.
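
The contrast between the two panels can be reproduced with a toy two-class Bayes classifier. This is not the algorithm implemented in reject_outliers. It is a minimal sketch under assumptions of our own: the inlier class is a Gaussian whose centre and width are estimated from the neighbouring records (median and median absolute deviation), the outlier class is a flat density over a 360 degree range, and the prior outlier probability is 0.1. The function name outlier_probability and all numbers are hypothetical.

```python
import numpy as np


def outlier_probability(values, prior_outlier=0.1, value_range=360.0):
    """Posterior probability that each value belongs to the outlier class."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    # Robust local scale from the median absolute deviation (MAD)
    scale = 1.4826 * np.median(np.abs(values - center))
    scale = max(scale, 1e-6)  # guard against a zero scale
    # Inlier class: Gaussian around the local centre
    z = (values - center) / scale
    p_inlier = np.exp(-0.5 * z**2) / (scale * np.sqrt(2 * np.pi))
    # Outlier class: flat density over the admissible range
    p_outlier = np.full_like(values, 1.0 / value_range)
    # Bayes' rule for the outlier class
    evidence = prior_outlier * p_outlier + (1 - prior_outlier) * p_inlier
    return prior_outlier * p_outlier / evidence


# Three clustered records and one far away (declination-like case)
clustered = outlier_probability([10.0, 11.0, 9.5, 30.0])
# Four evenly spread records (inclination-like case)
spread = outlier_probability([0.0, 10.0, 20.0, 30.0])

print(np.round(clustered, 3))
print(np.round(spread, 3))
```

The value 30 appears in both sets. It receives a high outlier probability only when the other records are tightly clustered, which is the sense in which the decision depends on the surrounding records rather than on a fixed threshold.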

Removing the outliers

To remove the outliers, we can pass the rejection list to ChunkedData or Data to get an instance with the outliers removed:

[8]:
cdat = ChunkedData(path, delta_t=10, rejection_lists=to_reject)
Rejected 86 outliers.
/builds/sec23/korte/paleokalmag/paleokalmag/data_handling.py:180: UserWarning: Records with indices [ 304  786  940  968  971 1061 1197 1272] contain declination, but not inclination! The errors need special treatment!
To be able to use the provided data, these records have been dropped from the output.