cryptodatapy.transform.clean

Classes

CleanData

Cleans data to improve data quality.

Module Contents

class cryptodatapy.transform.clean.CleanData(df: pandas.DataFrame)

Cleans data to improve data quality.

raw_df
df
excluded_cols = None
outliers = None
yhat = None
repaired_df = None
filtered_df = None
filtered_tickers = None
summary
initialize_summary() None

Initializes summary dataframe with data quality metrics.

check_types() None

Checks data types of columns and converts them to the appropriate data types.

Returns:

CleanData object

Return type:

CleanData

filter_outliers(od_method: str = 'mad', excl_cols: str | list | None = None, **kwargs) CleanData

Filters outliers.

Parameters:
  • od_method (str, {'atr', 'iqr', 'mad', 'z_score', 'ewma', 'stl', 'seasonal_decomp', 'prophet'}, default 'mad') – Outlier detection method to use for filtering.

  • excl_cols (str or list, default None) – Name(s) of columns to exclude from outlier filtering.

Returns:

CleanData object

Return type:

CleanData
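To make the 'mad' option concrete, the following is a minimal, standard-library sketch of median absolute deviation (MAD) filtering on a single series. It is not the library's implementation: the helper name `mask_outliers_mad`, the `thresh` parameter, and the global (non-rolling) median are all assumptions made for illustration.

```python
import math
import statistics


def mask_outliers_mad(values, thresh=3.0):
    """Return a copy of values with MAD-flagged outliers replaced by NaN.

    Hypothetical helper: uses one global median rather than a rolling window.
    """
    clean = [v for v in values if not math.isnan(v)]
    med = statistics.median(clean)
    mad = statistics.median(abs(v - med) for v in clean)
    if mad == 0:
        # No dispersion around the median, so no score can be computed.
        return list(values)
    out = []
    for v in values:
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # when the data are approximately normal.
        is_outlier = not math.isnan(v) and abs(0.6745 * (v - med) / mad) > thresh
        out.append(float("nan") if is_outlier else v)
    return out


prices = [100.0, 101.0, 99.5, 100.5, 250.0, 100.2]
filtered = mask_outliers_mad(prices)
```

In this sketch only the 250.0 observation is replaced by NaN; the remaining values pass through unchanged.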

repair_outliers(imp_method: str = 'interpolate', **kwargs) CleanData

Repairs outliers using an imputation method.

Parameters:

imp_method (str, {'fwd_fill', 'interpolate', 'fcst'}, default 'interpolate') – Imputation method used to replace filtered outliers.

Returns:

CleanData object

Return type:

CleanData
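The difference between the 'fwd_fill' and 'interpolate' options can be sketched on a plain list in which filtered outliers appear as NaN. The helpers `fwd_fill` and `interpolate_linear` below are hypothetical stand-ins written with the standard library only, and this linear interpolation deliberately leaves leading and trailing gaps untouched, which may differ from what the library does.

```python
import math


def fwd_fill(values):
    """Replace each NaN with the most recent non-missing value."""
    out, last = [], float("nan")
    for v in values:
        if not math.isnan(v):
            last = v
        out.append(last)
    return out


def interpolate_linear(values):
    """Fill interior NaN runs on a straight line between their neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


nan = float("nan")
series = [100.0, nan, nan, 106.0, nan]
filled = fwd_fill(series)
interpolated = interpolate_linear(series)
```

Forward filling repeats 100.0 across the gap, whereas linear interpolation bridges it with 102.0 and 104.0.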

filter_avg_trading_val(thresh_val: int = 10000000, window_size: int = 30) CleanData

Filters values below a threshold of average trading value (price * volume/size in quote currency) over some lookback window, replacing them with NaNs.

Parameters:
  • thresh_val (int, default 10,000,000) – Threshold/cut-off for avg trading value.

  • window_size (int, default 30) – Size of rolling window.

Returns:

CleanData object

Return type:

CleanData
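As an illustration of the idea only, the sketch below computes trading value as close * volume, averages it over a rolling window, and masks prices where that average falls below the threshold. The helper name `mask_low_trading_val` is hypothetical, and requiring a full window before keeping a value is an assumption rather than documented behavior.

```python
import math


def mask_low_trading_val(close, volume, thresh_val=10_000_000, window_size=30):
    """Replace prices with NaN where rolling mean trading value < thresh_val."""
    trading_val = [c * v for c, v in zip(close, volume)]
    out = []
    for i, price in enumerate(close):
        window = trading_val[max(0, i - window_size + 1): i + 1]
        # Assumption: an incomplete window is treated as insufficient evidence.
        if len(window) < window_size or sum(window) / len(window) < thresh_val:
            out.append(float("nan"))
        else:
            out.append(price)
    return out


close = [10.0, 10.0, 10.0, 10.0]
volume = [2_000_000, 2_000_000, 500_000, 500_000]
result = mask_low_trading_val(close, volume, thresh_val=10_000_000, window_size=2)
```

Here the first value is masked because the window is incomplete, and the last is masked because the two-period average trading value drops to 5,000,000.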

filter_missing_vals_gaps(gap_window: int = 30) CleanData

Filters values before a large gap of missing values, replacing them with NaNs.

Parameters:

gap_window (int, default 30) – Size of window where all values are missing (NaNs).

Returns:

CleanData object

Return type:

CleanData
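One possible reading of this filter is sketched below: find the last run of at least gap_window consecutive NaNs and discard everything up to the end of that run. The helper `mask_before_gap` is hypothetical, uses only the standard library, and may not match the library's exact rule for choosing the gap.

```python
import math


def mask_before_gap(values, gap_window=30):
    """Set all values up to the end of the last long NaN run to NaN."""
    run = 0  # length of the current run of consecutive NaNs
    cut = 0  # number of leading observations to discard
    for i, v in enumerate(values):
        run = run + 1 if math.isnan(v) else 0
        if run >= gap_window:
            cut = i + 1
    return [float("nan")] * cut + list(values[cut:])


nan = float("nan")
series = [1.0, 2.0, nan, nan, nan, 5.0, nan, 7.0]
masked = mask_before_gap(series, gap_window=3)
```

With gap_window=3, the three-period gap causes the first two observations to be dropped, while the isolated NaN later in the series does not trigger any further masking.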

filter_min_nobs(ts_obs: int = 100, cs_obs: int = 2) CleanData

Removes tickers from dataframe if the ticker has fewer than a minimum number of observations.

Parameters:
  • ts_obs (int, default 100) – Minimum number of observations for field/column over time series.

  • cs_obs (int, default 2) – Minimum number of observations for tickers over the cross-section.

Returns:

CleanData object

Return type:

CleanData
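The two thresholds can be sketched on a dictionary that maps each ticker to a date-aligned list of values. This is a standard-library illustration under stated assumptions, not the library's code: the helper `drop_sparse` is hypothetical, ts_obs is read as a per-ticker count of non-missing values, and cs_obs is read as the minimum number of tickers with data on a given date.

```python
import math


def drop_sparse(data, ts_obs=100, cs_obs=2):
    """Drop tickers with too few observations, then dates with too few tickers."""
    # Time-series check: keep tickers with at least ts_obs non-missing values.
    kept = {
        ticker: vals
        for ticker, vals in data.items()
        if sum(not math.isnan(x) for x in vals) >= ts_obs
    }
    if not kept:
        return {}
    # Cross-section check: keep dates where at least cs_obs tickers have data.
    n_dates = len(next(iter(kept.values())))
    keep_rows = [
        i
        for i in range(n_dates)
        if sum(not math.isnan(vals[i]) for vals in kept.values()) >= cs_obs
    ]
    return {ticker: [vals[i] for i in keep_rows] for ticker, vals in kept.items()}


nan = float("nan")
data = {
    "BTC": [1.0, 2.0, 3.0, 4.0],
    "ETH": [nan, 2.0, 3.0, 4.0],
    "XYZ": [nan, nan, nan, 4.0],
}
result = drop_sparse(data, ts_obs=3, cs_obs=2)
```

In this example XYZ is removed for having a single observation, and the first date is removed because only BTC has data on it.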

filter_delisted_tickers(method: str = 'replace') CleanData

Removes delisted tickers from dataframe.

Parameters:

method (str, {'replace', 'remove'}, default 'replace') – Method to use for handling delisted tickers.

Returns:

CleanData object

Return type:

CleanData

filter_tickers(tickers_list) CleanData

Removes specified tickers from dataframe.

Parameters:

tickers_list (str or list) – Ticker or list of tickers to be removed. Can be used to exclude tickers from data analysis, e.g. stablecoins or indexes.

Returns:

CleanData object

Return type:

CleanData
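Because tickers_list accepts either a single string or a list, a caller-side equivalent needs to normalize the argument first. The sketch below shows that pattern on a plain dictionary keyed by ticker; `drop_tickers` is a hypothetical helper and not part of the library.

```python
def drop_tickers(data, tickers_list):
    """Return a copy of data without the specified ticker(s)."""
    # Accept a single ticker as a string, mirroring the (str or list) parameter.
    if isinstance(tickers_list, str):
        tickers_list = [tickers_list]
    excluded = set(tickers_list)
    return {ticker: vals for ticker, vals in data.items() if ticker not in excluded}


data = {"BTC": [1.0, 2.0], "ETH": [3.0, 4.0], "USDT": [1.0, 1.0]}
no_stablecoins = drop_tickers(data, "USDT")
only_btc = drop_tickers(data, ["ETH", "USDT"])
```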

show_plot(plot_series: tuple = ('BTC', 'close'), compare_series: bool = True) None

Plots clean time series and compares it to the raw series.

Parameters:
  • plot_series (tuple, optional, default ('BTC', 'close')) – (ticker, field) tuple identifying the time series to plot.

  • compare_series (bool, default True) – Whether to compare the clean time series with the raw series.

get(attr='df') pandas.DataFrame

Returns CleanData object attribute.

Parameters:

attr (str, {'df', 'outliers', 'yhat', 'filtered_tickers', 'summary'}, default 'df') – CleanData object attribute to return.

Returns:

Requested attribute of the CleanData object

Return type:

pandas.DataFrame