cryptodatapy.transform.clean
Classes
CleanData – Cleans data to improve data quality.
Module Contents
- class cryptodatapy.transform.clean.CleanData(df: pandas.DataFrame)
Cleans data to improve data quality.
- raw_df
- df
- excluded_cols = None
- outliers = None
- yhat = None
- repaired_df = None
- filtered_df = None
- filtered_tickers = None
- summary
- initialize_summary() None
Initializes summary dataframe with data quality metrics.
- check_types() None
Checks data types of columns and converts them to the appropriate data types.
- Returns:
CleanData object
- Return type:
- filter_outliers(od_method: str = 'mad', excl_cols: str | list | None = None, **kwargs) CleanData
Filters outliers.
- Parameters:
od_method (str, {'atr', 'iqr', 'mad', 'z_score', 'ewma', 'stl', 'seasonal_decomp', 'prophet'}, default 'mad') – Outlier detection method to use for filtering.
excl_cols (str or list, optional, default None) – Name(s) of the column(s) to exclude from outlier filtering.
- Returns:
CleanData object
- Return type:
CleanData
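To illustrate the idea behind the 'mad' option, here is a minimal sketch of a median-absolute-deviation filter in plain Python. It is not the library's implementation: the function name `mad_filter`, the `thresh` parameter, and the use of the full sample rather than a rolling window are all assumptions made for illustration.

```python
from statistics import median

def mad_filter(values, thresh=3.0):
    """Hypothetical helper: replace values whose robust z-score exceeds `thresh` with None."""
    obs = [v for v in values if v is not None]
    med = median(obs)
    mad = median(abs(v - med) for v in obs)
    if mad == 0:
        return list(values)  # no dispersion, so nothing can be flagged
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are normally distributed
    scale = 1.4826 * mad
    return [
        None if v is not None and abs(v - med) / scale > thresh else v
        for v in values
    ]

prices = [100.0, 101.0, 99.5, 100.5, 250.0, 100.2]
filtered = mad_filter(prices)
print(filtered)
```

The spike at 250.0 is replaced with None while the other observations are kept, which mirrors how a filtering step leaves gaps for a later repair step to fill.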
- repair_outliers(imp_method: str = 'interpolate', **kwargs) CleanData
Repairs outliers using an imputation method.
- Parameters:
imp_method (str, {'fwd_fill', 'interpolate', 'fcst'}, default 'interpolate') – Imputation method used to replace filtered outliers.
- Returns:
CleanData object
- Return type:
CleanData
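The difference between the 'fwd_fill' and 'interpolate' options can be sketched in plain Python as follows. Both helpers are hypothetical stand-ins, not the library's code, and the interpolation shown is simple linear interpolation between the nearest known neighbours; the 'fcst' option is not covered.

```python
def fwd_fill(values):
    """Hypothetical helper: carry the last known value forward over gaps."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def interpolate(values):
    """Hypothetical helper: linearly interpolate gaps between known values."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out  # leading and trailing gaps are left untouched in this sketch

filtered = [100.0, None, None, 106.0, None]
print(fwd_fill(filtered))
print(interpolate(filtered))
```

Forward filling repeats stale values, whereas interpolation draws a straight line across the gap, so the choice affects any returns computed from the repaired series.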
- filter_avg_trading_val(thresh_val: int = 10000000, window_size: int = 30) CleanData
Filters values below a threshold of average trading value (price * volume/size in quote currency) over some lookback window, replacing them with NaNs.
- Parameters:
thresh_val (int, default 10,000,000) – Threshold/cut-off for avg trading value.
window_size (int, default 30) – Size of rolling window.
- Returns:
CleanData object
- Return type:
CleanData
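A rough sketch of the average-trading-value filter described above, in plain Python. The helper name `avg_trading_val_filter` and the handling of the first few observations (averaging over however many values are available) are assumptions; the library may treat incomplete windows differently.

```python
def avg_trading_val_filter(close, volume, thresh_val=10_000_000, window_size=30):
    """Hypothetical helper: blank out prices whose rolling average
    trading value (price * volume) falls below `thresh_val`."""
    out = []
    for i in range(len(close)):
        start = max(0, i - window_size + 1)
        # trading value for each observation in the lookback window
        window = [c * v for c, v in zip(close[start:i + 1], volume[start:i + 1])]
        avg_val = sum(window) / len(window)
        out.append(close[i] if avg_val >= thresh_val else None)
    return out

close = [10.0, 10.0, 10.0, 10.0]
volume = [100.0, 100.0, 500.0, 500.0]
result = avg_trading_val_filter(close, volume, thresh_val=2000, window_size=2)
print(result)
```

With a two-period window the rolling averages are 1000, 1000, 3000 and 5000, so only the last two prices clear the threshold of 2000.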
- filter_missing_vals_gaps(gap_window: int = 30) CleanData
Filters values before a large gap of missing values, replacing them with NaNs.
- Parameters:
gap_window (int, default 30) – Size of window where all values are missing (NaNs).
- Returns:
CleanData object
- Return type:
CleanData
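One way to read the behaviour described above is: find the most recent run of at least gap_window consecutive missing values and discard everything up to the end of that run. The plain-Python sketch below follows that reading; the helper name and the choice of the most recent gap are assumptions, not confirmed library behaviour.

```python
def filter_before_gap(values, gap_window=30):
    """Hypothetical helper: blank out all values before (and inside)
    the most recent run of `gap_window` or more missing values."""
    run = 0
    gap_end = None  # index of the last element of the latest qualifying gap
    for i, v in enumerate(values):
        run = run + 1 if v is None else 0
        if run >= gap_window:
            gap_end = i
    if gap_end is None:
        return list(values)  # no gap is long enough, keep everything
    return [None] * (gap_end + 1) + list(values[gap_end + 1:])

series = [1.0, 2.0, None, None, None, 3.0, None, 4.0]
print(filter_before_gap(series, gap_window=3))
print(filter_before_gap(series, gap_window=4))
```

In the first call the three-value gap qualifies, so the two early observations are dropped; in the second call no gap is long enough and the series is returned unchanged.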
- filter_min_nobs(ts_obs: int = 100, cs_obs: int = 2) CleanData
Removes tickers from the dataframe if they have fewer than a minimum number of observations.
- Parameters:
ts_obs (int, default 100) – Minimum number of observations for field/column over time series.
cs_obs (int, default 2) – Minimum number of observations for tickers over the cross-section.
- Returns:
CleanData object
- Return type:
CleanData
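The time-series half of this filter (ts_obs) can be sketched in plain Python as below. The dictionary-of-lists layout and the helper name `min_ts_obs_filter` are assumptions for illustration only, and the cross-sectional check (cs_obs) is deliberately left out.

```python
def min_ts_obs_filter(data, ts_obs=100):
    """Hypothetical helper: keep only tickers with at least `ts_obs`
    non-missing observations over the time series."""
    return {
        ticker: list(vals)
        for ticker, vals in data.items()
        if sum(v is not None for v in vals) >= ts_obs
    }

data = {
    "BTC": [1.0, 2.0, 3.0, 4.0],
    "ETH": [1.0, None, None, 2.0],
    "XYZ": [None, None, None, 1.0],
}
kept = min_ts_obs_filter(data, ts_obs=3)
print(sorted(kept))
```

Only BTC has three or more observations here, so the other two tickers are removed.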
- filter_delisted_tickers(method: str = 'replace') CleanData
Removes delisted tickers from dataframe.
- Parameters:
method (str, {'replace', 'remove'}, default 'replace') – Method to use for handling delisted tickers.
- Returns:
CleanData object
- Return type:
CleanData
- filter_tickers(tickers_list) CleanData
Removes specified tickers from dataframe.
- Parameters:
tickers_list (str or list) – Ticker or list of tickers to be removed. Can be used to exclude tickers from data analysis, e.g. stablecoins or indexes.
- Returns:
CleanData object
- Return type:
CleanData
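Because tickers_list may be a single string or a list, a removal step typically normalizes the argument first. A minimal plain-Python sketch, assuming a hypothetical dictionary keyed by ticker rather than the library's dataframe:

```python
def drop_tickers(data, tickers_list):
    """Hypothetical helper: remove the given ticker(s) from a dict keyed by ticker."""
    if isinstance(tickers_list, str):
        tickers_list = [tickers_list]  # allow a single ticker as a plain string
    drop = set(tickers_list)
    return {t: v for t, v in data.items() if t not in drop}

data = {"BTC": [1.0], "ETH": [2.0], "USDT": [1.0], "USDC": [1.0]}
print(sorted(drop_tickers(data, ["USDT", "USDC"])))
print(sorted(drop_tickers(data, "USDT")))
```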
- show_plot(plot_series: tuple = ('BTC', 'close'), compare_series: bool = True) None
Plots clean time series and compares it to the raw series.
- Parameters:
plot_series (tuple, optional, default ('BTC', 'close')) – The (ticker, field) tuple identifying the time series to plot.
compare_series (bool, default True) – Whether to compare the clean time series with the raw series.