cryptodatapy.transform.clean
============================

.. py:module:: cryptodatapy.transform.clean

Classes
-------

.. autoapisummary::

   cryptodatapy.transform.clean.CleanData

Module Contents
---------------

.. py:class:: CleanData(df: pandas.DataFrame)

   Cleans data to improve data quality.

   .. py:attribute:: raw_df

   .. py:attribute:: df

   .. py:attribute:: excluded_cols
      :value: None

   .. py:attribute:: outliers
      :value: None

   .. py:attribute:: yhat
      :value: None

   .. py:attribute:: repaired_df
      :value: None

   .. py:attribute:: filtered_df
      :value: None

   .. py:attribute:: filtered_tickers
      :value: None

   .. py:attribute:: summary

   .. py:method:: initialize_summary() -> None

      Initializes summary dataframe with data quality metrics.

   .. py:method:: check_types() -> None

      Checks data types of columns and converts them to the appropriate data types.

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: filter_outliers(od_method: str = 'mad', excl_cols: Optional[Union[str, list]] = None, **kwargs) -> CleanData

      Filters outliers.

      :param od_method: Outlier detection method to use for filtering.
      :type od_method: str, {'atr', 'iqr', 'mad', 'z_score', 'ewma', 'stl', 'seasonal_decomp', 'prophet'}, default 'mad'
      :param excl_cols: Name(s) of columns to exclude from outlier filtering.
      :type excl_cols: str or list, optional

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: repair_outliers(imp_method: str = 'interpolate', **kwargs) -> CleanData

      Repairs outliers using an imputation method.

      :param imp_method: Imputation method used to replace filtered outliers.
      :type imp_method: str, {'fwd_fill', 'interpolate', 'fcst'}, default 'interpolate'

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: filter_avg_trading_val(thresh_val: int = 10000000, window_size: int = 30) -> CleanData

      Filters values whose average trading value (price * volume/size in quote currency) over a lookback window falls below a threshold, replacing them with NaNs.

      :param thresh_val: Threshold/cut-off for avg trading value.
      :type thresh_val: int, default 10,000,000
      :param window_size: Size of rolling window.
      :type window_size: int, default 30

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: filter_missing_vals_gaps(gap_window: int = 30) -> CleanData

      Filters values before a large gap of missing values, replacing them with NaNs.

      :param gap_window: Size of window where all values are missing (NaNs).
      :type gap_window: int, default 30

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: filter_min_nobs(ts_obs: int = 100, cs_obs: int = 2) -> CleanData

      Removes tickers from dataframe if the ticker has fewer than a minimum number of observations.

      :param ts_obs: Minimum number of observations for field/column over time series.
      :type ts_obs: int, default 100
      :param cs_obs: Minimum number of observations for tickers over the cross-section.
      :type cs_obs: int, default 2

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: filter_delisted_tickers(method: str = 'replace') -> CleanData

      Removes delisted tickers from dataframe.

      :param method: Method to use for handling delisted tickers.
      :type method: str, {'replace', 'remove'}, default 'replace'

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: filter_tickers(tickers_list) -> CleanData

      Removes specified tickers from dataframe.

      :param tickers_list: List of tickers to be removed. Can be used to remove tickers to be excluded from data analysis, e.g. stablecoins or indexes.
      :type tickers_list: str or list

      :returns: CleanData object
      :rtype: CleanData

   .. py:method:: show_plot(plot_series: tuple = ('BTC', 'close'), compare_series: bool = True) -> None

      Plots clean time series and compares it to the raw series.

      :param plot_series: Plots the time series of a specific (ticker, field) tuple.
      :type plot_series: tuple, optional, default ('BTC', 'close')
      :param compare_series: Compares clean time series with raw series.
      :type compare_series: bool, default True

   .. py:method:: get(attr='df') -> pandas.DataFrame

      Returns CleanData object attribute.

      :param attr: CleanData object attribute to return.
      :type attr: str, {'df', 'outliers', 'yhat', 'filtered_tickers', 'summary'}, default 'df'

      :returns: Requested CleanData attribute
      :rtype: pandas.DataFrame
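
The filter-then-repair workflow described by ``filter_outliers`` and ``repair_outliers`` can be illustrated with a small, dependency-free sketch. This is not the library's implementation: the function names ``filter_outliers_mad`` and ``repair_interpolate``, the threshold of 3.0, and the use of a single global median (rather than whatever windowing the library applies) are all assumptions made for illustration. It only shows the general idea of replacing MAD-flagged points with NaN and then filling them by linear interpolation.

```python
import math
from statistics import median


def filter_outliers_mad(values, thresh=3.0):
    """Replace points whose MAD-based score exceeds `thresh` with NaN.

    Hypothetical helper: uses one global median over the whole series.
    """
    clean = [v for v in values if not math.isnan(v)]
    med = median(clean)
    mad = median(abs(v - med) for v in clean)
    if mad == 0:
        # No dispersion to scale by, so nothing can be flagged.
        return list(values)
    # 1.4826 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    scale = 1.4826 * mad
    # NaN inputs fail the comparison and are passed through unchanged.
    return [
        float("nan") if abs(v - med) / scale > thresh else v
        for v in values
    ]


def repair_interpolate(values):
    """Fill interior NaNs linearly between the nearest valid neighbours.

    Leading and trailing NaNs are left as NaN.
    """
    out = list(values)
    valid = [i for i, v in enumerate(out) if not math.isnan(v)]
    for left, right in zip(valid, valid[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


# Made-up price series with one obvious spike at position 3.
prices = [100.0, 101.0, 99.0, 500.0, 102.0, 100.0, 98.0]
filtered = filter_outliers_mad(prices)   # spike becomes NaN
repaired = repair_interpolate(filtered)  # NaN filled from its neighbours
print(repaired)
```

Because each filter and repair method is documented as returning a ``CleanData`` object, the equivalent library calls would presumably chain, for example ``CleanData(df).filter_outliers(od_method='mad').repair_outliers(imp_method='interpolate').get('df')``; that chaining is inferred from the signatures above and has not been checked against the library's behavior.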