Python Data Analysis (3): pandas resample (Resampling)

Below is the definition of the resample method in pandas. The documentation at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling explains it in more detail.

def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
                 label=None, convention='start', kind=None, loffset=None,
                 limit=None, base=0, on=None, level=None):
        """Convenience method for frequency conversion and resampling of time
        series.  Object must have a datetime-like index (DatetimeIndex,
        PeriodIndex, or TimedeltaIndex), or pass datetime-like values
        to the on or level keyword. (Resamples data and converts its frequency; the data must have a datetime-like index.)

        Parameters
        ----------
        rule : string
            the offset string or object representing target conversion (the offset that specifies the target frequency)
        axis : int, optional, default 0 (the axis to operate on)
        closed : {'right', 'left'}
            Which side of bin interval is closed. The default is 'left'
            for all frequency offsets except for 'M', 'A', 'Q', 'BM',
            'BA', 'BQ', and 'W' which all have a default of 'right'. (Which side of each bin interval is closed, i.e. includes its edge value.)
        label : {'right', 'left'}
            Which bin edge label to label bucket with. The default is 'left'
            for all frequency offsets except for 'M', 'A', 'Q', 'BM',
            'BA', 'BQ', and 'W' which all have a default of 'right'. (Which edge of each bin is kept as its label.)
        convention : {'start', 'end', 's', 'e'}
            For PeriodIndex only, controls whether to use the start or end of
            `rule`
        kind: {'timestamp', 'period'}, optional
            Pass 'timestamp' to convert the resulting index to a
            ``DateTimeIndex`` or 'period' to convert it to a ``PeriodIndex``.
            By default the input representation is retained.
        loffset : timedelta
            Adjust the resampled time labels
        base : int, default 0
            For frequencies that evenly subdivide 1 day, the "origin" of the
            aggregated intervals. For example, for '5min' frequency, base could
            range from 0 through 4. Defaults to 0
        on : string, optional
            For a DataFrame, column to use instead of index for resampling.
            Column must be datetime-like.

            .. versionadded:: 0.19.0

        level : string or int, optional
            For a MultiIndex, level (name or number) to use for
            resampling.  Level must be datetime-like.

            .. versionadded:: 0.19.0
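
The `base` and `loffset` parameters listed above belong to older pandas releases. As far as I know, both were deprecated in pandas 1.1 and removed in 2.0, where `offset` (together with `origin`) takes over the role of `base`. A minimal sketch, assuming a pandas version that supports `offset`; the series mirrors the one used in the Examples section, and the `'min'` alias is used instead of `'T'` because newer releases warn about `'T'`:

```python
import pandas as pd

# Nine one-minute timestamps, same shape as the series in the Examples section.
index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)

# Shift the bin edges by one minute: bins now start at 00:01, 00:04, 00:07,
# and the value at 00:00 lands in a bin that starts at 23:58 the day before.
shifted = series.resample('3min', offset='1min').sum()
print(shifted)
```

On a release that still accepts `base`, `base=1` with a 3-minute rule should describe the same shift.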

        Returns
        -------
        Resampler object

        Notes
        -----
        See the `user guide
        <http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling>`_
        for more.

        To learn more about the offset strings, please see `this link
        <http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases>`__.

        Examples
        --------

        Start by creating a series with 9 one minute timestamps. (Create a time series with a frequency of 1 minute.)

        >>> index = pd.date_range('1/1/2000', periods=9, freq='T')
        >>> series = pd.Series(range(9), index=index)
        >>> series
        2000-01-01 00:00:00    0
        2000-01-01 00:01:00    1
        2000-01-01 00:02:00    2
        2000-01-01 00:03:00    3
        2000-01-01 00:04:00    4
        2000-01-01 00:05:00    5
        2000-01-01 00:06:00    6
        2000-01-01 00:07:00    7
        2000-01-01 00:08:00    8
        Freq: T, dtype: int64

        Downsample the series into 3 minute bins and sum the values
        of the timestamps falling into a bin. (Downsample to 3 minutes.)

        >>> series.resample('3T').sum()
        2000-01-01 00:00:00     3
        2000-01-01 00:03:00    12
        2000-01-01 00:06:00    21
        Freq: 3T, dtype: int64

        Downsample the series into 3 minute bins as above, but label each
        bin using the right edge instead of the left. Please note that the
        value in the bucket used as the label is not included in the bucket,
        which it labels. For example, in the original series the
        bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed
        value in the resampled bucket with the label ``2000-01-01 00:03:00``
        does not include 3 (if it did, the summed value would be 6, not 3).
        To include this value close the right side of the bin interval as
        illustrated in the example below this one.

        >>> series.resample('3T', label='right').sum()  # keep the right edge as the label; the previous result used the left edge
        2000-01-01 00:03:00     3
        2000-01-01 00:06:00    12
        2000-01-01 00:09:00    21
        Freq: 3T, dtype: int64

        Downsample the series into 3 minute bins as above, but close the right
        side of the bin interval. (Downsample to 3 minutes with the right edge closed.)

        >>> series.resample('3T', label='right', closed='right').sum()
        2000-01-01 00:00:00     0
        2000-01-01 00:03:00     6
        2000-01-01 00:06:00    15
        2000-01-01 00:09:00    15
        Freq: 3T, dtype: int64
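
The three downsampling examples above differ only in `closed` and `label`. The sketch below puts them side by side so the effect of each keyword is easier to see. It rebuilds the same 9-point series and uses the `'min'` alias, which is an assumption about the installed pandas version (older releases also accept `'T'`):

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)

# closed decides which edge belongs to the bin; label decides which edge names it.
left_left = series.resample('3min', closed='left', label='left').sum()
left_right = series.resample('3min', closed='left', label='right').sum()
right_right = series.resample('3min', closed='right', label='right').sum()

for name, result in [('closed=left,  label=left', left_left),
                     ('closed=left,  label=right', left_right),
                     ('closed=right, label=right', right_right)]:
    print(name)
    print(result, end='\n\n')
```

Changing `label` only renames the bins, so the sums stay the same. Changing `closed` moves the edge values into the neighbouring bin, which is why the third result has an extra bin and different sums.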

        Upsample the series into 30 second bins. (Upsample to 30 seconds.)

        >>> series.resample('30S').asfreq()[0:5] #select first 5 rows
        2000-01-01 00:00:00   0.0
        2000-01-01 00:00:30   NaN
        2000-01-01 00:01:00   1.0
        2000-01-01 00:01:30   NaN
        2000-01-01 00:02:00   2.0
        Freq: 30S, dtype: float64

        Upsample the series into 30 second bins and fill the ``NaN``
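        values using the ``pad`` method. (Forward zero-order hold.)
        pad/ffill: fill each missing value with the previous non-missing value.
        backfill/bfill: fill each missing value with the next non-missing value.

        >>> series.resample('30S').pad()[0:5]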
        2000-01-01 00:00:00    0
        2000-01-01 00:00:30    0
        2000-01-01 00:01:00    1
        2000-01-01 00:01:30    1
        2000-01-01 00:02:00    2
        Freq: 30S, dtype: int64

        Upsample the series into 30 second bins and fill the
        ``NaN`` values using the ``bfill`` method. (Backward zero-order hold.)

        >>> series.resample('30S').bfill()[0:5]
        2000-01-01 00:00:00    0
        2000-01-01 00:00:30    1
        2000-01-01 00:01:00    1
        2000-01-01 00:01:30    2
        2000-01-01 00:02:00    2
        Freq: 30S, dtype: int64
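
The upsampling examples above can be compared in one runnable sketch. Two points are assumptions on my side: `ffill()` is used in place of `pad()`, because `pad()` on a resampler appears to have been removed in recent pandas releases, and `interpolate()` is added as a third fill option that the docstring does not show.

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)

upsampled = series.resample('30s')

raw = upsampled.asfreq()          # new 30-second slots are left as NaN
forward = upsampled.ffill()       # carry the previous value forward
backward = upsampled.bfill()      # pull the next value backward
linear = upsampled.interpolate()  # linear interpolation between neighbours

comparison = pd.DataFrame({'asfreq': raw, 'ffill': forward,
                           'bfill': backward, 'interpolate': linear})
print(comparison.head())
```

Forward fill is the usual choice when a value should stay valid until the next observation arrives; interpolation suits quantities that change smoothly between samples.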

        Pass a custom function via ``apply``

        >>> def custom_resampler(array_like):
        ...     return np.sum(array_like)+5

        >>> series.resample('3T').apply(custom_resampler)
        2000-01-01 00:00:00     8
        2000-01-01 00:03:00    17
        2000-01-01 00:06:00    26
        Freq: 3T, dtype: int64
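
Besides a single custom function, a resampler can also compute several aggregates at once through `agg`. The sketch below is an illustration only; `value_range` is a hypothetical helper that does not appear in the docstring.

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)


def value_range(values):
    # Hypothetical helper: spread between the largest and smallest value in a bin.
    return values.max() - values.min()


bins = series.resample('3min')

summary = bins.agg(['sum', 'mean', 'max'])  # one column per aggregate
spread = bins.apply(value_range)            # custom function, one value per bin

print(summary)
print(spread)
```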

        For a Series with a PeriodIndex, the keyword `convention` can be
        used to control whether to use the start or end of `rule`.

        >>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
                                                        freq='A',
                                                        periods=2))
        >>> s
        2012    1
        2013    2
        Freq: A-DEC, dtype: int64

        Resample by month using 'start' `convention`. Values are assigned to
        the first month of the period.

        >>> s.resample('M', convention='start').asfreq().head()
        2012-01    1.0
        2012-02    NaN
        2012-03    NaN
        2012-04    NaN
        2012-05    NaN
        Freq: M, dtype: float64

        Resample by month using 'end' `convention`. Values are assigned to
        the last month of the period.

        >>> s.resample('M', convention='end').asfreq()
        2012-12    1.0
        2013-01    NaN
        2013-02    NaN
        2013-03    NaN
        2013-04    NaN
        2013-05    NaN
        2013-06    NaN
        2013-07    NaN
        2013-08    NaN
        2013-09    NaN
        2013-10    NaN
        2013-11    NaN
        2013-12    2.0
        Freq: M, dtype: float64

        For DataFrame objects, the keyword ``on`` can be used to specify the
        column instead of the index for resampling.

        >>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
        >>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
        >>> df.resample('3T', on='time').sum()
                             a  b  c  d
        time
        2000-01-01 00:00:00  0  3  6  9
        2000-01-01 00:03:00  0  3  6  9
        2000-01-01 00:06:00  0  3  6  9
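
As a cross-check of the `on` keyword, the sketch below resamples the same frame twice: once through `on='time'` and once after moving the column into the index. The frame is rebuilt with plain lists and the `'min'` alias, which are my choices rather than the docstring's.

```python
import pandas as pd

# Nine identical rows [0, 1, 2, 3] plus a datetime column.
df = pd.DataFrame([list(range(4))] * 9, columns=['a', 'b', 'c', 'd'])
df['time'] = pd.date_range('2000-01-01', periods=9, freq='min')

by_column = df.resample('3min', on='time').sum()
by_index = df.set_index('time').resample('3min').sum()

print(by_column)
print((by_column.values == by_index.values).all())
```

Using `on` avoids changing the frame's index when the timestamps are only needed for this one aggregation.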

        For a DataFrame with MultiIndex, the keyword ``level`` can be used to
        specify on level the resampling needs to take place.

        >>> time = pd.date_range('1/1/2000', periods=5, freq='T')
        >>> df2 = pd.DataFrame(data=10*[range(4)],
                               columns=['a', 'b', 'c', 'd'],
                               index=pd.MultiIndex.from_product([time, [1, 2]])
                               )
        >>> df2.resample('3T', level=0).sum()
                             a  b   c   d
        2000-01-01 00:00:00  0  6  12  18
        2000-01-01 00:03:00  0  4   8  12
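
The `level` keyword accepts either a level number or a level name. A small sketch of both forms follows; the level names `'time'` and `'key'` are made up for this illustration and are not part of the docstring.

```python
import pandas as pd

time = pd.date_range('2000-01-01', periods=5, freq='min')
# Level names are hypothetical; the docstring example leaves the levels unnamed.
idx = pd.MultiIndex.from_product([time, [1, 2]], names=['time', 'key'])
df2 = pd.DataFrame([list(range(4))] * 10, columns=['a', 'b', 'c', 'd'], index=idx)

by_name = df2.resample('3min', level='time').sum()
by_position = df2.resample('3min', level=0).sum()

print(by_name)
```

The first bin covers three timestamps with two rows each (six rows), and the second covers the remaining two timestamps (four rows), which explains the different column sums.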
