Time Series in pandas

Time Series
import pandas as pd
import numpy as np

Date and Time data types and tools

from datetime import datetime
from datetime import date
from datetime import timedelta
now=datetime.now()
now
datetime.datetime(2020, 5, 5, 9, 51, 27, 686891)
now.year,now.month,now.day
(2020, 5, 5)
now.time()
datetime.time(9, 51, 27, 686891)

datetime stores both the date and time down to the microsecond. timedelta represents the temporal difference between two datetime objects.

delta=datetime(2020,4,29)-datetime(2018,8,19)
delta
datetime.timedelta(619)
delta.days
619
delta.seconds
0
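As an aside, timedelta normalizes whatever arguments it is given into days, seconds, and microseconds, and total_seconds() collapses the whole span into a single number. A minimal sketch (the values are my own):

```python
from datetime import timedelta

# timedelta stores days, seconds, and microseconds separately;
# mixed arguments are normalized into those three fields.
span = timedelta(days=1, hours=2, minutes=30)
print(span.days, span.seconds)   # 1 9000  (2h30m = 9000 seconds)
print(span.total_seconds())      # 95400.0
```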

You can add (or subtract) a timedelta, or a multiple thereof, to a datetime object to yield a new, shifted object:

start=datetime(2020,4,30)
start+timedelta(100)
datetime.datetime(2020, 8, 8, 0, 0)
datetime.now()
datetime.datetime(2020, 5, 5, 9, 51, 27, 888352)
date(2008,8,8)-date(2000,8,8)
datetime.timedelta(2922)

Converting between string and datetime

You can format datetime and pandas Timestamp objects as strings using str or the strftime method, passing a format specification.

stamp=datetime(2011,1,3)
str(stamp)
'2011-01-03 00:00:00'
stamp.strftime('%Y--%m-!-%d') # can be memorized as 'str from time'; an instance method of datetime objects
'2011--01-!-03'
description={'%Y':'four-digit year',
           '%y':'two-digit year',
           '%m':'two-digit month[01,12]',
           '%d':'two-digit day[01,31]',
           '%H':'Hour(24-hour clock)[00,23]',
           '%I':'Hour(12-hour clock)[01,12]',
           '%M':'two-digit minute[00,59]',
           '%S':'second[00,61](seconds60,61 account for leap seconds)',
           '%w':'weekday as integer[0(sunday),6]',
           '%U':'week number of the year[00,53];sunday is considered the first day of the week,and days before the first sunday of the year are "week 0"',
           '%W':'week number of the year[00,53];monday is considered the first day of the week and days before the first monday of the year are "week 0"',
           '%z':'UTC time zone offset as +HHMM or -HHMM;empty if time zone naive',
           '%F':'shortcut for %Y-%m-%d',
           '%D':'shortcut for %m/%d/%y(e.g04/18/12)'}
pd.DataFrame(description,index=[0])
(a one-row DataFrame with 14 columns, too wide to display here; see the stacked view below)
pd.DataFrame(description,index=[0]).stack()
0  %Y                                      four-digit year
   %y                                       two-digit year
   %m                               two-digit month[01,12]
   %d                                 two-digit day[01,31]
   %H                           Hour(24-hour clock)[00,23]
   %I                           Hour(12-hour clock)[01,12]
   %M                              two-digit minute[00,59]
   %S    second[00,61](seconds60,61 account for leap se...
   %w                      weekday as integer[0(sunday),6]
   %U    week number of the year[00,53];sunday is consi...
   %W    week number of the year[00,53];monday is consi...
   %z    UTC time zone offset as +HHMM or -HHMM;empty i...
   %F                                shortcut for %Y-%m-%d
   %D                   shortcut for %m/%d/%y(e.g04/18/12)
dtype: object
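A quick round trip using a few of these codes (a sketch; the mnemonic comments are mine):

```python
from datetime import datetime

# strftime ('str from time') formats; strptime ('str parse time') parses back.
dt = datetime(2011, 1, 3, 14, 30)
text = dt.strftime('%Y-%m-%d %H:%M')
parsed = datetime.strptime(text, '%Y-%m-%d %H:%M')
print(text)            # 2011-01-03 14:30
print(parsed == dt)    # True
```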

You can use these same format codes to convert strings to dates using datetime.strptime:

value='2011-01-03'
datetime.strptime(value,'%Y-%m-%d')  # can be memorized as 'str parse time'; a datetime class method
datetime.datetime(2011, 1, 3, 0, 0)
datestrs=['7/6/2011','8/6/2011']
[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]
[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]

datetime.strptime is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package (this is installed automatically when you install pandas):

from dateutil.parser import parse  # notice that, it is `parse` not `parser`!
parse('2020--4--30')
datetime.datetime(2020, 4, 30, 0, 0)

dateutil is capable of parsing most human-intelligible date representations:

parse('Jan 31,1997 10:45PM')  # note: the year 1997 was not picked up in this run; dateutil silently defaulted to the current year (2020)
datetime.datetime(2020, 1, 31, 22, 45)

In international locales, the day appearing before the month is very common, so you can pass dayfirst=True to indicate this:

parse('6/12/2022',dayfirst=True)
datetime.datetime(2022, 12, 6, 0, 0)

pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The to_datetime method parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly.

datestrs=['2011-07-16 12:00:00','2011-08-06 00:00:00']
pd.to_datetime(['2011-07-16 ','2011-08-06 '])
DatetimeIndex(['2011-07-16', '2011-08-06'], dtype='datetime64[ns]', freq=None)
pd.to_datetime(datestrs)
DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)

It also handles values that should be considered missing (None, empty string, etc.):

idx=pd.to_datetime(datestrs+[None]);idx
DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

NaT (Not a Time) is pandas's null value for timestamp data.
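NaT plays the role NaN plays for floats: it marks a missing slot and is detected by pd.isna. A small sketch (the input strings are my own):

```python
import pandas as pd

# None and the empty string both become NaT during parsing.
vals = pd.to_datetime(['2011-07-16', None, ''])
print(vals.isna())        # [False  True  True]
print(pd.isna(pd.NaT))    # True
```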

Notice
dateutil.parser.parse is a useful but imperfect tool; it will recognize some strings as dates that you might prefer it didn't (for example, '42' is parsed as the year 2042 combined with today's calendar date).
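One sketch of that imperfection: when components are missing, parse silently fills them in from the current date, so the result depends on the day you run it (the example string is my own):

```python
from dateutil.parser import parse

# No year given: dateutil quietly substitutes the current year.
result = parse('Jan 31')
print(result.month, result.day)   # 1 31 -- but result.year is today's year
```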

Time series basics

A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects.

dates=[datetime(2020,1,2),datetime(2020,1,5),datetime(2020,1,7),datetime(2020,1,8),datetime(2020,1,10),datetime(2020,1,12)]
ts=pd.Series(np.random.randn(6),index=dates);ts
2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64
pd.Series(np.random.randn(3),index=idx)
2011-07-16 12:00:00    2.259805
2011-08-06 00:00:00   -0.877063
NaT                   -0.697678
dtype: float64

Under the hood, these datetime objects have been put in a DatetimeIndex:

ts.index
DatetimeIndex(['2020-01-02', '2020-01-05', '2020-01-07', '2020-01-08',
               '2020-01-10', '2020-01-12'],
              dtype='datetime64[ns]', freq=None)

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

ts[::2]
2020-01-02   -1.140949
2020-01-07   -0.046164
2020-01-10   -0.126047
dtype: float64
ts+ts[::2]
2020-01-02   -2.281898
2020-01-05         NaN
2020-01-07   -0.092328
2020-01-08         NaN
2020-01-10   -0.252095
2020-01-12         NaN
dtype: float64
pd.Series([1,2,3])+pd.Series([3,4])
0    4.0
1    6.0
2    NaN
dtype: float64

pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution.

ts.index.dtype
dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

stamp=ts.index[0];stamp
Timestamp('2020-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations.
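A brief sketch of that interchangeability (the values here are my own):

```python
import pandas as pd
from datetime import datetime, timedelta

# A Timestamp compares and does arithmetic like a plain datetime...
stamp = pd.Timestamp('2020-01-02')
print(stamp == datetime(2020, 1, 2))   # True
print(stamp + timedelta(days=1))       # 2020-01-03 00:00:00

# ...and additionally knows about time zones.
print(stamp.tz_localize('UTC'))
```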

Indexing, selection, subsetting

ts
2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

A time series behaves like any other pandas Series when you are indexing and selecting data based on label:

stamp=ts.index[2]
ts[stamp]
-0.04616414830843706

As a convenience, you can also pass a string that is interpretable as a date:

ts['1/10/2020']
-0.12604738036158042
ts['2020-1-10']
-0.12604738036158042

For longer time series, a year or only a year and month can be passed to easily select slices of data:

help(pd.date_range)
Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
    Return a fixed frequency DatetimeIndex.
    
    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates.
    end : str or datetime-like, optional
        Right bound for generating dates.
    periods : integer, optional
        Number of periods to generate.
    freq : str or DateOffset, default 'D'
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of
        frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    rng : DatetimeIndex
    
    See Also
    --------
    DatetimeIndex : An immutable container for datetimes.
    timedelta_range : Return a fixed frequency TimedeltaIndex.
    period_range : Return a fixed frequency PeriodIndex.
    interval_range : Return a fixed frequency IntervalIndex.
    
    Notes
    -----
    Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified. If ``freq`` is omitted, the resulting
    ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
    ``start`` and ``end`` (closed on both sides).
    
    To learn more about the frequency strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    **Specifying the values**
    
    The next four examples generate the same `DatetimeIndex`, but vary
    the combination of `start`, `end` and `periods`.
    
    Specify `start` and `end`, with the default daily frequency.
    
    >>> pd.date_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start` and `periods`, the number of periods (days).
    
    >>> pd.date_range(start='1/1/2018', periods=8)
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `end` and `periods`, the number of periods (days).
    
    >>> pd.date_range(end='1/1/2018', periods=8)
    DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                   '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start`, `end`, and `periods`; the frequency is generated
    automatically (linearly spaced).
    
    >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
    DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                   '2018-04-27 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    
    **Other Parameters**
    
    Changed the `freq` (frequency) to ``'M'`` (month end frequency).
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
    DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                   '2018-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    Multiples are allowed
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    `freq` can also be specified as an Offset object.
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    Specify `tz` to set the timezone.
    
    >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
    DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                   '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                   '2018-01-05 00:00:00+09:00'],
                  dtype='datetime64[ns, Asia/Tokyo]', freq='D')
    
    `closed` controls whether to include `start` and `end` that are on the
    boundary. The default includes boundary points on either end.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='left'`` to exclude `end` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='right'`` to exclude `start` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
    DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
pd.date_range('2020-05-01',periods=5)
DatetimeIndex(['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04',
               '2020-05-05'],
              dtype='datetime64[ns]', freq='D')
longer_ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
longer_ts['2001']
2001-01-01   -2.081715
2001-01-02    1.425891
2001-01-03    0.314430
2001-01-04    0.153332
2001-01-05    0.282888
                ...   
2001-12-27   -0.299994
2001-12-28    1.852834
2001-12-29    1.847192
2001-12-30    0.592563
2001-12-31    0.519122
Freq: D, Length: 365, dtype: float64

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

longer_ts['2001-05']
2001-05-01   -0.408720
2001-05-02   -0.958682
2001-05-03   -0.424746
2001-05-04    0.771404
2001-05-05    1.959182
2001-05-06    0.287984
2001-05-07   -0.199789
2001-05-08   -0.369938
2001-05-09    0.309950
2001-05-10   -1.649661
2001-05-11    0.119676
2001-05-12    0.205413
2001-05-13    0.416938
2001-05-14   -0.305450
2001-05-15   -0.126385
2001-05-16    1.665036
2001-05-17    0.627492
2001-05-18   -1.317637
2001-05-19   -2.734170
2001-05-20   -0.163745
2001-05-21   -0.784528
2001-05-22    0.151304
2001-05-23    0.583916
2001-05-24    0.571195
2001-05-25   -1.498402
2001-05-26   -1.485187
2001-05-27    0.411882
2001-05-28    0.323999
2001-05-29    0.627545
2001-05-30   -2.054165
2001-05-31   -1.493494
Freq: D, dtype: float64
longer_ts['2000-05-01']
-0.08042675710861961

Slicing with datetime objects works as well:

ts[datetime(2020,1,7)]
-0.04616414830843706

Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

ts
2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64
ts['2020-1-6':'2020-1-14'] # note that '2020-1-6' and '2020-1-14' are not in ts
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

As before, you can pass either a string date, datetime, or Timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays. This means that no data is copied, and modifications on the slice will be reflected in the original data.
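A cautionary sketch of that view behavior (my own toy series; note that recent pandas versions with copy-on-write enabled no longer write through slices, so an explicit .copy() is the portable way to get an independent piece):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.0), index=pd.date_range('2020-01-01', periods=4))

# Take an explicit copy before modifying; this is safe on every pandas version.
piece = s['2020-01-02':'2020-01-03'].copy()
piece[:] = 0.0
print(s['2020-01-02'])   # 1.0 -- the source series is untouched
```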

There is an equivalent instance method, truncate, that slices a Series between two dates:

ts.truncate(after='1/9/2020')
2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
dtype: float64

All of this holds true for DataFrame as well, indexing on its rows:

dates=pd.date_range('1/1/2000',periods=100,freq='w-wed')
long_df=pd.DataFrame(np.random.randn(100,4),index=dates,columns=['Colorado','Texas','New York','Ohio']) # note: the objects discussed above were Series; here we switch to a DataFrame
long_df.loc['5-2001']
            Colorado     Texas  New York      Ohio
2001-05-02 -0.157680 -0.398869  0.399008 -0.109813
2001-05-09 -0.475721 -0.550544  0.406308  0.290822
2001-05-16 -1.101315  2.469916 -0.062604 -0.409562
2001-05-23  0.972402  1.035692 -0.594960  1.255631
2001-05-30 -1.823161 -0.327003 -0.294791  0.795953

Time series with duplicate indices

In some applications, there may be multiple data observations falling on a particular timestamp.

dates=pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
dup_ts=pd.Series(np.arange(5),index=dates)
dup_ts
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
dup_ts.index.is_unique
False

Indexing into this time series will now either produce scalar values or slices, depending on whether a timestamp is duplicated:

dup_ts['1/3/2000']
4
dup_ts['1/2/2000']
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:

grouped=dup_ts.groupby(level=0)
grouped.mean()
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
grouped.count()
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64

Date ranges, frequencies, and shifting

Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it's often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into the time series. Fortunately, pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.

ts
2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64
resampler=ts.resample('D');resampler
<pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20270320>

The string 'D' is interpreted as a daily frequency. Conversion between frequencies, or resampling, is a big enough topic to have its own section.
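Nothing is computed until you call a method on the resampler. As a small sketch (my own data), asfreq materializes the daily grid, introducing NaN for the missing days:

```python
import numpy as np
import pandas as pd
from datetime import datetime

obs = pd.Series(np.arange(3.0),
                index=[datetime(2020, 1, 2), datetime(2020, 1, 5), datetime(2020, 1, 7)])

# Conform the irregular series to a daily grid; absent days become NaN.
daily = obs.resample('D').asfreq()
print(len(daily))           # 6 (2020-01-02 through 2020-01-07)
print(daily.isna().sum())   # 3 missing days
```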

Generating date ranges

pd.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency.

index=pd.date_range('2012-04-01','2012-06-01')
index
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

By default, date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:

pd.date_range(start='2012-04-01',periods=20)
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')
pd.date_range(end='2012-06-1',periods=20)
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business month end), and only dates falling on or inside the date interval will be included:

pd.date_range('2020-04-01','2020-07-01',freq='BM') #'BM'-->BusinessMonthEnd
DatetimeIndex(['2020-04-30', '2020-05-29', '2020-06-30'], dtype='datetime64[ns]', freq='BM')
pd.date_range('2020-05-01',periods=10,freq='w-fri') # 'w-mon,w-tue...'--->Weekly on given day of week
DatetimeIndex(['2020-05-01', '2020-05-08', '2020-05-15', '2020-05-22',
               '2020-05-29', '2020-06-05', '2020-06-12', '2020-06-19',
               '2020-06-26', '2020-07-03'],
              dtype='datetime64[ns]', freq='W-FRI')
pd.date_range('2020-05-01',periods=3,freq='b') # 'b'--> BusinessDay
DatetimeIndex(['2020-05-01', '2020-05-04', '2020-05-05'], dtype='datetime64[ns]', freq='B')
pd.date_range('2020-05-01',periods=4,freq='q-feb')# quarterly dates anchored on the last calendar day of each quarter, for a year ending in February
DatetimeIndex(['2020-05-31', '2020-08-31', '2020-11-30', '2021-02-28'], dtype='datetime64[ns]', freq='Q-FEB')
pd.date_range('2020-05-01',periods=4,freq='q-mar')# same, but for a year ending in March
DatetimeIndex(['2020-06-30', '2020-09-30', '2020-12-31', '2021-03-31'], dtype='datetime64[ns]', freq='Q-MAR')

date_range by default preserves the time (if any) of the start or end timestamp.

pd.date_range('2012-05-02 12:56:31',periods=5)
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:

pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

Frequencies and date offsets

Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object defined, generally referred to as a date offset.

from pandas.tseries.offsets import Hour,Minute
hour=Hour()
hour
<Hour>
four_hour=Hour(4)
four_hour
<4 * Hours>

In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:

pd.date_range('2000-01-01','2000-01-03 23:59',freq='4H')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

Many offsets can be combined by addition:

Hour(2)+Minute(30)
<150 * Minutes>

Similarly, you can pass frequency strings, like '1h30min', that will effectively be parsed to the same expression:

pd.date_range('2000-01-01',periods=10,freq='1h30min')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. We refer to these as anchored offsets.
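A sketch of the difference, using the offset objects equivalent to the 'M' and 'BM' aliases (May 31, 2020 falls on a weekend, so the two disagree for that month):

```python
import pandas as pd

# MonthEnd ~ 'M' (calendar month end); BMonthEnd ~ 'BM' (last weekday of month).
month_ends  = pd.date_range('2020-04-01', periods=2, freq=pd.offsets.MonthEnd())
bmonth_ends = pd.date_range('2020-04-01', periods=2, freq=pd.offsets.BMonthEnd())
print(month_ends[1].date())    # 2020-05-31 (a Sunday)
print(bmonth_ends[1].date())   # 2020-05-29 (the preceding Friday)
```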

Week of month dates

One useful frequency class is 'week of month', starting with WOM. This enables you to get dates like the third Friday of each month:

rng=pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')
list(rng)
[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]

Shifting (leading and lagging) data

Shifting means moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:

ts=pd.Series(np.random.randn(4),index=pd.date_range('1/1/2020',periods=4,freq='M'))
help(ts.shift)
Help on method shift in module pandas.core.series:

shift(periods=1, freq=None, axis=0, fill_value=None) method of pandas.core.series.Series instance
    Shift index by desired number of periods with an optional time `freq`.
    
    When `freq` is not passed, shift the index without realigning the data.
    If `freq` is passed (in this case, the index must be date or datetime,
    or it will raise a `NotImplementedError`), the index will be
    increased using the periods and the `freq`.
    
    Parameters
    ----------
    periods : int
        Number of periods to shift. Can be positive or negative.
    freq : DateOffset, tseries.offsets, timedelta, or str, optional
        Offset to use from the tseries module or time rule (e.g. 'EOM').
        If `freq` is specified then the index values are shifted but the
        data is not realigned. That is, use `freq` if you would like to
        extend the index when shifting and preserve the original data.
    axis : {0 or 'index', 1 or 'columns', None}, default None
        Shift direction.
    fill_value : object, optional
        The scalar value to use for newly introduced missing values.
        the default depends on the dtype of `self`.
        For numeric data, ``np.nan`` is used.
        For datetime, timedelta, or period data, etc. :attr:`NaT` is used.
        For extension dtypes, ``self.dtype.na_value`` is used.
    
        .. versionchanged:: 0.24.0
    
    Returns
    -------
    Series
        Copy of input object, shifted.
    
    See Also
    --------
    Index.shift : Shift values of Index.
    DatetimeIndex.shift : Shift values of DatetimeIndex.
    PeriodIndex.shift : Shift values of PeriodIndex.
    tshift : Shift the time index, using the index's frequency if
        available.
    
    Examples
    --------
    >>> df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
    ...                    'Col2': [13, 23, 18, 33, 48],
    ...                    'Col3': [17, 27, 22, 37, 52]})
    
    >>> df.shift(periods=3)
       Col1  Col2  Col3
    0   NaN   NaN   NaN
    1   NaN   NaN   NaN
    2   NaN   NaN   NaN
    3  10.0  13.0  17.0
    4  20.0  23.0  27.0
    
    >>> df.shift(periods=1, axis='columns')
       Col1  Col2  Col3
    0   NaN  10.0  13.0
    1   NaN  20.0  23.0
    2   NaN  15.0  18.0
    3   NaN  30.0  33.0
    4   NaN  45.0  48.0
    
    >>> df.shift(periods=3, fill_value=0)
       Col1  Col2  Col3
    0     0     0     0
    1     0     0     0
    2     0     0     0
    3    10    13    17
    4    20    23    27
ts
2020-01-31    0.225376
2020-02-29   -0.024710
2020-03-31    0.117686
2020-04-30    1.513727
Freq: M, dtype: float64
ts.shift(2)
2020-01-31         NaN
2020-02-29         NaN
2020-03-31    0.225376
2020-04-30   -0.024710
Freq: M, dtype: float64
ts.shift(-2)
2020-01-31    0.117686
2020-02-29    1.513727
2020-03-31         NaN
2020-04-30         NaN
Freq: M, dtype: float64

When we shift like this, missing data is introduced either at the start or the end of the time series.

A common use of shift is computing percent changes in a time series, or in multiple time series as DataFrame columns.
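For example (a made-up price series), dividing by the shifted series gives period-over-period returns:

```python
import pandas as pd

# Hypothetical prices; the first return is NaN since there is no predecessor.
prices = pd.Series([100.0, 102.0, 99.0],
                   index=pd.date_range('2020-01-01', periods=3))
returns = prices / prices.shift(1) - 1
print(returns)
```

pandas also offers prices.pct_change(), which computes the same quantity directly.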

Because a naive shift leaves the index unmodified, some data is discarded. Thus, if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:

ts.shift(2,freq='M') 
2020-03-31    0.225376
2020-04-30   -0.024710
2020-05-31    0.117686
2020-06-30    1.513727
Freq: M, dtype: float64
ts
2020-01-31    0.225376
2020-02-29   -0.024710
2020-03-31    0.117686
2020-04-30    1.513727
Freq: M, dtype: float64

Other frequencies can be passed too, giving you some flexibility in how to lead and lag the data:

ts.shift(3,freq='D') # shift every timestamp in ts forward 3 days
2020-02-03    0.225376
2020-03-03   -0.024710
2020-04-03    0.117686
2020-05-03    1.513727
dtype: float64
ts.shift(1,freq='90T')  # shift every timestamp in ts forward by 90 minutes ('90T')
2020-01-31 01:30:00    0.225376
2020-02-29 01:30:00   -0.024710
2020-03-31 01:30:00    0.117686
2020-04-30 01:30:00    1.513727
Freq: M, dtype: float64

Shift dates with offsets

The pandas date offsets can also be used with datetime or Timestamp objects:

from pandas.tseries.offsets import Day,MonthEnd
now=datetime(2020,11,17)
now+3*Day()
Timestamp('2020-11-20 00:00:00')
now+MonthEnd(2)
Timestamp('2020-12-31 00:00:00')

Anchored offsets can explicitly 'roll' dates forward or backward using their rollforward and rollback methods, respectively:

offset=MonthEnd()
offset.rollforward(now)
Timestamp('2020-11-30 00:00:00')
offset.rollback(now)
Timestamp('2020-10-31 00:00:00')

A creative use of date offsets is to use these methods with groupby:

ts=pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))


list(ts.groupby(offset.rollforward))
[(Timestamp('2000-01-31 00:00:00'),
  2000-01-15   -0.209800
  2000-01-19   -2.189881
  2000-01-23   -1.779681
  2000-01-27    0.437441
  2000-01-31    1.054685
  Freq: 4D, dtype: float64),
 (Timestamp('2000-02-29 00:00:00'),
  2000-02-04   -0.506648
  2000-02-08    0.484109
  2000-02-12   -0.385587
  2000-02-16   -0.732983
  2000-02-20   -1.459167
  2000-02-24   -1.133808
  2000-02-28    0.097860
  Freq: 4D, dtype: float64),
 (Timestamp('2000-03-31 00:00:00'),
  2000-03-03    0.480492
  2000-03-07    1.040105
  2000-03-11    0.634999
  2000-03-15    0.621187
  2000-03-19   -1.410100
  2000-03-23    0.319765
  2000-03-27   -1.079803
  2000-03-31   -1.292514
  Freq: 4D, dtype: float64)]
ts.groupby(offset.rollforward).mean()
2000-01-31   -0.537447
2000-02-29   -0.519461
2000-03-31   -0.085734
dtype: float64

Of course, an easier and faster way to do this is to use resample:

ts.resample('M').mean()
2000-01-31   -0.537447
2000-02-29   -0.519461
2000-03-31   -0.085734
Freq: M, dtype: float64

Time zone handling

Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in coordinated universal time, or UTC, which is the successor to Greenwich Mean Time and is the current international standard. Time zones are expressed as offsets from UTC.

In Python, time zone information comes from the third-party pytz library, which exposes the Olson database, a compilation of world time zone information. pandas wraps pytz's functionality so you can ignore its API outside of the time zone names. Time zone names can be found interactively and in the docs.

import pytz
pytz.common_timezones[-5:]
['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
len(pytz.common_timezones)
440

To get a time zone object from pytz,use pytz.timezone:

tz=pytz.timezone('America/New_York')
tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
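The LMT offset in the repr above hints at a pytz caveat (my addition, following pytz's documented usage): attach a pytz zone with tz.localize rather than passing it as tzinfo to the datetime constructor, since the constructor silently picks up the zone's first historical offset (LMT, -4:56 here) instead of the correct EST/EDT offset.

```python
from datetime import datetime

import pytz

tz = pytz.timezone('America/New_York')

# Correct: localize() chooses the right UTC offset (EDT here) for the wall time.
good = tz.localize(datetime(2020, 3, 9, 9, 0))
print(good)  # 2020-03-09 09:00:00-04:00

# Incorrect: tzinfo= uses the zone's historical LMT offset shown in the repr.
bad = datetime(2020, 3, 9, 9, 0, tzinfo=tz)
print(bad.utcoffset() == good.utcoffset())  # False
```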

Time zone localization and conversion

By default, time series in pandas are time zone naive. For example, consider the following time series:

rng=pd.date_range('3/9/2020 9:00',periods=6,freq='D')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts
2020-03-09 09:00:00   -0.384109
2020-03-10 09:00:00   -0.195272
2020-03-11 09:00:00   -0.473277
2020-03-12 09:00:00    1.430223
2020-03-13 09:00:00   -0.222399
2020-03-14 09:00:00   -0.844174
Freq: D, dtype: float64
print(ts.index.tz) # prints None, indicating that ts.index is time zone naive
None

The index's tz field is None. Date ranges can be generated with a time zone set:

pd.date_range('3/9/2020 9:30',periods=10,freq='D',tz='UTC')
DatetimeIndex(['2020-03-09 09:30:00+00:00', '2020-03-10 09:30:00+00:00',
               '2020-03-11 09:30:00+00:00', '2020-03-12 09:30:00+00:00',
               '2020-03-13 09:30:00+00:00', '2020-03-14 09:30:00+00:00',
               '2020-03-15 09:30:00+00:00', '2020-03-16 09:30:00+00:00',
               '2020-03-17 09:30:00+00:00', '2020-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Conversion from naive to localized is handled by tz_localize:

ts
2020-03-09 09:00:00   -0.384109
2020-03-10 09:00:00   -0.195272
2020-03-11 09:00:00   -0.473277
2020-03-12 09:00:00    1.430223
2020-03-13 09:00:00   -0.222399
2020-03-14 09:00:00   -0.844174
Freq: D, dtype: float64
ts_utc=ts.tz_localize('utc')
ts_utc
2020-03-09 09:00:00+00:00   -0.384109
2020-03-10 09:00:00+00:00   -0.195272
2020-03-11 09:00:00+00:00   -0.473277
2020-03-12 09:00:00+00:00    1.430223
2020-03-13 09:00:00+00:00   -0.222399
2020-03-14 09:00:00+00:00   -0.844174
Freq: D, dtype: float64
ts_utc.index # the dtype now carries the UTC time zone
DatetimeIndex(['2020-03-09 09:00:00+00:00', '2020-03-10 09:00:00+00:00',
               '2020-03-11 09:00:00+00:00', '2020-03-12 09:00:00+00:00',
               '2020-03-13 09:00:00+00:00', '2020-03-14 09:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Operations with time zone-aware Timestamp objects

Similar to time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:

stamp=pd.Timestamp('2011-03-12 04:00')
print(stamp.tz)
None
stamp_utc=stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
ts.index.tz_localize('utc').tz_convert('America/New_York') # a naive index must be localized first; then it can be converted to another time zone
DatetimeIndex(['2020-03-09 05:00:00-04:00', '2020-03-10 05:00:00-04:00',
               '2020-03-11 05:00:00-04:00', '2020-03-12 05:00:00-04:00',
               '2020-03-13 05:00:00-04:00', '2020-03-14 05:00:00-04:00'],
              dtype='datetime64[ns, America/New_York]', freq='D')
ts.index.tz_localize('utc').tz
<UTC>

You can also pass a time zone when creating the Timestamp:

stamp_moscow=pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')
stamp_moscow.tz
<DstTzInfo 'Europe/Moscow' MSK+3:00:00 STD>

Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:

stamp_moscow.value
1299891600000000000
stamp_moscow.tz_convert('America/New_York')
Timestamp('2011-03-11 20:00:00-0500', tz='America/New_York')
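That invariance is easy to verify directly (a small check added here, not in the original text): converting a time zone-aware Timestamp changes its display but not the underlying epoch nanoseconds, and the two objects still compare equal as instants.

```python
import pandas as pd

stamp = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
converted = stamp.tz_convert('America/New_York')

# Same epoch nanoseconds, same instant in time:
print(stamp.value == converted.value)  # True
print(stamp == converted)              # True
```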

When performing time arithmetic using pandas's DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur right before DST transitions. First, 30 minutes before transitioning to DST:

from pandas.tseries.offsets import Hour
stamp=pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')
stamp
Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
stamp+Hour()
Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')

Then, 90 minutes before transitioning out of DST:

stamp=pd.Timestamp('2012-10-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-10-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour()  # no DST transition in early October; the clock simply advances to 02:30
Timestamp('2012-10-04 02:30:00-0400', tz='US/Eastern')
stamp=pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')
stamp
Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
stamp+2*Hour()  # crosses the fall-back transition, so the wall clock reads 01:30 (-0500)
Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

Operations between different time zones

If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion:

rng=pd.date_range('3/7/2012 9:30',periods=10,freq='B')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts1=ts[:7].tz_localize('Europe/London')
ts2=ts1[2:].tz_convert('Europe/Moscow')
result=ts1+ts2
result.index
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='B')

Periods and period arithmetic

Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer and a frequency.

p=pd.Period(2007,freq='A-DEC') # annual dates anchored on the last calendar day of the given month (YearEnd)
p
Period('2007', 'A-DEC')

In this case, the Period object represents the full timespan from January 1, 2007, to December 31, 2007, inclusive. Conveniently, adding integers to and subtracting them from periods has the effect of shifting by their frequency:

p+5
Period('2012', 'A-DEC')
p-2
Period('2005', 'A-DEC')

If two periods have the same frequency, their difference is the number of units between them:

pd.Period('2014','A-DEC')-p
<7 * YearEnds: month=12>
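As the output shows, the difference comes back as a DateOffset rather than a plain integer; the unit count can be read off the offset's n attribute (a small sketch added here):

```python
import pandas as pd

p = pd.Period(2007, freq='A-DEC')
diff = pd.Period('2014', 'A-DEC') - p  # a YearEnd offset, not an int

# The number of units lives on the offset's .n attribute:
print(diff.n)  # 7

# Adding the offset back recovers the later period:
print(p + diff)  # Period('2014', 'A-DEC')
```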

Regular ranges of periods can be constructed with the period_range function:

rng=pd.period_range('2000-01-01','2000-06-30',freq='M')
rng
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

pd.Series(np.random.randn(6),index=rng)
2000-01    1.337147
2000-02   -0.201512
2000-03   -0.261829
2000-04    0.124229
2000-05   -0.723703
2000-06   -2.130917
Freq: M, dtype: float64

If you have an array of strings,you can also use the PeriodIndex class:

values=['2001Q3','2002Q2','2003Q1']
index=pd.PeriodIndex(values,freq='Q-DEC') # Quarterly dates anchored on last calendar day of each month,for year ending in indicated month.
index
PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

Period frequency conversion

Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This is fairly straightforward:

p=pd.Period('2007',freq='A-DEC') # Annual period,ending with the last day of 2007-12
p
Period('2007', 'A-DEC')
p.asfreq('M',how='start')
Period('2007-01', 'M')
p.asfreq('M',how='end')
Period('2007-12', 'M')

You can think of Period('2007','A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods.

p=pd.Period('2007','A-JUN') #Annual period,ending with the last day of 2007-06
p.asfreq('M','start')
Period('2006-07', 'M')
p.asfreq('M','end')
Period('2007-06', 'M')

Whole PeriodIndex objects or time series can be similarly converted with the same semantics:

rng=pd.period_range('2006','2009',freq='A-DEC')
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts
2006   -0.297145
2007   -0.304496
2008    0.705818
2009   -1.829369
Freq: A-DEC, dtype: float64
ts.asfreq('M',how='start')
2006-01   -0.297145
2007-01   -0.304496
2008-01    0.705818
2009-01   -1.829369
Freq: M, dtype: float64

Here, the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period. If we instead wanted the last business day of each year, we can use the 'B' frequency and indicate that we want the end of the period:

ts.asfreq('B',how='end')
2006-12-29   -0.297145
2007-12-31   -0.304496
2008-12-31    0.705818
2009-12-31   -1.829369
Freq: B, dtype: float64

Quarterly period frequencies

Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on fiscal year end. pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC.

p=pd.Period('2012Q4',freq='Q-JAN')
p
Period('2012Q4', 'Q-JAN')

In the case of a fiscal year ending in January, 2012Q4 runs from November through January, which you can check by converting to daily frequency:

p.asfreq('D','start') #check p by converting to daily frequency.
Period('2011-11-01', 'D')
p.asfreq('D','end')
Period('2012-01-31', 'D')

Thus, it's possible to do easy period arithmetic; for example, to get the timestamp at 4 PM on the second-to-last business day of the quarter:

p4pm=(p.asfreq('B','e')-1).asfreq('T','s')+16*60  #'T' means 'Minute'
p4pm
Period('2012-01-30 16:00', 'T')
p4pm.to_timestamp()
Timestamp('2012-01-30 16:00:00')
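The chained expression reads more easily when unpacked step by step; this is the same arithmetic as above, just spread across intermediate names (the names are my own):

```python
import pandas as pd

p = pd.Period('2012Q4', freq='Q-JAN')

last_bday = p.asfreq('B', 'e')               # last business day of the quarter
second_to_last = last_bday - 1               # one business day earlier
day_start = second_to_last.asfreq('T', 's')  # that day at minute resolution, 00:00
p4pm = day_start + 16 * 60                   # advance 16 hours' worth of minutes
print(p4pm.to_timestamp())                   # 2012-01-30 16:00:00
```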

You can generate quarterly ranges using period_range:

rng=pd.period_range('2011Q3','2012Q4',freq='Q-JAN')
ts=pd.Series(np.arange(len(rng)),index=rng)
ts
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
new_rng=(rng.asfreq('B','e')-1).asfreq('T','s')+16*60
ts.index=new_rng.to_timestamp()
ts
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

Converting Timestamps to Periods (and Back)

Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:

rng=pd.date_range('2000-01-01',periods=3,freq='M')
ts=pd.Series(np.random.randn(3),index=rng);ts
2000-01-31   -0.115457
2000-02-29   -0.318769
2000-03-31    0.166398
Freq: M, dtype: float64
pts=ts.to_period();pts
2000-01   -0.115457
2000-02   -0.318769
2000-03    0.166398
Freq: M, dtype: float64
type(pts)
pandas.core.series.Series

Since periods refer to non-overlapping timespans, a timestamp can only belong to a single period for a given frequency. While the frequency of the new PeriodIndex is inferred from the timestamps by default, you can specify any frequency you want:

rng=pd.date_range('1/29/2000',periods=6,freq='D')
ts2=pd.Series(np.random.randn(6),index=rng)
ts2
2000-01-29    0.511537
2000-01-30    2.661260
2000-01-31    0.954388
2000-02-01   -0.903825
2000-02-02   -0.399345
2000-02-03    1.160727
Freq: D, dtype: float64
ts2.to_period('M')
2000-01    0.511537
2000-01    2.661260
2000-01    0.954388
2000-02   -0.903825
2000-02   -0.399345
2000-02    1.160727
Freq: M, dtype: float64
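Note that the resulting PeriodIndex contains repeated labels, since several daily timestamps map into each month. The repeated periods can still be aggregated, for example with groupby (a sketch added here; the values are my own, chosen for easy checking):

```python
import pandas as pd
import numpy as np

rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = pd.Series(np.arange(6), index=rng)

monthly = ts2.to_period('M')  # index: 2000-01 three times, 2000-02 three times
print(monthly.groupby(level=0).sum())
# 2000-01     3   (0 + 1 + 2)
# 2000-02    12   (3 + 4 + 5)
```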

To convert back to timestamps,use to_timestamp:

pts=ts2.to_period()
pts.to_timestamp(how='end')
2000-01-29 23:59:59.999999999    0.511537
2000-01-30 23:59:59.999999999    2.661260
2000-01-31 23:59:59.999999999    0.954388
2000-02-01 23:59:59.999999999   -0.903825
2000-02-02 23:59:59.999999999   -0.399345
2000-02-03 23:59:59.999999999    1.160727
Freq: D, dtype: float64

Creating a PeriodIndex from Arrays

data=pd.read_csv('./pydata-book-2nd-edition/examples/macrodata.csv')
data.head(5)
     year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi     m1  tbilrate  unemp      pop  infl  realint
0  1959.0      1.0  2710.349    1707.4  286.898   470.045   1886.9  28.98  139.7      2.82    5.8  177.146  0.00     0.00
1  1959.0      2.0  2778.801    1733.7  310.859   481.301   1919.7  29.15  141.7      3.08    5.1  177.830  2.34     0.74
2  1959.0      3.0  2775.488    1751.8  289.226   491.260   1916.4  29.35  140.5      3.82    5.3  178.657  2.74     1.09
3  1959.0      4.0  2785.204    1753.7  299.356   484.052   1931.3  29.37  140.0      4.33    5.6  179.386  0.27     4.06
4  1960.0      1.0  2847.699    1770.5  331.722   462.199   1955.5  29.54  139.6      3.50    5.2  180.007  2.31     1.19
data.year
0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
        ...  
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64
data.quarter
0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
      ... 
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64

By passing these arrays to PeriodIndex with a frequency, you can combine them to form an index for the DataFrame:

index=pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')
index
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')
data.index=index

data.infl
1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
          ... 
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
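The same construction works independently of the CSV; a self-contained sketch with toy year/quarter arrays (the values here are my own):

```python
import pandas as pd
import numpy as np

years = [1959, 1959, 1959, 1959, 1960]
quarters = [1, 2, 3, 4, 1]

# Combine the two field arrays into one quarterly PeriodIndex
idx = pd.PeriodIndex(year=years, quarter=quarters, freq='Q-DEC')
s = pd.Series(np.arange(len(idx)), index=idx)
print(s)
# 1959Q1    0
# ...
# 1960Q1    4

print(s['1959Q4'])  # 3
```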

Resampling and frequency conversion

Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to a lower frequency is called downsampling, while converting a lower frequency to a higher frequency is called upsampling. Not all resampling falls into either of these categories.

pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has a similar API to groupby; you can call resample to group the data, then call an aggregation function:

rng=pd.date_range('2000-01-01',periods=5,freq='D')
ts=pd.Series(np.arange(5),index=rng)
ts.head()
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
Freq: D, dtype: int32
ts.resample('M').mean()
2000-01-31    2
Freq: M, dtype: int32
ts.resample('M',kind='period').mean()
2000-01    2
Freq: M, dtype: int32
help(pd.Series.resample)
Help on function resample in module pandas.core.generic:

resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
    Resample time-series data.
    
    Convenience method for frequency conversion and resampling of time
    series. Object must have a datetime-like index (`DatetimeIndex`,
    `PeriodIndex`, or `TimedeltaIndex`), or pass datetime-like values
    to the `on` or `level` keyword.
    
    Parameters
    ----------
    rule : DateOffset, Timedelta or str
        The offset string or object representing target conversion.
    how : str
        Method for down/re-sampling, default to 'mean' for downsampling.
    
        .. deprecated:: 0.18.0
           The new syntax is ``.resample(...).mean()``, or
           ``.resample(...).apply(<func>)``
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Which axis to use for up- or down-sampling. For `Series` this
        will default to 0, i.e. along the rows. Must be
        `DatetimeIndex`, `TimedeltaIndex` or `PeriodIndex`.
    fill_method : str, default None
        Filling method for upsampling.
    
        .. deprecated:: 0.18.0
           The new syntax is ``.resample(...).<func>()``,
           e.g. ``.resample(...).pad()``
    closed : {'right', 'left'}, default None
        Which side of bin interval is closed. The default is 'left'
        for all frequency offsets except for 'M', 'A', 'Q', 'BM',
        'BA', 'BQ', and 'W' which all have a default of 'right'.
    label : {'right', 'left'}, default None
        Which bin edge label to label bucket with. The default is 'left'
        for all frequency offsets except for 'M', 'A', 'Q', 'BM',
        'BA', 'BQ', and 'W' which all have a default of 'right'.
    convention : {'start', 'end', 's', 'e'}, default 'start'
        For `PeriodIndex` only, controls whether to use the start or
        end of `rule`.
    kind : {'timestamp', 'period'}, optional, default None
        Pass 'timestamp' to convert the resulting index to a
        `DateTimeIndex` or 'period' to convert it to a `PeriodIndex`.
        By default the input representation is retained.
    loffset : timedelta, default None
        Adjust the resampled time labels.
    limit : int, default None
        Maximum size gap when reindexing with `fill_method`.
    
        .. deprecated:: 0.18.0
    base : int, default 0
        For frequencies that evenly subdivide 1 day, the "origin" of the
        aggregated intervals. For example, for '5min' frequency, base could
        range from 0 through 4. Defaults to 0.
    on : str, optional
        For a DataFrame, column to use instead of index for resampling.
        Column must be datetime-like.
    
        .. versionadded:: 0.19.0
    
    level : str or int, optional
        For a MultiIndex, level (name or number) to use for
        resampling. `level` must be datetime-like.
    
        .. versionadded:: 0.19.0
    
    Returns
    -------
    Resampler object
    
    See Also
    --------
    groupby : Group by mapping, function, label, or list of labels.
    Series.resample : Resample a Series.
    DataFrame.resample: Resample a DataFrame.
    
    Notes
    -----
    See the `user guide
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling>`_
    for more.
    
    To learn more about the offset strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects>`__.
    
    Examples
    --------
    
    Start by creating a series with 9 one minute timestamps.
    
    >>> index = pd.date_range('1/1/2000', periods=9, freq='T')
    >>> series = pd.Series(range(9), index=index)
    >>> series
    2000-01-01 00:00:00    0
    2000-01-01 00:01:00    1
    2000-01-01 00:02:00    2
    2000-01-01 00:03:00    3
    2000-01-01 00:04:00    4
    2000-01-01 00:05:00    5
    2000-01-01 00:06:00    6
    2000-01-01 00:07:00    7
    2000-01-01 00:08:00    8
    Freq: T, dtype: int64
    
    Downsample the series into 3 minute bins and sum the values
    of the timestamps falling into a bin.
    
    >>> series.resample('3T').sum()
    2000-01-01 00:00:00     3
    2000-01-01 00:03:00    12
    2000-01-01 00:06:00    21
    Freq: 3T, dtype: int64
    
    Downsample the series into 3 minute bins as above, but label each
    bin using the right edge instead of the left. Please note that the
    value in the bucket used as the label is not included in the bucket,
    which it labels. For example, in the original series the
    bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed
    value in the resampled bucket with the label ``2000-01-01 00:03:00``
    does not include 3 (if it did, the summed value would be 6, not 3).
    To include this value close the right side of the bin interval as
    illustrated in the example below this one.
    
    >>> series.resample('3T', label='right').sum()
    2000-01-01 00:03:00     3
    2000-01-01 00:06:00    12
    2000-01-01 00:09:00    21
    Freq: 3T, dtype: int64
    
    Downsample the series into 3 minute bins as above, but close the right
    side of the bin interval.
    
    >>> series.resample('3T', label='right', closed='right').sum()
    2000-01-01 00:00:00     0
    2000-01-01 00:03:00     6
    2000-01-01 00:06:00    15
    2000-01-01 00:09:00    15
    Freq: 3T, dtype: int64
    
    Upsample the series into 30 second bins.
    
    >>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
    2000-01-01 00:00:00   0.0
    2000-01-01 00:00:30   NaN
    2000-01-01 00:01:00   1.0
    2000-01-01 00:01:30   NaN
    2000-01-01 00:02:00   2.0
    Freq: 30S, dtype: float64
    
    Upsample the series into 30 second bins and fill the ``NaN``
    values using the ``pad`` method.
    
    >>> series.resample('30S').pad()[0:5]
    2000-01-01 00:00:00    0
    2000-01-01 00:00:30    0
    2000-01-01 00:01:00    1
    2000-01-01 00:01:30    1
    2000-01-01 00:02:00    2
    Freq: 30S, dtype: int64
    
    Upsample the series into 30 second bins and fill the
    ``NaN`` values using the ``bfill`` method.
    
    >>> series.resample('30S').bfill()[0:5]
    2000-01-01 00:00:00    0
    2000-01-01 00:00:30    1
    2000-01-01 00:01:00    1
    2000-01-01 00:01:30    2
    2000-01-01 00:02:00    2
    Freq: 30S, dtype: int64
    
    Pass a custom function via ``apply``
    
    >>> def custom_resampler(array_like):
    ...     return np.sum(array_like) + 5
    ...
    >>> series.resample('3T').apply(custom_resampler)
    2000-01-01 00:00:00     8
    2000-01-01 00:03:00    17
    2000-01-01 00:06:00    26
    Freq: 3T, dtype: int64
    
    For a Series with a PeriodIndex, the keyword `convention` can be
    used to control whether to use the start or end of `rule`.
    
    Resample a year by quarter using 'start' `convention`. Values are
    assigned to the first quarter of the period.
    
    >>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
    ...                                             freq='A',
    ...                                             periods=2))
    >>> s
    2012    1
    2013    2
    Freq: A-DEC, dtype: int64
    >>> s.resample('Q', convention='start').asfreq()
    2012Q1    1.0
    2012Q2    NaN
    2012Q3    NaN
    2012Q4    NaN
    2013Q1    2.0
    2013Q2    NaN
    2013Q3    NaN
    2013Q4    NaN
    Freq: Q-DEC, dtype: float64
    
    Resample quarters by month using 'end' `convention`. Values are
    assigned to the last month of the period.
    
    >>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
    ...                                                   freq='Q',
    ...                                                   periods=4))
    >>> q
    2018Q1    1
    2018Q2    2
    2018Q3    3
    2018Q4    4
    Freq: Q-DEC, dtype: int64
    >>> q.resample('M', convention='end').asfreq()
    2018-03    1.0
    2018-04    NaN
    2018-05    NaN
    2018-06    2.0
    2018-07    NaN
    2018-08    NaN
    2018-09    3.0
    2018-10    NaN
    2018-11    NaN
    2018-12    4.0
    Freq: M, dtype: float64
    
    For DataFrame objects, the keyword `on` can be used to specify the
    column instead of the index for resampling.
    
    >>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
    ...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
    >>> df = pd.DataFrame(d)
    >>> df['week_starting'] = pd.date_range('01/01/2018',
    ...                                     periods=8,
    ...                                     freq='W')
    >>> df
       price  volume week_starting
    0     10      50    2018-01-07
    1     11      60    2018-01-14
    2      9      40    2018-01-21
    3     13     100    2018-01-28
    4     14      50    2018-02-04
    5     18     100    2018-02-11
    6     17      40    2018-02-18
    7     19      50    2018-02-25
    >>> df.resample('M', on='week_starting').mean()
                   price  volume
    week_starting
    2018-01-31     10.75    62.5
    2018-02-28     17.00    60.0
    
    For a DataFrame with MultiIndex, the keyword `level` can be used to
    specify on which level the resampling needs to take place.
    
    >>> days = pd.date_range('1/1/2000', periods=4, freq='D')
    >>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
    ...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
    >>> df2 = pd.DataFrame(d2,
    ...                    index=pd.MultiIndex.from_product([days,
    ...                                                     ['morning',
    ...                                                      'afternoon']]
    ...                                                     ))
    >>> df2
                          price  volume
    2000-01-01 morning       10      50
               afternoon     11      60
    2000-01-02 morning        9      40
               afternoon     13     100
    2000-01-03 morning       14      50
               afternoon     18     100
    2000-01-04 morning       17      40
               afternoon     19      50
    >>> df2.resample('D', level=0).sum()
                price  volume
    2000-01-01     21     110
    2000-01-02     22     140
    2000-01-03     32     150
    2000-01-04     36      90
help(pd.MultiIndex.from_product)
Help on method from_product in module pandas.core.indexes.multi:

from_product(iterables, sortorder=None, names=None) method of builtins.type instance
    Make a MultiIndex from the cartesian product of multiple iterables.
    
    Parameters
    ----------
    iterables : list / sequence of iterables
        Each iterable has unique labels for each level of the index.
    sortorder : int or None
        Level of sortedness (must be lexicographically sorted by that
        level).
    names : list / sequence of str, optional
        Names for the levels in the index.
    
    Returns
    -------
    index : MultiIndex
    
    See Also
    --------
    MultiIndex.from_arrays : Convert list of arrays to MultiIndex.
    MultiIndex.from_tuples : Convert list of tuples to MultiIndex.
    MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
    
    Examples
    --------
    >>> numbers = [0, 1, 2]
    >>> colors = ['green', 'purple']
    >>> pd.MultiIndex.from_product([numbers, colors],
    ...                            names=['number', 'color'])
    MultiIndex([(0,  'green'),
                (0, 'purple'),
                (1,  'green'),
                (1, 'purple'),
                (2,  'green'),
                (2, 'purple')],
               names=['number', 'color'])
tngt=pd.date_range('1/1/2020',periods=9,freq='T')
tngt
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00', '2020-01-01 00:03:00',
               '2020-01-01 00:04:00', '2020-01-01 00:05:00',
               '2020-01-01 00:06:00', '2020-01-01 00:07:00',
               '2020-01-01 00:08:00'],
              dtype='datetime64[ns]', freq='T')
sert=pd.Series(np.arange(9),index=tngt)
sert
2020-01-01 00:00:00    0
2020-01-01 00:01:00    1
2020-01-01 00:02:00    2
2020-01-01 00:03:00    3
2020-01-01 00:04:00    4
2020-01-01 00:05:00    5
2020-01-01 00:06:00    6
2020-01-01 00:07:00    7
2020-01-01 00:08:00    8
Freq: T, dtype: int32
sert.resample('30S')
<pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20423860>
help(pd.core.resample.DatetimeIndexResampler.asfreq)
Help on function asfreq in module pandas.core.resample:

asfreq(self, fill_value=None)
    Return the values at the new freq, essentially a reindex.
    
    Parameters
    ----------
    fill_value : scalar, optional
        Value to use for missing values, applied during upsampling (note
        this does not fill NaNs that already were present).
    
        .. versionadded:: 0.20.0
    
    Returns
    -------
    DataFrame or Series
        Values at the specified freq.
    
    See Also
    --------
    Series.asfreq
    DataFrame.asfreq
sert.resample('30S').asfreq()
2020-01-01 00:00:00    0.0
2020-01-01 00:00:30    NaN
2020-01-01 00:01:00    1.0
2020-01-01 00:01:30    NaN
2020-01-01 00:02:00    2.0
2020-01-01 00:02:30    NaN
2020-01-01 00:03:00    3.0
2020-01-01 00:03:30    NaN
2020-01-01 00:04:00    4.0
2020-01-01 00:04:30    NaN
2020-01-01 00:05:00    5.0
2020-01-01 00:05:30    NaN
2020-01-01 00:06:00    6.0
2020-01-01 00:06:30    NaN
2020-01-01 00:07:00    7.0
2020-01-01 00:07:30    NaN
2020-01-01 00:08:00    8.0
Freq: 30S, dtype: float64
ts.resample('M',kind='period').mean()
2000-01    2
Freq: M, dtype: int32
ts
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
Freq: D, dtype: int32
ts.resample('M',kind='timestamp').mean()
2000-01-31    2
Freq: M, dtype: int32
ts.resample('M').mean()
2000-01-31    2
Freq: M, dtype: int32

resample is a flexible and high-performance method that can be used to process very large time series.

The arguments and their meanings:

  • freq: string or DateOffset indicating the desired resampled frequency (e.g. 'M', '5min', or Second(15))
  • axis: axis to resample on; default axis=0
  • fill_method: how to interpolate when upsampling, as in 'ffill' or 'bfill'; by default does no interpolation
  • closed: in downsampling, which end of each interval is closed (inclusive), 'right' or 'left'
  • label: in downsampling, how to label the aggregated result, with the 'right' or 'left' bin edge (e.g. the 9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35)
  • loffset: time adjustment to the bin labels, such as '-1s'/Second(-1) to shift the aggregate labels one second earlier
  • limit: when forward or backward filling, the maximum number of periods to fill
  • kind: aggregate to periods ('period') or timestamps ('timestamp'); defaults to the type of index the time series has
  • convention: when resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency; defaults to 'start'

Downsampling

Aggregating data to a regular, lower frequency is a pretty normal time series task. The data you are aggregating does not need to be of fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data:

  • Which side of each interval is closed
  • How to label each aggregated bin,either with the start of the interval or the end
rng=pd.date_range('2000-01-01',periods=12,freq='T')
ts=pd.Series(np.arange(12),index=rng)
ts
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32
ts.resample('5min',closed='right').sum() # with closed='right', each bin has the form (left, right], so a timestamp on a bin edge falls in the bin it closes
#bins: (23:55,00:00],(00:00,00:05],(00:05,00:10],(00:10,00:15]
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32
ts.resample('5min',closed='right',label='right').sum() # The same bins as above, but labeled with the right edge
#(23:55,00:00],(00:00,00:05],(00:05,00:10],(00:10,00:15]
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32
ts.resample('5min',closed='left').sum() # [00:00,00:05),[00:05,00:10),[00:10,00:15); when closed='left',
# bins have the form [left, right)
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
ts.resample('5min',closed='left',label='right').sum() # The same bins as above, but labeled with the right edge.
2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
2000-01-01 00:15:00    21
Freq: 5T, dtype: int32

Lastly, you might want to shift the result index by some amount, say subtracting one second from the right edge, to make it clearer which interval each timestamp refers to. To do this, pass a string or date offset to loffset:

ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32

You could also have accomplished the effect of loffset by calling the shift method on the result, without using loffset:

ts.resample('5min',closed='right').sum().shift(1,'-1s')
1999-12-31 23:54:59     0
1999-12-31 23:59:59    15
2000-01-01 00:04:59    40
2000-01-01 00:09:59    11
Freq: 5T, dtype: int32
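The shift-based form is worth knowing because newer pandas versions deprecate and eventually remove loffset (an assumption about your pandas version); a minimal sketch using only resample and shift:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Right-closed, right-labeled 5-minute bins, then move each label back 1 second.
shifted = ts.resample('5min', closed='right', label='right').sum().shift(1, freq='-1s')
print(shifted.index[0])  # 1999-12-31 23:59:59
```

The values are untouched; only the index labels move, matching the loffset output above.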

Open-High-Low-Close(OHLC) resampling

In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. By using the ohlc aggregate function you obtain a DataFrame whose columns contain these four aggregates, which are efficiently computed in a single sweep of the data:

ts.resample('5min').ohlc()
                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11
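As a sanity check on ohlc, its four columns agree with the equivalent first/max/min/last aggregations computed separately; a small sketch using the same minute-frequency series:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

ohlc = ts.resample('5min').ohlc()
# 'open' is the first value in each bin, 'close' the last, and so on.
check = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
print(ohlc['open'].tolist())   # [0, 5, 10]
print(ohlc['close'].tolist())  # [4, 9, 11]
```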

Upsampling and interpolation

When converting from a low frequency to a higher frequency,no aggregation is needed.

frame=pd.DataFrame(np.random.randn(2,4),index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),columns=['Colorado','Texa','New York','Ohio'])
frame
             Colorado      Texa  New York      Ohio
2000-01-05  -0.739109  0.781223  0.570884 -1.616556
2000-01-12  -0.179902  2.531560 -2.658008  0.946870

When you use an aggregation function with this data, there is only one value per group, and missing values result in gaps. We use the asfreq method to convert to the higher frequency without any aggregation:

df_daily=frame.resample('D').asfreq()
df_daily
             Colorado      Texa  New York      Ohio
2000-01-05  -0.739109  0.781223  0.570884 -1.616556
2000-01-06        NaN       NaN       NaN       NaN
2000-01-07        NaN       NaN       NaN       NaN
2000-01-08        NaN       NaN       NaN       NaN
2000-01-09        NaN       NaN       NaN       NaN
2000-01-10        NaN       NaN       NaN       NaN
2000-01-11        NaN       NaN       NaN       NaN
2000-01-12  -0.179902  2.531560 -2.658008  0.946870

Suppose you wanted to fill forward each weekly value on the non-Wednesdays.The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:

frame.resample('D').ffill()
             Colorado      Texa  New York      Ohio
2000-01-05  -0.739109  0.781223  0.570884 -1.616556
2000-01-06  -0.739109  0.781223  0.570884 -1.616556
2000-01-07  -0.739109  0.781223  0.570884 -1.616556
2000-01-08  -0.739109  0.781223  0.570884 -1.616556
2000-01-09  -0.739109  0.781223  0.570884 -1.616556
2000-01-10  -0.739109  0.781223  0.570884 -1.616556
2000-01-11  -0.739109  0.781223  0.570884 -1.616556
2000-01-12  -0.179902  2.531560 -2.658008  0.946870

You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:

frame.resample('D').ffill(limit=2)
             Colorado      Texa  New York      Ohio
2000-01-05  -0.739109  0.781223  0.570884 -1.616556
2000-01-06  -0.739109  0.781223  0.570884 -1.616556
2000-01-07  -0.739109  0.781223  0.570884 -1.616556
2000-01-08        NaN       NaN       NaN       NaN
2000-01-09        NaN       NaN       NaN       NaN
2000-01-10        NaN       NaN       NaN       NaN
2000-01-11        NaN       NaN       NaN       NaN
2000-01-12  -0.179902  2.531560 -2.658008  0.946870

Notably,the new date index need not overlap with the old one at all:

frame.resample('W-THU').ffill()
             Colorado      Texa  New York      Ohio
2000-01-06  -0.739109  0.781223  0.570884 -1.616556
2000-01-13  -0.179902  2.531560 -2.658008  0.946870

Resampling with periods

Resampling data indexed by periods is similar to resampling timestamp-indexed data:

frame=pd.DataFrame(np.random.randn(24,4),index=pd.period_range('1-2000','12-2001',freq='M'),columns=['Colorado','Texas','New York','Ohio'])
frame[:5]
          Colorado     Texas  New York      Ohio
2000-01  -0.829993 -1.129430  1.320036  0.275144
2000-02   2.511115 -0.306541  0.472983  0.220395
2000-03  -0.037656  0.776638  0.428096 -0.274698
2000-04  -1.116895 -0.353303 -0.642274  1.469136
2000-05   0.975105 -1.160983  0.459956 -0.834690
annual_frame=frame.resample('A-DEC').mean()
annual_frame
      Colorado     Texas  New York      Ohio
2000  0.002307 -0.277951 -0.030337 -0.273060
2001  0.126581  0.275503 -0.209550  0.241073

Upsampling is more nuanced, as you must decide which end of the timespan in the new frequency to place the values before resampling, just as with the asfreq method. The convention argument defaults to 'start' but can also be 'end'.

annual_frame.resample('Q-DEC').ffill()
        Colorado     Texas  New York      Ohio
2000Q1  0.002307 -0.277951 -0.030337 -0.273060
2000Q2  0.002307 -0.277951 -0.030337 -0.273060
2000Q3  0.002307 -0.277951 -0.030337 -0.273060
2000Q4  0.002307 -0.277951 -0.030337 -0.273060
2001Q1  0.126581  0.275503 -0.209550  0.241073
2001Q2  0.126581  0.275503 -0.209550  0.241073
2001Q3  0.126581  0.275503 -0.209550  0.241073
2001Q4  0.126581  0.275503 -0.209550  0.241073
annual_frame.resample('Q-DEC',convention='end').ffill()
        Colorado     Texas  New York      Ohio
2000Q4  0.002307 -0.277951 -0.030337 -0.273060
2001Q1  0.002307 -0.277951 -0.030337 -0.273060
2001Q2  0.002307 -0.277951 -0.030337 -0.273060
2001Q3  0.002307 -0.277951 -0.030337 -0.273060
2001Q4  0.126581  0.275503 -0.209550  0.241073

Moving window functions

An important class of array transformations used for time series operations are statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data.

close_px_all=pd.read_csv(r'./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px_all.head()
             AAPL   MSFT    XOM     SPX
2003-01-02   7.40  21.11  29.22  909.03
2003-01-03   7.45  21.14  29.24  908.59
2003-01-06   7.45  21.52  29.96  929.01
2003-01-07   7.43  21.93  28.95  922.93
2003-01-08   7.28  21.31  28.83  909.93
close_px=close_px_all[['AAPL','MSFT','XOM']]
close_px=close_px.resample('B').ffill()

The rolling operator behaves similarly to resample and groupby. It can be called on a Series or DataFrame along with a window:

close_px.AAPL.plot()
[plot: AAPL daily closing price]

close_px.AAPL.rolling(300).mean().plot()
[plot: 300-day moving average of AAPL]

close_px.AAPL.rolling(250).mean().plot()
[plot: 250-day moving average of AAPL]

help(close_px.AAPL.rolling)
Help on method rolling in module pandas.core.generic:

rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None) method of pandas.core.series.Series instance
    Provide rolling window calculations.
    
    .. versionadded:: 0.18.0
    
    Parameters
    ----------
    window : int, or offset
        Size of the moving window. This is the number of observations used for
        calculating the statistic. Each window will be a fixed size.
    
        If its an offset then this will be the time period of each window. Each
        window will be a variable sized based on the observations included in
        the time-period. This is only valid for datetimelike indexes. This is
        new in 0.19.0
    min_periods : int, default None
        Minimum number of observations in window required to have a value
        (otherwise result is NA). For a window that is specified by an offset,
        `min_periods` will default to 1. Otherwise, `min_periods` will default
        to the size of the window.
    center : bool, default False
        Set the labels at the center of the window.
    win_type : str, default None
        Provide a window type. If ``None``, all points are evenly weighted.
        See the notes below for further information.
    on : str, optional
        For a DataFrame, a datetime-like column on which to calculate the rolling
        window, rather than the DataFrame's index. Provided integer column is
        ignored and excluded from result since an integer index is not used to
        calculate the rolling window.
    axis : int or str, default 0
    closed : str, default None
        Make the interval closed on the 'right', 'left', 'both' or
        'neither' endpoints.
        For offset-based windows, it defaults to 'right'.
        For fixed windows, defaults to 'both'. Remaining cases not implemented
        for fixed windows.
    
        .. versionadded:: 0.20.0
    
    Returns
    -------
    a Window or Rolling sub-classed for the particular operation
    
    See Also
    --------
    expanding : Provides expanding transformations.
    ewm : Provides exponential weighted functions.
    
    Notes
    -----
    By default, the result is set to the right edge of the window. This can be
    changed to the center of the window by setting ``center=True``.
    
    To learn more about the offsets & frequency strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    The recognized win_types are:
    
    * ``boxcar``
    * ``triang``
    * ``blackman``
    * ``hamming``
    * ``bartlett``
    * ``parzen``
    * ``bohman``
    * ``blackmanharris``
    * ``nuttall``
    * ``barthann``
    * ``kaiser`` (needs beta)
    * ``gaussian`` (needs std)
    * ``general_gaussian`` (needs power, width)
    * ``slepian`` (needs width)
    * ``exponential`` (needs tau), center is set to None.
    
    If ``win_type=None`` all points are evenly weighted. To learn more about
    different window types see `scipy.signal window functions
    <https://docs.scipy.org/doc/scipy/reference/signal.html#window-functions>`__.
    
    Examples
    --------
    
    >>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
    >>> df
         B
    0  0.0
    1  1.0
    2  2.0
    3  NaN
    4  4.0
    
    Rolling sum with a window length of 2, using the 'triang'
    window type.
    
    >>> df.rolling(2, win_type='triang').sum()
         B
    0  NaN
    1  0.5
    2  1.5
    3  NaN
    4  NaN
    
    Rolling sum with a window length of 2, min_periods defaults
    to the window length.
    
    >>> df.rolling(2).sum()
         B
    0  NaN
    1  1.0
    2  3.0
    3  NaN
    4  NaN
    
    Same as above, but explicitly set the min_periods
    
    >>> df.rolling(2, min_periods=1).sum()
         B
    0  0.0
    1  1.0
    2  3.0
    3  2.0
    4  4.0
    
    A ragged (meaning not-a-regular frequency), time-indexed DataFrame
    
    >>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
    ...                   index = [pd.Timestamp('20130101 09:00:00'),
    ...                            pd.Timestamp('20130101 09:00:02'),
    ...                            pd.Timestamp('20130101 09:00:03'),
    ...                            pd.Timestamp('20130101 09:00:05'),
    ...                            pd.Timestamp('20130101 09:00:06')])
    
    >>> df
                           B
    2013-01-01 09:00:00  0.0
    2013-01-01 09:00:02  1.0
    2013-01-01 09:00:03  2.0
    2013-01-01 09:00:05  NaN
    2013-01-01 09:00:06  4.0
    
    Contrasting to an integer rolling window, this will roll a variable
    length window corresponding to the time period.
    The default for min_periods is 1.
    
    >>> df.rolling('2s').sum()
                           B
    2013-01-01 09:00:00  0.0
    2013-01-01 09:00:02  1.0
    2013-01-01 09:00:03  3.0
    2013-01-01 09:00:05  NaN
    2013-01-01 09:00:06  4.0

To illustrate the meaning of rolling,look at the following example:

index=pd.date_range('2020-5-5',periods=20)
df=pd.DataFrame(np.arange(20),index=index,columns=['test']);df
            test
2020-05-05     0
2020-05-06     1
2020-05-07     2
2020-05-08     3
2020-05-09     4
2020-05-10     5
2020-05-11     6
2020-05-12     7
2020-05-13     8
2020-05-14     9
2020-05-15    10
2020-05-16    11
2020-05-17    12
2020-05-18    13
2020-05-19    14
2020-05-20    15
2020-05-21    16
2020-05-22    17
2020-05-23    18
2020-05-24    19
df['sum']=df.test.rolling(3).sum()
df['mean']=df.test.rolling(3).mean()

df['mean1'] = df.test.rolling(3,min_periods=2).mean()
df['expanding']=df.test.expanding().mean()
df
            test   sum  mean  mean1  expanding
2020-05-05     0   NaN   NaN    NaN        0.0
2020-05-06     1   NaN   NaN    0.5        0.5
2020-05-07     2   3.0   1.0    1.0        1.0
2020-05-08     3   6.0   2.0    2.0        1.5
2020-05-09     4   9.0   3.0    3.0        2.0
2020-05-10     5  12.0   4.0    4.0        2.5
2020-05-11     6  15.0   5.0    5.0        3.0
2020-05-12     7  18.0   6.0    6.0        3.5
2020-05-13     8  21.0   7.0    7.0        4.0
2020-05-14     9  24.0   8.0    8.0        4.5
2020-05-15    10  27.0   9.0    9.0        5.0
2020-05-16    11  30.0  10.0   10.0        5.5
2020-05-17    12  33.0  11.0   11.0        6.0
2020-05-18    13  36.0  12.0   12.0        6.5
2020-05-19    14  39.0  13.0   13.0        7.0
2020-05-20    15  42.0  14.0   14.0        7.5
2020-05-21    16  45.0  15.0   15.0        8.0
2020-05-22    17  48.0  16.0   16.0        8.5
2020-05-23    18  51.0  17.0   17.0        9.0
2020-05-24    19  54.0  18.0   18.0        9.5
df['test'].rolling(2).agg([np.sum,np.mean]) #Using agg function to return several results.
             sum  mean
2020-05-05   NaN   NaN
2020-05-06   1.0   0.5
2020-05-07   3.0   1.5
2020-05-08   5.0   2.5
2020-05-09   7.0   3.5
2020-05-10   9.0   4.5
2020-05-11  11.0   5.5
2020-05-12  13.0   6.5
2020-05-13  15.0   7.5
2020-05-14  17.0   8.5
2020-05-15  19.0   9.5
2020-05-16  21.0  10.5
2020-05-17  23.0  11.5
2020-05-18  25.0  12.5
2020-05-19  27.0  13.5
2020-05-20  29.0  14.5
2020-05-21  31.0  15.5
2020-05-22  33.0  16.5
2020-05-23  35.0  17.5
2020-05-24  37.0  18.5

Looking at the result above: since window=3, the first two results are NaN. In the 'sum' column, 3=0+1+2, 6=1+2+3, 9=2+3+4, and so on.

The expression rolling(250) is similar in behavior to groupby, but instead of grouping, it creates an object that enables grouping over a 250-day sliding window. By default, rolling functions require all of the values in the window to be non-NA. This behavior can be changed to account for missing data and, in particular, the fact that you will have fewer than window periods of data at the beginning of the time series:
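A self-contained sketch of the min_periods effect on a toy series (independent of the stock data used below):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype=float))

# With the default, the first window-1 results are NaN...
full = s.rolling(3).mean()
# ...while min_periods allows partial windows at the start.
partial = s.rolling(3, min_periods=1).mean()
print(full.tolist())     # [nan, nan, 1.0, 2.0, 3.0]
print(partial.tolist())  # [0.0, 0.5, 1.0, 2.0, 3.0]
```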

appl_std250=close_px.AAPL.rolling(250,min_periods=10).std()
appl_std250[5:12]
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64
appl_std250.plot()
[plot: AAPL 250-day rolling standard deviation]

In order to compute an expanding window mean, use the expanding operator instead of rolling. The expanding mean starts the time window from the beginning of the time series and increases the size of the window until it encompasses the whole series.

To illustrate expanding,look at the following example:

df.test
2020-05-05     0
2020-05-06     1
2020-05-07     2
2020-05-08     3
2020-05-09     4
2020-05-10     5
2020-05-11     6
2020-05-12     7
2020-05-13     8
2020-05-14     9
2020-05-15    10
2020-05-16    11
2020-05-17    12
2020-05-18    13
2020-05-19    14
2020-05-20    15
2020-05-21    16
2020-05-22    17
2020-05-23    18
2020-05-24    19
Freq: D, Name: test, dtype: int32
df.test.expanding().mean()
2020-05-05    0.0
2020-05-06    0.5
2020-05-07    1.0
2020-05-08    1.5
2020-05-09    2.0
2020-05-10    2.5
2020-05-11    3.0
2020-05-12    3.5
2020-05-13    4.0
2020-05-14    4.5
2020-05-15    5.0
2020-05-16    5.5
2020-05-17    6.0
2020-05-18    6.5
2020-05-19    7.0
2020-05-20    7.5
2020-05-21    8.0
2020-05-22    8.5
2020-05-23    9.0
2020-05-24    9.5
Freq: D, Name: test, dtype: float64

Compared with rolling, expanding's window is variable, like cumsum, while rolling's window is fixed.
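The cumsum analogy can be made concrete: the expanding mean is the cumulative sum divided by the running count of values. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 6, dtype=float))

# The expanding mean equals cumsum divided by the number of values so far.
exp_mean = s.expanding().mean()
by_hand = s.cumsum() / np.arange(1, len(s) + 1)
print(exp_mean.tolist())  # [1.0, 1.5, 2.0, 2.5, 3.0]
```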

expanding_mean=appl_std250.expanding().mean()
expanding_mean
2003-01-02          NaN
2003-01-03          NaN
2003-01-06          NaN
2003-01-07          NaN
2003-01-08          NaN
                ...    
2011-10-10    18.521201
2011-10-11    18.524272
2011-10-12    18.527385
2011-10-13    18.530554
2011-10-14    18.533823
Freq: B, Name: AAPL, Length: 2292, dtype: float64
close_px.rolling(60).mean().plot(logy=True)
[plot: 60-day moving averages of close_px, log y-axis]

The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that can be passed to resample.

ser=pd.Series(np.arange(6),pd.date_range('2020-05-05',periods=6));ser
2020-05-05    0
2020-05-06    1
2020-05-07    2
2020-05-08    3
2020-05-09    4
2020-05-10    5
Freq: D, dtype: int32
ser.rolling('2D').mean()
2020-05-05    0.0
2020-05-06    0.5
2020-05-07    1.5
2020-05-08    2.5
2020-05-09    3.5
2020-05-10    4.5
Freq: D, dtype: float64
ser.rolling('D').mean()
2020-05-05    0.0
2020-05-06    1.0
2020-05-07    2.0
2020-05-08    3.0
2020-05-09    4.0
2020-05-10    5.0
Freq: D, dtype: float64
close_px.rolling('20D').mean()
                  AAPL       MSFT        XOM
2003-01-02    7.400000  21.110000  29.220000
2003-01-03    7.425000  21.125000  29.230000
2003-01-06    7.433333  21.256667  29.473333
2003-01-07    7.432500  21.425000  29.342500
2003-01-08    7.402000  21.402000  29.240000
...                ...        ...        ...
2011-10-10  389.351429  25.602143  72.527857
2011-10-11  388.505000  25.674286  72.835000
2011-10-12  388.531429  25.810000  73.400714
2011-10-13  388.826429  25.961429  73.905000
2011-10-14  391.038000  26.048667  74.185333

2292 rows × 3 columns
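On a genuinely irregular index, the variable-width behavior is easier to see; a sketch with a hypothetical four-point series whose gaps are 1, 3, and 1 days:

```python
import pandas as pd

# Irregular timestamps: gaps of 1, 3 and 1 days.
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05', '2020-01-06'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# A '2D' window looks back two days from each timestamp, so the number
# of observations per window varies with the spacing.
print(s.rolling('2D').mean().tolist())  # [1.0, 1.5, 3.0, 3.5]
```

The third window contains only its own observation because the preceding point is more than two days earlier.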

Exponentially weighted functions

An alternative to using a static window size with equally weighted observations is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.
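The span relationship can be checked by hand on a tiny series: with span=3 the decay factor is alpha = 2/(span+1) = 0.5, and with the default adjust=True the second EW mean is a weight-normalized average of the two observations:

```python
import pandas as pd

s = pd.Series([1.0, 2.0])
span = 3  # alpha = 2 / (span + 1) = 0.5

ewma = s.ewm(span=span).mean()
# With adjust=True (the default) the second value is
# (x1 + (1 - alpha) * x0) / (1 + (1 - alpha)) = (2 + 0.5) / 1.5
print(ewma.iloc[1])  # 1.666...
```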

pandas has the ewm operator to go along with rolling and expanding. Here is an example comparing a 30-day moving average of Apple's stock price with an exponentially weighted (EW) moving average with span=30:

appl_px=close_px.AAPL['2006':'2007']
ma60=appl_px.rolling(30,min_periods=20).mean()
ewma60=appl_px.ewm(span=30).mean()
ma60.plot(label='simple MA')
[plot: 30-day simple moving average of AAPL]

ewma60.plot(label='EW MA')
[plot: exponentially weighted moving average, span=30]

User-defined moving window functions

The apply method on rolling and related methods provides a means to apply an array function of your own devising over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, while we can compute sample quantiles using rolling(...).quantile(q), we might be interested in the percentile rank of a particular value over the sample.

from scipy.stats import percentileofscore
score_at_2percent=lambda x:percentileofscore(x,0.02)
results=close_px.pct_change()
result=results.AAPL.rolling(250).apply(score_at_2percent)
C:\Users\旺仔QQ糖\AppData\Roaming\Python\Python36\site-packages\ipykernel\__main__.py:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
  if __name__ == '__main__':
result.plot()
[plot: rolling 250-day percentile rank of a 2% daily return]
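The same pattern works with any reduction; passing raw=True hands the function plain ndarrays (and silences the FutureWarning shown above). A self-contained sketch with a simple peak-to-peak reducer, a hypothetical stand-in for percentileofscore:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# Any single-value reduction works: here, the range of each 3-value window.
# raw=True passes plain ndarrays to the function, which is also faster.
ptp = s.rolling(3).apply(lambda x: x.max() - x.min(), raw=True)
print(ptp.tolist())  # [nan, nan, 3.0, 3.0, 4.0]
```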

