A Complete Guide to Time Series Analysis in Pandas

What is time series analysis?

It is the analysis of a dataset that has a sequence of timestamps. It has become more and more important with the increasing emphasis on machine learning. Many different industries now use time-series data for forecasting, seasonality analysis, finding trends, and making important business and research decisions. So it is very important, as a data scientist or data analyst, to understand time series data clearly.

Time series data is used in:

Banks and Financial institutes

Stock market

Social Media

Power, gas, and oil industries

Periodic measures in a mechanical or chemical process

And many more areas

It is unavoidable. So, let's learn it.

I will start with some general functions and then demonstrate some more topics using the Facebook stock price dataset.

Let’s dive in!

Using the correct format

Time series data can come in many different formats, but not all of those formats are friendly to Python's pandas library. The most convenient format for pandas is the timestamp format. But most of the time, time-series data comes as strings. Here is an example of the different formats time series data may be found in.

import pandas as pd
import numpy as np
dates = ['2020-11-25 2:30:00 PM', 'Jan 5, 2020 18:45:00', '01/11/2020', '2020.01.11', '2020/01/11', '20201105']

The 'dates' list above shows six different date-time formats, and all of them are valid. But we need one specific format to work conveniently. So, let's convert those dates to the right format.

pd.to_datetime(dates)

Output:

DatetimeIndex(['2020-11-25 14:30:00', '2020-01-05 18:45:00', '2020-01-11 00:00:00', '2020-01-11 00:00:00', '2020-01-11 00:00:00', '2020-11-05 00:00:00'], dtype='datetime64[ns]', freq=None)
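
As a side note, real-world date columns are often messy. Here is a minimal sketch (my own example, not part of the dataset) of the errors='coerce' option, which turns unparseable strings into NaT instead of raising an error:

messy = ['2020-11-25', 'not a date', '01/11/2020']
pd.to_datetime(messy, errors='coerce')  # the bad entry becomes NaT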

Different parts of the world write dates differently. For example, in the American style, June 1st, 2020 is written as '6/1/2020'. But England, South Asian countries like India, Bangladesh, and Pakistan, and some other parts of the world put the day first and write it as '1/6/2020'.

If you are working for a client from those other parts of the world, here is how to format the dates.

pd.to_datetime(dates).strftime('%d-%m-%y')

Output:

Index(['25-11-20', '05-01-20', '11-01-20', '11-01-20', '11-01-20', '05-11-20'], dtype='object')

I just used '%d-%m-%y' as the format here; you can change the sequence as required. If you need to put the month or the year first, you only need to change the sequence in the format string.
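
Note that strftime only changes how the dates are displayed. If the input strings themselves are day-first, the parsing step needs to know that too. A small sketch using to_datetime's dayfirst parameter (my own example):

pd.to_datetime('1/6/2020', dayfirst=True)  # parsed as June 1st: Timestamp('2020-06-01 00:00:00')
pd.to_datetime('1/6/2020')                 # default is month first: Timestamp('2020-01-06 00:00:00')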

Now I will import the dataset that we will use to demonstrate many of the functions: our Facebook stock data. Feel free to download the dataset here and follow along.

The only way you will learn is by doing.

If you are reading this to learn, I strongly recommend practicing along with reading.

df = pd.read_csv('FB_data.csv')
df.head()


This is the raw dataset. It doesn't look bad: well organized. But we need to change the format of the 'Date' column as we discussed earlier. We will convert it to a DatetimeIndex and set it as the index column.

When the 'Date' column is the index column, we will be able to resample the data very easily. You will see what that means in the later sections.

Here is the correct way of importing the data where I am changing the format of the dates and setting it as an index while importing.

df = pd.read_csv('FB_data.csv', parse_dates=['Date'], index_col="Date")
df.head()


Look, we changed the format of the ‘Date’ column!

Resampling

In this section, I will discuss how to resample the data.

Why is resampling important?

Because we do not always need all the data in a huge dataset. For example, we may need only the data from June 2019. If your index is a DatetimeIndex, it is very easy (in recent pandas versions, partial-date row selection must go through .loc):

df["2019-06"]


We have data for only seven trading days in June 2019. Let's find the mean of the opening stock price for that month.

df["2019-06"].Open.mean()

Output:

190.71000014285715

We can get the data on an individual date as well.

df.loc["2019-06-21"]

Output:

Open         1.887500e+02
High         1.920000e+02
Low          1.887500e+02
Close        1.911400e+02
Adj Close    1.911400e+02
Volume       2.275120e+07
Name: 2019-06-21 00:00:00, dtype: float64

Let’s say, we need two weeks’ data from June 27th to July 10th of 2019.

df.loc["2019-06-27": "2019-07-10"]

You can also resample by month, which gives monthly aggregates. For example, here I will get the monthly average of the closing price:

df.Close.resample('M').mean()

Output:

Date
2019-06-30    190.324286
2019-07-31    199.595454
2019-08-31    184.497726
2019-09-30    185.735000
2019-10-31    184.383912
2019-11-30    195.718500
2019-12-31    201.951904
2020-01-31    216.643333
2020-02-29    207.505263
2020-03-31    165.747727
2020-04-30    177.003335
2020-05-31    216.549001
2020-06-30    232.671332
Freq: M, Name: Close, dtype: float64

We can take the monthly average and plot with just one line of code:

df.Close.resample('M').mean().plot()


If you want weekly data and a plot of it, you can get that with this code:

df.Close.resample('W').mean().plot()

Instead of a simple line plot, you can get a total of 11 types of plots using the 'kind' parameter of the plot() function.

I list those plot types right after this bar plot example.

I will make a bar plot of the quarterly closing data.

df.Close.resample('Q').mean().plot(kind='bar')


The 'kind' parameter above takes the following 11 types of visualization:

  1. line
  2. area
  3. bar
  4. barh
  5. pie
  6. box
  7. hexbin
  8. hist
  9. kde
  10. density
  11. scatter

Please see this article for details about those visualizations.

Shift

As the name suggests, the shift function shifts the data by the specified number of periods. The FB dataset we are using starts on June 20th, 2019. Now, if we shift our data by 1, the June 20th, 2019 data will move to June 21st, 2019, the June 21st data will shift to June 22nd, and so on.

I will explain a little later why people use shift.

For this example, I will only use the 'Open' column, so you will see the shifts very clearly. I will make a new DataFrame called 'df1' with only the opening data.

df1 = pd.DataFrame(df['Open'])
df1.head()


Shift this data by 1.

df1.shift(1)


The first row has a null value, simply because the first row's value moved to the second row and there is no data before the first row.

If you use a negative value in shift, it will do just the opposite.

df1.shift(-1)


Why do we use shift?

Shift gives you the previous day's data or the next day's data in the same row. Here is a use case: I will put today's data and the previous day's data side by side using shift.

df1['Prev Day Opening'] = df1['Open'].shift(1)
df1


Isn’t it useful!?

You can also compute the 1-day change in another column:

df1['1 day change'] = df1['Open'] - df1['Prev Day Opening']

Let's find the one-week total return in percentage. For that, we have to shift by 5 rows, since there are 5 trading days in a week. Right? Then take the difference between today's data and the data from 5 days earlier, multiply by 100, and divide by the data from 5 days earlier. Let's see it to understand it better.

df1['One week total return'] = (df1['Open'] - df1['Open'].shift(5)) * 100 / df1['Open'].shift(5)
df1.tail()


I am taking df1.tail() here because we did a 5-day shift, so the first 5 rows are null.
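
As a side note, pandas has a built-in pct_change method that should produce the same result in one call; it returns a fraction, so we multiply by 100:

weekly_return_pct = df1['Open'].pct_change(5) * 100  # fractional change over 5 rows, as a percentage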

Timezone

Understanding timezones is important. You are probably in one timezone while your client is in another. Pandas has great functionality for dealing with different timezones.

There are two types of DateTime data: naive DateTime, which has no idea about timezones, and timezone-aware DateTime, which knows its time zone. The data we have is naive DateTime, so we need to use tz_localize to convert it.

Convert the index of the Facebook dataset to ‘US/Eastern’.

df.index = df.index.tz_localize(tz='US/Eastern')
df.index

Output:

DatetimeIndex(['2019-06-20 00:00:00-04:00', '2019-06-21 00:00:00-04:00', '2019-06-24 00:00:00-04:00', '2019-06-25 00:00:00-04:00', '2019-06-26 00:00:00-04:00', '2019-06-27 00:00:00-04:00', '2019-06-28 00:00:00-04:00', '2019-07-01 00:00:00-04:00', '2019-07-02 00:00:00-04:00', '2019-07-03 00:00:00-04:00', ... '2020-06-08 00:00:00-04:00', '2020-06-09 00:00:00-04:00', '2020-06-10 00:00:00-04:00', '2020-06-11 00:00:00-04:00', '2020-06-12 00:00:00-04:00', '2020-06-15 00:00:00-04:00', '2020-06-16 00:00:00-04:00', '2020-06-17 00:00:00-04:00', '2020-06-18 00:00:00-04:00', '2020-06-19 00:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', name='Date', length=253, freq=None)

Each date now shows a UTC offset of -04:00. And if we convert it to 'Europe/Berlin', it will add 6 hours to the wall-clock times.

df = df.tz_convert('Europe/Berlin')
df.index

Output:

DatetimeIndex(['2019-06-20 06:00:00+02:00', '2019-06-21 06:00:00+02:00', '2019-06-24 06:00:00+02:00', '2019-06-25 06:00:00+02:00', '2019-06-26 06:00:00+02:00', '2019-06-27 06:00:00+02:00', '2019-06-28 06:00:00+02:00', '2019-07-01 06:00:00+02:00', '2019-07-02 06:00:00+02:00', '2019-07-03 06:00:00+02:00', ... '2020-06-08 06:00:00+02:00', '2020-06-09 06:00:00+02:00', '2020-06-10 06:00:00+02:00', '2020-06-11 06:00:00+02:00', '2020-06-12 06:00:00+02:00', '2020-06-15 06:00:00+02:00', '2020-06-16 06:00:00+02:00', '2020-06-17 06:00:00+02:00', '2020-06-18 06:00:00+02:00', '2020-06-19 06:00:00+02:00'], dtype='datetime64[ns, Europe/Berlin]', name='Date', length=253, freq=None)
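
If you ever need plain naive timestamps back, tz_localize(None) strips the zone information. A one-line sketch (this does not modify df):

naive_index = df.index.tz_localize(None)  # timezone dropped, local wall-clock times kept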

You can list all the time zones available in the world and pick the one that suits you this way:

from pytz import all_timezones
print(all_timezones)

Here is part of the output. The full output is too big:

['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Asmera', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo',.....

How to generate missing dates

What if you have the data and you know the period, but the dates were not recorded in the dataset? Or you have the dates for the second quarter of last year but not for this year, and you need to reuse last year's dates this year?

There may be many occasions where you need to generate a series of dates. Pandas' date_range function comes in handy here. Let's generate a period of 10 days:

rng = pd.date_range(start='11/1/2020', periods=10)
rng

Output:

DatetimeIndex(['2020-11-01', '2020-11-02', '2020-11-03', '2020-11-04', '2020-11-05', '2020-11-06', '2020-11-07', '2020-11-08', '2020-11-09', '2020-11-10'], dtype='datetime64[ns]', freq='D')

What if I need only business days?

We only need to add an extra 'freq' parameter, like this:

rng = pd.date_range(start='11/1/2020', periods=10, freq='B')
rng

Output:

DatetimeIndex(['2020-11-02', '2020-11-03', '2020-11-04', '2020-11-05', '2020-11-06', '2020-11-09', '2020-11-10', '2020-11-11', '2020-11-12', '2020-11-13'], dtype='datetime64[ns]', freq='B')
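
There are many more frequency aliases like 'B'. A few common ones as a quick sketch (see the pandas documentation for the full list):

pd.date_range('2020-11-01', periods=3, freq='H')   # hourly
pd.date_range('2020-11-01', periods=3, freq='W')   # weekly, anchored on Sundays by default
pd.date_range('2020-11-01', periods=3, freq='MS')  # month start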

For several more options and frequencies like these, please check this article, where I explained the date_range function in detail:

So, we can generate a series of dates and add them to our dataset!
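
For instance, here is a small sketch (with a made-up toy series) of using a generated range to fill in missing dates via reindex:

s = pd.Series([10.0, 12.0, 11.0],
              index=pd.to_datetime(['2020-11-01', '2020-11-04', '2020-11-05']))
full_range = pd.date_range(start='2020-11-01', end='2020-11-05', freq='D')
s_full = s.reindex(full_range)  # the absent days appear as NaN
s_full = s_full.ffill()         # then fill them, e.g., forward-fill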

Rolling

The rolling function aggregates data over a specified window of consecutive rows. Here I will take the mean of every three days. I will explain some more after working through this example:

df[["High"]].rolling(3).mean()[:10]


What exactly happened here? I passed 3 as the window argument of the rolling function, and the aggregate function is mean. So, it takes the mean of the June 20th, 21st, and 24th 'High' data and puts it on the 24th, does the same for the 21st, 24th, and 25th data and puts it on the 25th, and so on. A lot of the time we use such weekly or 3-day averages to make decisions.

You can also choose where to put the rolling data. Here is an example:

data_rol = df[['High', 'Low']].rolling(window=7, center=True).mean()
data_rol


Here, in the rolling function, I passed window=7, which means it will take a 7-day average. center=True means it will put that average in the 4th row of the window instead of the 7th. That's why there are some null values at the bottom as well.
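
If those null edges are a problem, the standard min_periods parameter lets the window shrink at the boundaries. A quick sketch:

# With min_periods=1, edge rows average whatever data is available instead of returning null
df[['High', 'Low']].rolling(window=7, center=True, min_periods=1).mean()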

Let's plot the original 'High' data and the 7-day rolling 'High' data in the same plot:

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(df['High'], marker='.', markersize=4, color='0.4', linestyle='None', label='Daily')
ax.plot(data_rol['High'], linewidth=2, label='7-d rolling mean')
ax.xaxis.set_major_locator(ticker.MultipleLocator(30))
ax.set_xlabel('Month')
ax.legend()


Usually, this type of plot is used to observe trends in the data.

In our data, there is an observable trend: after January 2020 the values start dropping, and the curve is steep. In time series analysis we sometimes work to find the trend. But sometimes we need to remove the trend from the data, especially when we need to use the time series for machine learning or forecasting.

Again, after March it has a steep rise. In the next section, I will show you how to get rid of this type of trend.

Differencing for removing the trend

If there is a trend in the data, it is not good for modeling, forecasting, or observing seasonality. To improve model performance, or to observe any seasonality or noise in the data, differencing is a common practice. It takes the difference of the data over a specified number of rows. Here is an example:

df_first_order_diff = df[['High', 'Low']].diff()
df_first_order_diff


Here I did not specify the number of periods in the .diff() function, so by default it takes just a 1-row difference.

Do you see what happened in the resulting table? The 'High' and 'Low' values on 2019-06-21 are the differences between the 2019-06-21 and 2019-06-20 data. The first row, 2019-06-20, is null because there is no data before it to subtract.

This process of differencing is supposed to remove the trend. If by any chance it does not, try a 3-day or 7-day differencing. This is how to take a 3-day difference:

df[['High', 'Low']].diff(3)
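
As a sanity check (my own snippet), diff is just subtraction of a shifted copy, which ties this section back to the shift function from earlier:

manual = df['High'] - df['High'].shift(1)
manual.equals(df['High'].diff())  # True: diff(1) equals subtracting a 1-row shift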

Let's plot the first-order differenced data from above to see if the trend we observed in the last section is removed.

start = '2019-06-20'
fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(df_first_order_diff.loc[start:, 'High'], marker='o', markersize=4, linestyle='-', label='First Order Differencing')
ax.xaxis.set_major_locator(ticker.MultipleLocator(30))
ax.legend()


Look, that obvious trend is gone! Feel free to check whether the 3-day differencing I talked about earlier gets rid of the slight trend at the end.

Important Time Feature Extraction

You can extract the year, month, week, or weekday from a time series, which can be very useful. Let's start with extracting the year from our index column 'Date'.

pd.DatetimeIndex(df.index).year

Output:

Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, ... 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], dtype='int64', name='Date', length=253)

Now, take a subset of the dataset to make it smaller and add the years in a separate column.

df3 = df[['High', 'Low', 'Volume']].copy()  # .copy() avoids a SettingWithCopyWarning on the next line
df3['Year'] = pd.DatetimeIndex(df3.index).year
df3


See, we added the year at the end. In the same way, you can extract the month and the weekday. Here is the code for the month:

pd.DatetimeIndex(df3.index).month

Output:

Int64Index([6, 6, 6, 6, 6, 6, 6, 7, 7, 7, ... 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], dtype='int64', name='Date', length=253)

Here is how to extract the weekdays:

pd.DatetimeIndex(df3.index).weekday

Output:

Int64Index([3, 4, 0, 1, 2, 3, 4, 0, 1, 2, ... 0, 1, 2, 3, 4, 0, 1, 2, 3, 4], dtype='int64', name='Date', length=253)

The weekdays come out as numbers. What if you need the weekday names, such as Sunday, Monday, and so on? That would be more useful!

df3['Weekday'] = pd.DatetimeIndex(df3.index).to_series().dt.day_name()
df3.head()


Let’s check if weekday has any effect on the ‘High’, ‘Low’, and ‘Volume’ data.

import seaborn as sns

fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(['High', 'Low', 'Volume'], axes):
    sns.boxplot(data=df3, x='Weekday', y=name, ax=ax)
    ax.set_title(name)


Weekday has an effect on those data, right? On Wednesday, 'High', 'Low', and 'Volume' are all higher; on Monday it's the opposite. Boxplots pack a lot of information into one bundle. If you need a refresher on how to extract all that information from boxplots, here is a detailed article.
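
For a numeric counterpart to the boxplots, here is a quick sketch that averages each column per weekday:

df3.groupby('Weekday')[['High', 'Low', 'Volume']].mean()  # mean of each column for each weekday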

Period and PeriodIndex

Another essential piece of pandas functionality. We will learn it by doing. The most basic way of using the Period function:

y = pd.Period('2020')
y

Output:

Period('2020', 'A-DEC')

This output shows that the period '2020' will end in December. Makes sense, right?

Here is the directory of all the attributes and methods available on the Period object:

dir(y)

Here is part of the output, because the full directory is big:

.....'asfreq', 'day', 'dayofweek', 'dayofyear', 'days_in_month', 'daysinmonth', 'end_time', 'freq', 'freqstr', 'hour', 'is_leap_year', 'minute', 'month', 'now', 'ordinal', 'quarter', 'qyear', 'second', 'start_time', 'strftime', 'to_timestamp', 'week', 'weekday', 'weekofyear', 'year']

I will demonstrate some of them.

y.start_time

Output:

Timestamp('2020-01-01 00:00:00')

Input:

y.end_time

Output:

Timestamp('2020-12-31 23:59:59.999999999')

If we use month as the frequency:

month = pd.Period('2020-2', freq='M')
month

Output:

Period('2020-02', 'M')

Input:

month.start_time

Output:

Timestamp('2020-02-01 00:00:00')

Input:

month.end_time

Output:

Timestamp('2020-02-29 23:59:59.999999999')

What can we do with this type of month data?

You can add or subtract as necessary. For example, if you have period data for students and need to update the years or months, you can do that like this:

month + 2

Output:

Period('2020-04', 'M')

In the same way, you can add or subtract days. If we pass a date, it will use daily frequency by default.

d = pd.Period('2020-02-28')
d

Output:

Period('2020-02-28', 'D')

If you add a day or two, it will add a day or two. But the date I used here is February 28th, which is special, right? In leap years February has 29 days, and in other years it has 28.

Let’s add 2 days on top of the date d above:

d + 2

Output:

Period('2020-03-01', 'D')

After adding 2 days to February 28th, I got March 1st, not March 2nd. That means the Period function knows 2020 is a leap year and counts February 29th.

In the same way, you can add years, hours, minutes, even quarters.
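
For instance, a small sketch with an hourly period (my own example):

h = pd.Period('2020-02-28 10:00', freq='H')
h + 5  # Period('2020-02-28 15:00', 'H')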

Let’s see an example of a quarter:

q = pd.Period('2020Q1')
q

Output:

Period('2020Q1', 'Q-DEC')

Check the start and end month of q.

q.asfreq('M', how='start')

Output:

Period('2020-01', 'M')

The first month of 2020Q1 is January; that means by default the first quarter starts in January. Now check when it ends, though we know it should end in March.

q.asfreq('M', how='end')

Output:

Period('2020-03', 'M')

The Period q starts in January and ends in March. But the business year does not start in January and end in December everywhere. I will talk about it some more in a minute.

Add 1 to the q:

q + 1

Output:

Period('2020Q2', 'Q-DEC')

Here, 'Q-DEC' means the quarters are anchored to a year ending in December: there are four quarters in a year, and the last quarter ends in December. But there are several industries out there that use January, or June, as the end of the 4th quarter.

We can specify the end of the quarters using the 'freq' parameter. In the next example, I will set the end of the fourth quarter to January.

q1 = pd.Period('2020Q2', freq='Q-JAN')
q1

Output:

Period('2020Q2', 'Q-JAN')

Look, here we changed the end of the fiscal year to January! Feel free to check the start and end months of q1. Because the first quarter of a January-anchored year runs from February through April, 2020Q2 should start in May 2019 rather than in April 2020, as shown in the sketch below.
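
A quick sketch to verify that (pandas labels a fiscal year by the year it ends in, so these are the months I would expect):

q1.asfreq('M', how='start')  # expected: Period('2019-05', 'M')
q1.asfreq('M', how='end')    # expected: Period('2019-07', 'M')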

Just as we generated a date_range before, we can generate a period_range as well:

idx = pd.period_range('2017', '2020', freq='Q')
idx

Output:

PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4', '2020Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

By default, it starts with '2017Q1', because by default the quarters start in January and end in December. But as before, if we anchor the end of the year to January, it will start with 2017Q4.

idx = pd.period_range('2017', '2020', freq='Q-JAN')
idx

Output:

PeriodIndex(['2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4', '2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-JAN]', freq='Q-JAN')

You can convert these quarters to timestamps:

idx = idx.to_timestamp()
idx

Output:

DatetimeIndex(['2016-11-01', '2017-02-01', '2017-05-01', '2017-08-01', '2017-11-01', '2018-02-01', '2018-05-01', '2018-08-01', '2018-11-01', '2019-02-01', '2019-05-01', '2019-08-01', '2019-11-01'], dtype='datetime64[ns]', freq='QS-NOV')

Again, when we have timestamps, we can convert them to quarters using to_period().

idx.to_period()

Output:

PeriodIndex(['2016Q4', '2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4'], dtype='period[Q-DEC]', freq='Q-DEC')

Congratulations! You have just learned how to perform time series analysis on any dataset!

Conclusion

I tried to document and explain most of the major pandas functions for time series analysis. After working through this entire page, you should have enough knowledge to perform an efficient time series analysis on any time series data. But remember, it will take a lot of practice to become proficient at using all these functions! Happy coding!

Feel free to follow me on Twitter and like my Facebook page.

#datascience #dataanalytics #python #programming #DataAnalysis
