数据可视化基础专题(三十四):Pandas基础(十四) 分组(二)Aggregation/apply

摘要:
聚合创建GroupBy对象后,可以使用多种方法对分组数据执行计算。这些操作与聚合API、窗口API和示例API类似。没有任何其他方法

Aggregation

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. These operations are similar to the aggregating APIwindow API, and resample API.

An obvious one is aggregation via the aggregate() or equivalently agg() method:

In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

Note that you could use the reset_index DataFrame function to achieve the same result as the column names are stored in the resulting MultiIndex:

In [74]: df.groupby(["A", "B"]).sum().reset_index()
Out[74]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index are the group names and whose values are the sizes of each group.

In [75]: grouped.size()
Out[75]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2
In [76]: grouped.describe()
Out[76]: 
      C                                                    ...         D                                                  
  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

Another aggregation example is to compute the number of unique values of each group. This is similar to the value_counts function, except that it only counts unique values.

In [77]: ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]

In [78]: df4 = pd.DataFrame(ll, columns=["A", "B"])

In [79]: df4
Out[79]: 
     A  B
0  foo  1
1  foo  2
2  foo  2
3  bar  1
4  bar  1

In [80]: df4.groupby("A")["B"].nunique()
Out[80]: 
A
bar    1
foo    2
Name: B, dtype: int64

Aggregating functions are the ones that reduce the dimension of the returned objects. Some common aggregating functions are tabulated below:

Function

Description

mean()

Compute mean of groups

sum()

Compute sum of group values

size()

Compute group sizes

count()

Compute count of group

std()

Standard deviation of groups

var()

Compute variance of groups

sem()

Standard error of the mean of groups

describe()

Generates descriptive statistics

first()

Compute first of group values

last()

Compute last of group values

nth()

Take nth value, or a subset if n is a list

min()

Compute min of group values

max()

Compute max of group values

The aggregating functions above will exclude NA values. Any function which reduces a Series to a scalar value is an aggregation function and will work, a trivial example is df.groupby('A').agg(lambda ser: 1). Note that nth() can act as a reducer or a filter, see here.

1 Applying multiple functions at once

With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:

In [83]: grouped.agg([np.sum, np.mean, np.std])
Out[83]: 
            C                             D                    
          sum      mean       std       sum      mean       std
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this:

In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

For a grouped DataFrame, you can rename in a similar manner:

In [85]: (
   ....:     grouped.agg([np.sum, np.mean, np.std]).rename(
   ....:         columns={"sum": "foo", "mean": "bar", "std": "baz"}
   ....:     )
   ....: )
   ....: 
Out[85]: 
            C                             D                    
          foo       bar       baz       foo       bar       baz
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

In general, the output column names should be unique. You can’t apply the same function (or two functions with the same name) to the same column.

In [86]: grouped["C"].agg(["sum", "sum"])
Out[86]: 
          sum       sum
A                      
bar  0.392940  0.392940
foo -1.796421 -1.796421

pandas does allow you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless) lambda functions, appending _<i> to each subsequent lambda.

In [87]: grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
Out[87]: 
     <lambda_0>  <lambda_1>
A                          
bar    0.331279    0.084917
foo    2.337259   -0.215962

2 Named aggregation

To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

  • The keywords are the output column names

  • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....: 

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....: 
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

If your desired output column names are not valid Python keywords, construct a dictionary and unpack the keyword arguments

In [92]: animals.groupby("kind").agg(
   ....:     **{
   ....:         "total weight": pd.NamedAgg(column="weight", aggfunc=sum)
   ....:     }
   ....: )
   ....: 
Out[92]: 
      total weight
kind              
cat           17.8
dog          205.5

Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation functions requires additional arguments, partially apply them with functools.partial().

Note

For Python 3.5 and earlier, the order of **kwargs in a functions was not preserved. This means that the output column ordering would not be consistent. To ensure consistent ordering, the keys (and so output columns) will always be sorted for Python 3.5.

Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

In [93]: animals.groupby("kind").height.agg(
   ....:     min_height="min",
   ....:     max_height="max",
   ....: )
   ....: 
Out[93]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

3 Applying different functions to DataFrame columns

By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:

In [94]: grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
Out[94]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:

In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

4 Cython-optimized aggregation functions

Some common aggregations, currently only summeanstd, and sem, have optimized Cython implementations:

In [96]: df.groupby("A").sum()
Out[96]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [97]: df.groupby(["A", "B"]).mean()
Out[97]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.491888  0.807291
    three -0.862495  0.024580
    two    0.024925  0.592714

Of course sum and mean are implemented on pandas objects, so the above code would work even without the special versions via dispatching (see below).

Flexible apply

Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:

In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...   
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64

The dimension of the returned result can also change:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211

apply on a Series can operate on a returned value from the applied function, that is itself a series, and possibly upcast the result to a DataFrame:

In [162]: def f(x):
   .....:     return pd.Series([x, x ** 2], index=["x", "x^2"])
   .....: 

In [163]: s = pd.Series(np.random.rand(5))

In [164]: s
Out[164]: 
0    0.321438
1    0.493496
2    0.139505
3    0.910103
4    0.194158
dtype: float64

In [165]: s.apply(f)
Out[165]: 
          x       x^2
0  0.321438  0.103323
1  0.493496  0.243538
2  0.139505  0.019462
3  0.910103  0.828287
4  0.194158  0.037697

免责声明:文章转载自《数据可视化基础专题(三十四):Pandas基础(十四) 分组(二)Aggregation/apply》仅用于学习参考。如对内容有疑问,请及时联系本站处理。

上篇区块链共识算法 PBFT(拜占庭容错)、PAXOS、RAFT简述C# params 用法简介下篇

宿迁高防,2C2G15M,22元/月;香港BGP,2C5G5M,25元/月 雨云优惠码:MjYwNzM=

相关文章

利用Python进行数据分析-Pandas(第五部分-数据规整:聚合、合并和重塑)

  在许多应用中,数据可能分散在许多文件或数据库中,存储的形式也不利于分析。本部分关注可以聚合、合并、重塑数据的方法。 1、层次化索引   层次化索引(hierarchical indexing)是pandas的一项重要功能,它使你能在一个轴上拥有多个(两个以上)索引级别。抽象点说,它使你能以低纬度形式处理高纬度数据。我们来看一个简单的栗子:创建一个Ser...

Pandas系列(十一)-文件IO操作

数据分析过程中经常需要进行读写操作,Pandas实现了很多 IO 操作的API,这里简单做了一个列举。 格式类型 数据描述 Reader Writer text CSV read_ csv to_csv text JSON read_json to_json text HTML read_html to_html text clipb...

pandas重塑层次化索引(stack()和unstack()函数解析)

在数据处理时,有时需要对数据的结构进行重排,也称作是重塑(Reshape)或者轴向旋转(Pivot)。而运用层次化索引可为 DataFrame 的数据重排提供良好的一致性。在 pandas 中提供了实现重塑的两个函数,即 stack() 函数和 unstack() 函数。常见的数据层次化结构有两种,一种是表格,如图 1 所示;另一种是“花括号”,如图 2...

关于数据分析与可视化

今天小编来分享在pandas当中经常会被用到的方法,篇幅可能有点长但是提供的都是干货,读者朋友们看完之后也可以点赞收藏,相信会对大家有所帮助,大致本文会讲述这些内容 DataFrame初印象 读取表格型数据 筛选出特定的行 用pandas来绘图 在DataFrame中新增行与列 DataFrame中的统计分析与计算 DataFrame中排序问题 合并多...

python绘图:matplotlib和pandas的应用

在进行数据分析时,绘图是必不可少的模式探索方式。用Python进行数据分析时,matplotlib和pandas是最常用到的两个库。1、matplotlib库的应用准备工作如下:打开ipython,输入命令分别导入numpy和matplotlib.pylab库。 [python] view plain copy  import numpy as n...

Python之pandas读取mysql中文乱码问题

# -*- coding: utf-8 -*- # author:baoshan import pandas as pd import pymysql config = { "host": "localhost", "port": 3306, "user": "root", "password": "12...