目录
- 1,分布函数,
- 1.1,percent_rank()
- 1.2,cume_dist()
- 1.3 SQL例子
- 2,Pandas 实现
- 3,补充Pandas实现排序
1,分布函数,
应用场景:快速查看某个记录所归属的组内的比例
分布函数分类及基础语法:
percent_rank()over(partition by 分区字段 order by 排序字段 desc/asc);
cume_dist()over (partition by 分区字段 order by 排序字段 desc/asc);
1.1,percent_rank()
显示的结果,每行按照如下公式进行计算: (rank-1)/(rows-1)
rank 为 RANK() 函数产生的序号,rows 为当前窗口的记录总行数
对于重复值,取重复值的第一行记录的位置
1.2,cume_dist()
显示的结果,每行按照如下公式进行计算:rank/rows
rank 为 RANK() 函数产生的序号,rows 为当前窗口的记录总行数
对于重复值,取重复值的最后一行记录的位置
1.3 SQL例子
select
uid,exam_id,score,
rank() over(partition by exam_id order by score desc) ranks,
count(score) over(partition by exam_id order by score desc) row_num,
percent_rank() over(partition by exam_id order by score desc) percent_ranks,
cume_dist() OVER (partition by exam_id order by score desc) as cume_dists
from exam_record
输出结果:
2,Pandas 实现
import pandas as pd
test_df = pd.read_csv('../data/test.csv',dtype={'uid':object,'exam_id':object})
test_df.sort_values(by=['exam_id','score'],ascending=[True,False],inplace=True)
test_df['rank'] = test_df.groupby(['exam_id'])['score'].rank(ascending=False,method='min') ##分组排序,method='min' 等同于 rank()
test_df['row_num'] = test_df.groupby(['exam_id'])['score'].transform('count') ##分组,使用transform,返回与输入相同尺寸的对象
test_df['percent_rank'] = test_df.apply(lambda x:(x['rank']-1)/(x['row_num']-1),axis=1) ## 计算 percent_rank
test_df['cume_dist'] = test_df.apply(lambda x:x['rank']/x['row_num'],axis=1) ## 计算 cume_dist
test_df
输出结果:
3,补充Pandas实现排序
import pandas as pd
test_df2 = pd.read_csv('../data/test.csv',dtype={'uid':object,'exam_id':object})
test_df2.sort_values(by=['exam_id','score'],ascending=[True,False],inplace=True)
test_df2['row_number'] = test_df2.groupby(['exam_id'])['score'].rank(ascending=False,method='first') ## method='first' 等同于 row_number()
test_df2['rank'] = test_df2.groupby(['exam_id'])['score'].rank(ascending=False,method='min') ## method='min' 等同于 rank()
test_df2['dense_rank'] = test_df2.groupby(['exam_id'])['score'].rank(ascending=False,method='dense') ## method='dense' 等同于 dense_rank()
test_df2
输出结果: