对于cf和swing算法的介绍可以参考ItemCF的演进:狭义 VS 广义
基于cf的推荐相似度
这里介绍这样一个场景,我们有了大量的电商购买数据,希望通过cf算法计算不同的类目之间的相似度,以方便对用户购买进行兴趣探索。
使用SQL实现需要通过join来完成user和cate构成矩阵计算cate之间的余弦相似度,公式如下,而这里又分成三步
cos
(
A
,
B
)
=
a
1
b
1
+
a
2
b
2
+
…
+
a
n
b
n
(
a
1
a
1
+
a
2
a
2
+
…
+
a
n
a
n
)
(
b
1
b
1
+
b
2
b
2
+
…
+
b
n
∗
b
n
)
\cos(A, B)=\dfrac{a_1b_1+a_2b_2+…+a_nb_n}{\sqrt{(a_1a_1+a_2a_2+…+a_na_n)(b_1b_1+b_2b_2+…+b_n*b_n)}}
cos(A,B)=(a1a1+a2a2+…+anan)(b1b1+b2b2+…+bn∗bn)a1b1+a2b2+…+anbn
user_cate
获得user对每个cate的score:普通的score可以设计为用户的订单量,但这会导致高活跃用户对相似度计算贡献太多导致失真,因此这里设计score为用户订单量对数的倒数,即对高活跃用户对相似度的贡献进行惩罚;同时为了防止高热类目计算量过大,使用随机采样保证每个类目下订单量少于10000000cc_prod
计算同user下的cate交叉score乘积,里面包含余弦相似度矩阵的分子(即内积)和分母(即模)- 最后计算余弦相似度,内积除以两个模
with user_cate as (
select
t1.user_id,
t1.cate_id,
sum(1 / log(t2.act + 1)) as score
from (
select
user_id,
category_id as cate_id,
row_number() over(partition by category_id order by rand(47)) as rn
from
project.my_table
where
p_date > '{{ds_nodash - 30}}' and p_date <= '{{ds_nodash}}'
and user_id > 0
and category_id > 0
and pay_order_cnt > 0
) t1
join (
select
user_id,
count(1) as act
from
project.my_table
where
p_date > '{{ds_nodash - 30}}' and p_date <= '{{ds_nodash}}'
and user_id > 0
and category_id > 0
and pay_order_cnt > 0
group by user_id
) t2 on t1.user_id = t2.user_id
where t1.rn < 10000000
group by t1.user_id,
t1.cate_id
),
cc_prod as (
SELECT cate1, cate2, sum(score_prod) AS prod
FROM (
SELECT t1.cate_id AS cate1, t2.cate_id AS cate2, t1.score*t2.score as score_prod
FROM user_cate t1
JOIN user_cate t2
ON t1.user_id = t2.user_id
) t
GROUP BY cate1, cate2
)
SELECT t1.cate1, t1.cate2, t1.prod/SQRT(t2.prod*t3.prod) AS sim
FROM cc_prod t1
JOIN cc_prod t2 ON t1.cate1=t2.cate1 AND t1.cate1=t2.cate2
JOIN cc_prod t3 ON t1.cate2=t3.cate1 AND t1.cate2=t3.cate2
where t1.cate1 != t1.cate2
基于swing的推荐物品相似度
swing算法的核心公式如下,其中
U
i
U_i
Ui表示购买类目
i
i
i的用户集合,
U
j
U_j
Uj表示购买类目
j
j
j的用户集合,
I
u
I_u
Iu表示用户
u
u
u购买的类目集合,
I
v
I_v
Iv表示用户
v
v
v购买的类目集合,具体算法解释可以参考swing:一种工业界广泛使用的召回算法,用SQL实现时也分为三步
user_cate
获得user和cate建立的边关系,注意这里不考虑订单量了;这里测试发现cate能接受的数据量最多采样为20000,再多就基本跑不动了uu_cate
计算任意两个user共同购买cate的配对- 最后计算任意两个user共同购买的任意两个cate的配对(t1)和任意两个user共同购买cate的数量cate_cnt(t2),将1 / (1 + cate_cnt)在任意两个cate的配对里累加,即为最后的cate相似度
with user_cate as (
select
user_id,
cate_id
from (
select
user_id,
category_id as cate_id,
row_number() over(partition by category_id order by rand(47)) as rn
from
project.my_table
where
p_date = '{{ds_nodash}}'
and user_id > 0
and category_id > 0
and pay_order_cnt > 0
) t
where rn < 20000
group by user_id,
cate_id
),
uu_cate as (
select
uc1.user_id as u1,
uc2.user_id as u2,
uc1.cate_id as cate
from user_cate as uc1
left join user_cate as uc2
on uc1.cate_id = uc2.cate_id
where uc1.user_id != uc2.user_id
)
select
cate1,
cate2,
sum(1 / (1 + cate_cnt)) as sim
from (
select
uuc1.u1,
uuc1.u2,
uuc1.cate as cate1,
uuc2.cate as cate2
from uu_cate as uuc1
left join uu_cate as uuc2
on uuc1.u1 = uuc2.u1 and uuc1.u2 = uuc2.u2
where uuc1.cate != uuc2.cate
) t1
left join (
select
u1,
u2,
count(cate) as cate_cnt
from uu_cate
group by u1, u2
) t2 on t1.u1 = t2.u1 and t1.u2 = t2.u2
group by cate1, cate2