In real business scenarios, we sometimes do not want to expose the data storage bucket directly to upstream systems; instead we set up a landing path (a landing bucket) for upstream systems to send data to.
As shown in the diagram:
We only grant the upstream system permissions on the landing bucket. Since the upstream system cannot access the storage bucket, data isolation is guaranteed.
The remaining question is how files dropped into the landing bucket get moved to the storage bucket automatically.
We could of course write code and build some ETL service for this: the service would periodically check whether new files have arrived in the landing bucket and, if so, move them into the storage bucket.
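A minimal sketch of that DIY approach, assuming gsutil and cron are available and reusing the two demo bucket names from later in this post:

# crontab entry (hypothetical): every 5 minutes, move whatever landed in the src bucket to the target bucket
*/5 * * * * gsutil -m mv "gs://jason-hsbc-demo-src/*" "gs://jason-hsbc-demo-target/"

This works, but it is polling-based and we have to run and schedule the job ourselves. Storage Transfer Service, introduced next, removes that burden.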
Introduction to Storage Transfer Service
Google already offers a product called Storage Transfer Service, which can transfer files between two buckets (it can even use an AWS S3 bucket as the source, which is outside the scope of this post).
A rough architecture diagram:
Component overview
src bucket
The source bucket of the Storage Transfer Service.
storage notification
To make the Storage Transfer Service event driven, we must create a bucket notification on the src bucket. Whenever a new file arrives or an existing one is changed (the event types are configurable), the bucket notification publishes a message (the new file's metadata) to a Pub/Sub topic.
Reference:
https://cloud.google.com/storage/docs/pubsub-notifications
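For reference, this kind of notification can also be created manually with gsutil (the Terraform version used in this post's example appears later); the bucket and topic names here are the demo ones:

gsutil notification create -t topic-sts-demo -f json -e OBJECT_FINALIZE gs://jason-hsbc-demo-src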
pubsub topic
Pub/Sub is the key component for making the flow event driven. The topic's main job is to receive the messages published by the storage notification.
pubsub pull subscription
The subscription exists so that the transfer streaming job downstream can consume those messages (published by the storage notification).
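Once everything is wired up, a quick way to see what these messages look like is to pull one from the subscription (using the demo subscription name created later):

gcloud pubsub subscriptions pull subscription-sts-demo --limit=1 --auto-ack

Note that pulling with --auto-ack consumes the message, so the transfer job would not see that particular event.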
Storage Transfer streaming job
This is the core of the whole flow. The job runs around the clock, monitoring the Pub/Sub subscription; whenever a new message arrives, it moves the corresponding file from the src bucket to the target bucket.
The flow is not hard to understand.
A concrete example
Create the buckets
First, we create two buckets:
jason-hsbc-demo-src
jason-hsbc-demo-target
// define src bucket
resource "google_storage_bucket" "bucket-jason-hsbc-demo-src" {
  name     = "jason-hsbc-demo-src"
  project  = var.project_id
  location = var.region_id
}

// define target bucket
resource "google_storage_bucket" "bucket-jason-hsbc-demo-target" {
  name     = "jason-hsbc-demo-target"
  project  = var.project_id
  location = var.region_id
}
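The var.* references used in these snippets are assumed to come from a variables.tf along the following lines (names and example values are illustrative, not part of the original code):

// variables.tf (assumed)
variable "project_id" {
  type = string
}

variable "region_id" {
  type = string
}

// GCS service agent, e.g. service-<PROJECT_NUMBER>@gs-project-accounts.iam.gserviceaccount.com
variable "gcs_sa" {
  type = string
}

// Storage Transfer Service agent, e.g. project-<PROJECT_NUMBER>@storage-transfer-service.iam.gserviceaccount.com
variable "sts_sa" {
  type = string
}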
Create the Pub/Sub topic and subscription
topic: topic-sts-demo
subscription: subscription-sts-demo
// define a pubsub topic
resource "google_pubsub_topic" "topic_sts_demo" {
  name    = "topic-sts-demo"
  project = var.project_id
}

// define a pubsub subscription
resource "google_pubsub_subscription" "subscription_sts_demo" {
  name    = "subscription-sts-demo"
  topic   = google_pubsub_topic.topic_sts_demo.name
  project = var.project_id
}
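After terraform apply, the two Pub/Sub resources can be quickly verified with gcloud:

gcloud pubsub topics describe topic-sts-demo
gcloud pubsub subscriptions describe subscription-sts-demo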
Grant permissions
- Grant publish permission on the topic to the GCS service agent account; otherwise the storage notification has no permission to publish messages to Pub/Sub.
resource "google_pubsub_topic_iam_binding" "topic_sts_demo_binding" {
topic = google_pubsub_topic.topic_sts_demo.id
role = "roles/pubsub.publisher"
members = ["serviceAccount:${var.gcs_sa}"]
}
To find out the GCS service agent account of the current GCP project, use the following command:
[gateman@manjaro-x13 chapter-01]$ gcloud storage service-agent
service-912156613264@gs-project-accounts.iam.gserviceaccount.com
It can also be looked up in the following document:
https://cloud.google.com/iam/docs/service-agents
- Grant Read/Edit permission on the subscription to the Storage Transfer Service agent account. Note that this is not the GCS service agent above; they are two different accounts.
resource "google_pubsub_subscription_iam_binding" "subscription_sts_demo_binding" {
subscription = google_pubsub_subscription.subscription_sts_demo.name
role = "roles/editor"
members = ["serviceAccount:${var.sts_sa}"]
}
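To find the Storage Transfer Service agent account of the current project (the value behind var.sts_sa), one option is the Storage Transfer API's googleServiceAccounts.get method; a sketch assuming gcloud credentials are available locally:

curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://storagetransfer.googleapis.com/v1/googleServiceAccounts/$(gcloud config get-value project)"

The returned email has the form project-<PROJECT_NUMBER>@storage-transfer-service.iam.gserviceaccount.com, matching the account shown in the error message further below.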
- Grant read/write permissions on both buckets to the Storage Transfer Service agent account.
Note that the transfer service agent actually needs the storage.buckets.get permission on the src bucket. Object-level roles such as roles/storage.objectUser do not include it, which is why the bindings below grant a bucket-level role (roles/storage.admin); with only objectUser you get the following error:
Error: googleapi: Error 400: Failed to obtain the location of the GCS bucket jason-hsbc-demo-src Additional details: project-912156613264@storage-transfer-service.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., failedPrecondition
resource "google_storage_bucket_iam_binding" "bucket-jason-hsbc-demo-target-binding" {
bucket = google_storage_bucket.bucket-jason-hsbc-demo-target.name
role = "roles/storage.admin"
members = ["serviceAccount:${var.sts_sa}"]
}
resource "google_storage_bucket_iam_binding" "bucket-jason-hsbc-demo-src-binding" {
bucket = google_storage_bucket.bucket-jason-hsbc-demo-src.name
role = "roles/storage.admin"
members = ["serviceAccount:${var.sts_sa}"]
}
Create a storage notification for the src bucket
Make sure the Pub/Sub topic created above is referenced correctly.
// https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_notification.html
// define a bucket notification
resource "google_storage_notification" "notification" {
  bucket         = google_storage_bucket.bucket-jason-hsbc-demo-src.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.topic_sts_demo.id
  event_types    = ["OBJECT_FINALIZE", "OBJECT_METADATA_UPDATE"]

  custom_attributes = {
    new-attribute = "new-attribute-value"
  }

  depends_on = [google_pubsub_topic_iam_binding.topic_sts_demo_binding]
}
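Once applied, the notification config on the src bucket can be double-checked with gsutil:

gsutil notification list gs://jason-hsbc-demo-src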
The final step: create a Storage Transfer streaming job based on the Pub/Sub subscription
Reference: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_transfer_job.html
resource "google_storage_transfer_job" "transfer-job-sts-demo" {
description = "transfer-job-sts-demo"
project = var.project_id
transfer_spec {
transfer_options {
overwrite_objects_already_existing_in_sink = true
overwrite_when = "ALWAYS"
delete_objects_from_source_after_transfer = true
}
gcs_data_source {
bucket_name = google_storage_bucket.bucket-jason-hsbc-demo-src.name
}
gcs_data_sink {
bucket_name = google_storage_bucket.bucket-jason-hsbc-demo-target.name
}
}
event_stream {
name = format("projects/%s/subscriptions/%s", var.project_id, google_pubsub_subscription.subscription_sts_demo.name)
}
depends_on = [google_storage_notification.notification,
google_pubsub_subscription_iam_binding.subscription_sts_demo_binding,
google_storage_bucket_iam_binding.bucket-jason-hsbc-demo-target-binding,
google_storage_bucket_iam_binding.bucket-jason-hsbc-demo-src-binding]
}
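After terraform apply, the streaming job should appear in the project's transfer job list (assuming the gcloud transfer commands are available):

gcloud transfer jobs list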
Testing
First, upload a file to the src bucket:
[gateman@manjaro-x13 chapter-01]$ gsutil cp *csv gs://jason-hsbc-demo-src
Copying file://supermarket_sales.csv [Content-Type=text/csv]...
- [1 files][128.4 KiB/128.4 KiB]
Operation completed over 1 objects/128.4 KiB.
Check the bucket contents:
[gateman@manjaro-x13 chapter-01]$ gsutil ls gs://jason-hsbc-demo-src
gs://jason-hsbc-demo-src/chapter-01-steps.sql
[gateman@manjaro-x13 chapter-01]$ gsutil ls gs://jason-hsbc-demo-target
gs://jason-hsbc-demo-target/Untitled-1.mak
gs://jason-hsbc-demo-target/chapter-01-steps.sql
gs://jason-hsbc-demo-target/supermarket_sales.csv
We can see that the csv file has been transferred to the target bucket (and, because delete_objects_from_source_after_transfer is enabled, removed from the src bucket).
We can also check the transfer job's status in the Cloud Console UI.
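Besides the UI, the job's transfer operations can also be listed from the CLI (JOB_NAME is a placeholder here; the actual name is generated by the service and shown by gcloud transfer jobs list):

gcloud transfer operations list --job-names=JOB_NAME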