Add ability to load into BigQuery from GCP Quickstart Examples (closes
jbeemster committed Nov 28, 2022
1 parent 9f33965 commit 6e135ef
Showing 12 changed files with 400 additions and 85 deletions.
32 changes: 22 additions & 10 deletions terraform/gcp/pipeline/default/README.md
@@ -8,25 +8,32 @@

## Providers

No providers.
| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | ~> 3.90.1 |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_bad_1_topic"></a> [bad\_1\_topic](#module\_bad\_1\_topic) | snowplow-devops/pubsub-topic/google | 0.1.0 |
| <a name="module_bad_rows_topic"></a> [bad\_rows\_topic](#module\_bad\_rows\_topic) | snowplow-devops/pubsub-topic/google | 0.1.0 |
| <a name="module_bigquery_loader"></a> [bigquery\_loader](#module\_bigquery\_loader) | snowplow-devops/bigquery-loader-pubsub-ce/google | 0.1.0 |
| <a name="module_collector_lb"></a> [collector\_lb](#module\_collector\_lb) | snowplow-devops/lb/google | 0.1.0 |
| <a name="module_collector_pubsub"></a> [collector\_pubsub](#module\_collector\_pubsub) | snowplow-devops/collector-pubsub-ce/google | 0.2.2 |
| <a name="module_enrich_pubsub"></a> [enrich\_pubsub](#module\_enrich\_pubsub) | snowplow-devops/enrich-pubsub-ce/google | 0.1.2 |
| <a name="module_enriched_topic"></a> [enriched\_topic](#module\_enriched\_topic) | snowplow-devops/pubsub-topic/google | 0.1.0 |
| <a name="module_pipeline_db"></a> [pipeline\_db](#module\_pipeline\_db) | snowplow-devops/cloud-sql/google | 0.1.1 |
| <a name="module_postgres_db"></a> [postgres\_db](#module\_postgres\_db) | snowplow-devops/cloud-sql/google | 0.1.1 |
| <a name="module_postgres_loader_bad"></a> [postgres\_loader\_bad](#module\_postgres\_loader\_bad) | snowplow-devops/postgres-loader-pubsub-ce/google | 0.2.1 |
| <a name="module_postgres_loader_enriched"></a> [postgres\_loader\_enriched](#module\_postgres\_loader\_enriched) | snowplow-devops/postgres-loader-pubsub-ce/google | 0.2.1 |
| <a name="module_raw_topic"></a> [raw\_topic](#module\_raw\_topic) | snowplow-devops/pubsub-topic/google | 0.1.0 |

## Resources

No resources.
| Name | Type |
|------|------|
| [google_bigquery_dataset.bigquery_db](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset) | resource |
| [google_storage_bucket.bq_loader_dead_letter_bucket](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket) | resource |

## Inputs

@@ -35,17 +42,19 @@ No resources.
| <a name="input_iglu_server_dns_name"></a> [iglu\_server\_dns\_name](#input\_iglu\_server\_dns\_name) | The DNS name of your Iglu Server | `string` | n/a | yes |
| <a name="input_iglu_super_api_key"></a> [iglu\_super\_api\_key](#input\_iglu\_super\_api\_key) | A UUIDv4 string to use as the master API key for Iglu Server management | `string` | n/a | yes |
| <a name="input_network"></a> [network](#input\_network) | The name of the network to deploy within | `string` | n/a | yes |
| <a name="input_pipeline_db_name"></a> [pipeline\_db\_name](#input\_pipeline\_db\_name) | The name of the database to connect to | `string` | n/a | yes |
| <a name="input_pipeline_db_password"></a> [pipeline\_db\_password](#input\_pipeline\_db\_password) | The password to use to connect to the database | `string` | n/a | yes |
| <a name="input_pipeline_db_username"></a> [pipeline\_db\_username](#input\_pipeline\_db\_username) | The username to use to connect to the database | `string` | n/a | yes |
| <a name="input_postgres_db_name"></a> [postgres\_db\_name](#input\_postgres\_db\_name) | The name of the database to connect to | `string` | n/a | yes |
| <a name="input_postgres_db_password"></a> [postgres\_db\_password](#input\_postgres\_db\_password) | The password to use to connect to the database | `string` | n/a | yes |
| <a name="input_postgres_db_username"></a> [postgres\_db\_username](#input\_postgres\_db\_username) | The username to use to connect to the database | `string` | n/a | yes |
| <a name="input_prefix"></a> [prefix](#input\_prefix) | Will be prefixed to all resource names. Use to easily identify the resources created | `string` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The project ID in which the stack is being deployed | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The name of the region to deploy within | `string` | n/a | yes |
| <a name="input_ssh_ip_allowlist"></a> [ssh\_ip\_allowlist](#input\_ssh\_ip\_allowlist) | The list of CIDR ranges to allow SSH traffic from | `list(any)` | n/a | yes |
| <a name="input_subnetwork"></a> [subnetwork](#input\_subnetwork) | The name of the sub-network to deploy within | `string` | n/a | yes |
| <a name="input_bigquery_db_enabled"></a> [bigquery\_db\_enabled](#input\_bigquery\_db\_enabled) | Whether to enable loading into a BigQuery Dataset | `bool` | `false` | no |
| <a name="input_labels"></a> [labels](#input\_labels) | The labels to append to the resources in this module | `map(string)` | `{}` | no |
| <a name="input_pipeline_db_authorized_networks"></a> [pipeline\_db\_authorized\_networks](#input\_pipeline\_db\_authorized\_networks) | The list of CIDR ranges to allow access to the Pipeline Database over | <pre>list(object({<br> name = string<br> value = string<br> }))</pre> | `[]` | no |
| <a name="input_pipeline_db_tier"></a> [pipeline\_db\_tier](#input\_pipeline\_db\_tier) | The instance type to assign to the deployed Cloud SQL instance | `string` | `"db-g1-small"` | no |
| <a name="input_postgres_db_authorized_networks"></a> [postgres\_db\_authorized\_networks](#input\_postgres\_db\_authorized\_networks) | The list of CIDR ranges to allow access to the Pipeline Database over | <pre>list(object({<br> name = string<br> value = string<br> }))</pre> | `[]` | no |
| <a name="input_postgres_db_enabled"></a> [postgres\_db\_enabled](#input\_postgres\_db\_enabled) | Whether to enable loading into a Postgres Database | `bool` | `false` | no |
| <a name="input_postgres_db_tier"></a> [postgres\_db\_tier](#input\_postgres\_db\_tier) | The instance type to assign to the deployed Cloud SQL instance | `string` | `"db-g1-small"` | no |
| <a name="input_ssh_key_pairs"></a> [ssh\_key\_pairs](#input\_ssh\_key\_pairs) | The list of SSH key-pairs to add to the servers | <pre>list(object({<br> user_name = string<br> public_key = string<br> }))</pre> | `[]` | no |
| <a name="input_ssl_information"></a> [ssl\_information](#input\_ssl\_information) | The ID of an Google Managed certificate to bind to the load balancer | <pre>object({<br> enabled = bool<br> certificate_id = string<br> })</pre> | <pre>{<br> "certificate_id": "",<br> "enabled": false<br>}</pre> | no |
| <a name="input_telemetry_enabled"></a> [telemetry\_enabled](#input\_telemetry\_enabled) | Whether or not to send telemetry information back to Snowplow Analytics Ltd | `bool` | `true` | no |
@@ -55,6 +64,9 @@ No resources.

| Name | Description |
|------|-------------|
| <a name="output_bigquery_db_dataset_id"></a> [bigquery\_db\_dataset\_id](#output\_bigquery\_db\_dataset\_id) | The ID of the BigQuery dataset where your data is being streamed |
| <a name="output_bq_loader_bad_rows_topic_name"></a> [bq\_loader\_bad\_rows\_topic\_name](#output\_bq\_loader\_bad\_rows\_topic\_name) | The name of the topic for bad rows emitted from the BigQuery loader |
| <a name="output_bq_loader_dead_letter_bucket_name"></a> [bq\_loader\_dead\_letter\_bucket\_name](#output\_bq\_loader\_dead\_letter\_bucket\_name) | The name of the GCS bucket for dead letter events emitted from the BigQuery loader |
| <a name="output_collector_ip_address"></a> [collector\_ip\_address](#output\_collector\_ip\_address) | The IP address for the Pipeline Collector |
| <a name="output_db_ip_address"></a> [db\_ip\_address](#output\_db\_ip\_address) | The IP address of the database where your data is being streamed |
| <a name="output_db_port"></a> [db\_port](#output\_db\_port) | The port of the database where your data is being streamed |
| <a name="output_postgres_db_ip_address"></a> [postgres\_db\_ip\_address](#output\_postgres\_db\_ip\_address) | The IP address of the database where your data is being streamed |
| <a name="output_postgres_db_port"></a> [postgres\_db\_port](#output\_postgres\_db\_port) | The port of the database where your data is being streamed |
52 changes: 52 additions & 0 deletions terraform/gcp/pipeline/default/bigquery.terraform.tfvars
@@ -0,0 +1,52 @@
# Will be prefixed to all resource names
# Use this to easily identify the resources created and provide entropy for subsequent environments
prefix = "sp"

# The project to deploy the infrastructure into
project_id = "PROJECT_ID_TO_DEPLOY_INTO"

# Where to deploy the infrastructure
region = "REGION_TO_DEPLOY_INTO"

# --- Default Network
# Update to the network you would like to deploy into
#
# Note: If you opt to use your own network then you will need to define a subnetwork to deploy into as well
network = "default"
subnetwork = ""

# --- SSH
# Update this to your IP Address
ssh_ip_allowlist = ["999.999.999.999/32"]
# Generate a new SSH key locally with `ssh-keygen`
# ssh-keygen -t rsa -b 4096
ssh_key_pairs = [
{
user_name = "snowplow"
public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQA0jSi9//bRsHW4M6czodTs6smCXsxZ0gijzth0aBmycE= [email protected]"
}
]

# --- Iglu Server Configuration
# Iglu Server DNS output from the Iglu Server stack
iglu_server_dns_name = "http://CHANGE-TO-MY-IGLU-IP"
# Used for API actions on the Iglu Server
# Change this to the same UUID from when you created the Iglu Server
iglu_super_api_key = "00000000-0000-0000-0000-000000000000"

# --- Snowplow BigQuery Loader
bigquery_db_enabled = true

# See for more information: https://registry.terraform.io/modules/snowplow-devops/collector-pubsub-ce/google/latest#telemetry
# Telemetry principles: https://docs.snowplowanalytics.com/docs/open-source-quick-start/what-is-the-quick-start-for-open-source/telemetry-principles/
user_provided_id = ""
telemetry_enabled = true

# --- SSL Configuration (optional)
ssl_information = {
certificate_id = ""
enabled = false
}

# --- Extra Labels to append to created resources (optional)
labels = {}
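The BigQuery and Postgres loaders are gated by independent flags, so both can be enabled in the same stack. A minimal sketch of a combined tfvars fragment, using placeholder credentials (not part of this commit; change the values before deploying):

```hcl
# Hypothetical fragment enabling both loaders; all values are placeholders.
bigquery_db_enabled = true

postgres_db_enabled  = true
postgres_db_name     = "snowplow"
postgres_db_username = "snowplow"
postgres_db_password = "Hell0W0rld!2" # change this and keep it secret!
```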
101 changes: 84 additions & 17 deletions terraform/gcp/pipeline/default/main.tf
@@ -109,20 +109,22 @@ module "enrich_pubsub" {
}

# 4. Deploy Postgres Loader
module "pipeline_db" {
module "postgres_db" {
source = "snowplow-devops/cloud-sql/google"
version = "0.1.1"

name = "${var.prefix}-pipeline-db"
count = var.postgres_db_enabled ? 1 : 0

name = "${var.prefix}-postgres-db"

region = var.region
db_name = var.pipeline_db_name
db_username = var.pipeline_db_username
db_password = var.pipeline_db_password
db_name = var.postgres_db_name
db_username = var.postgres_db_username
db_password = var.postgres_db_password

authorized_networks = var.pipeline_db_authorized_networks
authorized_networks = var.postgres_db_authorized_networks

tier = var.pipeline_db_tier
tier = var.postgres_db_tier

labels = var.labels
}
@@ -131,6 +133,8 @@ module "postgres_loader_enriched" {
source = "snowplow-devops/postgres-loader-pubsub-ce/google"
version = "0.2.1"

count = var.postgres_db_enabled ? 1 : 0

name = "${var.prefix}-pg-loader-enriched-server"

network = var.network
@@ -145,11 +149,11 @@ module "postgres_loader_enriched" {
purpose = "ENRICHED_EVENTS"
schema_name = "atomic"

db_instance_name = module.pipeline_db.connection_name
db_port = module.pipeline_db.port
db_name = var.pipeline_db_name
db_username = var.pipeline_db_username
db_password = var.pipeline_db_password
db_instance_name = join("", module.postgres_db.*.connection_name)
db_port = join("", module.postgres_db.*.port)
db_name = var.postgres_db_name
db_username = var.postgres_db_username
db_password = var.postgres_db_password

# Linking in the custom Iglu Server here
custom_iglu_resolvers = local.custom_iglu_resolvers
@@ -164,6 +168,8 @@ module "postgres_loader_bad" {
source = "snowplow-devops/postgres-loader-pubsub-ce/google"
version = "0.2.1"

count = var.postgres_db_enabled ? 1 : 0

name = "${var.prefix}-pg-loader-bad-server"

network = var.network
@@ -178,11 +184,72 @@ module "postgres_loader_bad" {
purpose = "JSON"
schema_name = "atomic_bad"

db_instance_name = module.pipeline_db.connection_name
db_port = module.pipeline_db.port
db_name = var.pipeline_db_name
db_username = var.pipeline_db_username
db_password = var.pipeline_db_password
db_instance_name = join("", module.postgres_db.*.connection_name)
db_port = join("", module.postgres_db.*.port)
db_name = var.postgres_db_name
db_username = var.postgres_db_username
db_password = var.postgres_db_password

# Linking in the custom Iglu Server here
custom_iglu_resolvers = local.custom_iglu_resolvers

telemetry_enabled = var.telemetry_enabled
user_provided_id = var.user_provided_id

labels = var.labels
}
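Each optional module above uses `count = var.postgres_db_enabled ? 1 : 0`, which turns the module into a list of zero or one instances. The `join("", module.postgres_db.*.attribute)` idiom collapses that list to a plain string: the real value when the module exists, an empty string when it is disabled. A minimal sketch of the same pattern (the local names are illustrative and not part of this commit):

```hcl
locals {
  # "" when postgres_db_enabled = false; the Cloud SQL values otherwise
  postgres_connection_name = join("", module.postgres_db.*.connection_name)
  postgres_port            = join("", module.postgres_db.*.port)
}
```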

# 5. Deploy BigQuery Loader
module "bad_rows_topic" {
source = "snowplow-devops/pubsub-topic/google"
version = "0.1.0"

count = var.bigquery_db_enabled ? 1 : 0

name = "${var.prefix}-bq-bad-rows-topic"

labels = var.labels
}

resource "google_bigquery_dataset" "bigquery_db" {
count = var.bigquery_db_enabled ? 1 : 0

dataset_id = replace("${var.prefix}_pipeline_db", "-", "_")
location = var.region

labels = var.labels
}
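BigQuery dataset IDs may contain only letters, numbers, and underscores, so the `replace()` call swaps any hyphens from the prefix for underscores before they reach the dataset ID. A quick illustration (the `sp-prod` prefix is a made-up example):

```hcl
locals {
  # replace("sp-prod_pipeline_db", "-", "_") => "sp_prod_pipeline_db"
  example_dataset_id = replace("sp-prod_pipeline_db", "-", "_")
}
```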

resource "google_storage_bucket" "bq_loader_dead_letter_bucket" {
count = var.bigquery_db_enabled ? 1 : 0

name = "${var.prefix}-bq-loader-dead-letter"
location = var.region
force_destroy = true

labels = var.labels
}

module "bigquery_loader" {
source = "snowplow-devops/bigquery-loader-pubsub-ce/google"
version = "0.1.0"

count = var.bigquery_db_enabled ? 1 : 0

name = "${var.prefix}-bq-loader-server"

network = var.network
subnetwork = var.subnetwork
region = var.region
project_id = var.project_id

ssh_ip_allowlist = var.ssh_ip_allowlist
ssh_key_pairs = var.ssh_key_pairs

input_topic_name = module.enriched_topic.name
bad_rows_topic_name = join("", module.bad_rows_topic.*.name)
gcs_dead_letter_bucket_name = join("", google_storage_bucket.bq_loader_dead_letter_bucket.*.name)
bigquery_dataset_id = join("", google_bigquery_dataset.bigquery_db.*.dataset_id)

# Linking in the custom Iglu Server here
custom_iglu_resolvers = local.custom_iglu_resolvers
23 changes: 19 additions & 4 deletions terraform/gcp/pipeline/default/outputs.tf
@@ -3,12 +3,27 @@ output "collector_ip_address" {
value = module.collector_lb.ip_address
}

output "db_ip_address" {
output "postgres_db_ip_address" {
description = "The IP address of the database where your data is being streamed"
value = module.pipeline_db.first_ip_address
value = join("", module.postgres_db.*.first_ip_address)
}

output "db_port" {
output "postgres_db_port" {
description = "The port of the database where your data is being streamed"
value = module.pipeline_db.port
value = join("", module.postgres_db.*.port)
}

output "bigquery_db_dataset_id" {
description = "The ID of the BigQuery dataset where your data is being streamed"
value = join("", google_bigquery_dataset.bigquery_db.*.dataset_id)
}

output "bq_loader_dead_letter_bucket_name" {
description = "The name of the GCS bucket for dead letter events emitted from the BigQuery loader"
value = join("", google_storage_bucket.bq_loader_dead_letter_bucket.*.name)
}

output "bq_loader_bad_rows_topic_name" {
description = "The name of the topic for bad rows emitted from the BigQuery loader"
value = join("", module.bad_rows_topic.*.name)
}
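Because the underlying resources are conditional, the new outputs also use the splat-plus-`join` idiom and resolve to empty strings when the corresponding loader is disabled. If another configuration needs these values, one option is the `terraform_remote_state` data source; a sketch assuming a local state file at a hypothetical path:

```hcl
data "terraform_remote_state" "pipeline" {
  backend = "local"

  config = {
    path = "../pipeline/default/terraform.tfstate" # hypothetical path
  }
}

locals {
  bigquery_dataset_id = data.terraform_remote_state.pipeline.outputs.bigquery_db_dataset_id
}
```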
@@ -35,17 +35,19 @@ iglu_server_dns_name = "http://CHANGE-TO-MY-IGLU-IP"
iglu_super_api_key = "00000000-0000-0000-0000-000000000000"

# --- Snowplow Postgres Loader
pipeline_db_name = "snowplow"
pipeline_db_username = "snowplow"
postgres_db_enabled = true

postgres_db_name = "snowplow"
postgres_db_username = "snowplow"
# Change and keep this secret!
pipeline_db_password = "Hell0W0rld!2"
postgres_db_password = "Hell0W0rld!2"
# IP ranges that you want to query the Pipeline Postgres Cloud SQL instance from directly over the internet. An alternative access method is to leverage
# the Cloud SQL Proxy service which creates an IAM authenticated tunnel to the instance
#
# Details: https://cloud.google.com/sql/docs/postgres/sql-proxy
#
# Note: this exposes your data to the internet - take care to ensure your allowlist is strict enough
pipeline_db_authorized_networks = [
postgres_db_authorized_networks = [
{
name = "foo"
value = "999.999.999.999/32"
@@ -57,7 +59,7 @@ pipeline_db_authorized_networks = [
]
# Note: the size of the database instance determines the number of concurrent connections - each Postgres Loader instance creates 10 open connections, so a
# sufficiently powerful database tier is important to avoid running out of connection slots
pipeline_db_tier = "db-g1-small"
postgres_db_tier = "db-g1-small"

# See for more information: https://registry.terraform.io/modules/snowplow-devops/collector-pubsub-ce/google/latest#telemetry
# Telemetry principles: https://docs.snowplowanalytics.com/docs/open-source-quick-start/what-is-the-quick-start-for-open-source/telemetry-principles/
22 changes: 17 additions & 5 deletions terraform/gcp/pipeline/default/variables.tf
Original file line number Diff line number Diff line change
@@ -48,23 +48,29 @@ variable "iglu_super_api_key" {
sensitive = true
}

variable "pipeline_db_name" {
variable "postgres_db_enabled" {
description = "Whether to enable loading into a Postgres Database"
default = false
type = bool
}

variable "postgres_db_name" {
description = "The name of the database to connect to"
type = string
}

variable "pipeline_db_username" {
variable "postgres_db_username" {
description = "The username to use to connect to the database"
type = string
}

variable "pipeline_db_password" {
variable "postgres_db_password" {
description = "The password to use to connect to the database"
type = string
sensitive = true
}

variable "pipeline_db_authorized_networks" {
variable "postgres_db_authorized_networks" {
description = "The list of CIDR ranges to allow access to the Pipeline Database over"
default = []
type = list(object({
@@ -73,12 +79,18 @@ variable "pipeline_db_authorized_networks" {
}))
}

variable "pipeline_db_tier" {
variable "postgres_db_tier" {
description = "The instance type to assign to the deployed Cloud SQL instance"
type = string
default = "db-g1-small"
}

variable "bigquery_db_enabled" {
description = "Whether to enable loading into a BigQuery Dataset"
default = false
type = bool
}

variable "telemetry_enabled" {
description = "Whether or not to send telemetry information back to Snowplow Analytics Ltd"
type = bool