A simple illustration of implementing data streaming and analytics using Snowflake. I created this repository because I couldn't find many resources online that walk through this workflow using Snowflake's new UI, Snowsight.
In simple terms, Snowflake is a cloud data warehousing solution that provides compute resources to run SQL commands against databases, along with managed storage for the underlying data.
- Option to choose the cloud provider instead of tightly binding to one of them (AWS/GCP/Azure)
- Provides an efficient serverless architecture
- Query results are cached for 24 hours, saving compute time and resources when the same query is re-run
- Time Travel feature to revert data to a previous point in time or query ID (1 day on Standard accounts, up to 90 days with the Enterprise edition); see the example after this list
- Decouples Compute and Storage and saves costs with columnar compression and other techniques.
- Scales efficiently, unlike solutions such as Redshift, which scales by adding compute nodes charged per hour.
- Doesn't charge for data ingress and metadata storage.
- Provides in-house visualization tools to understand the results while querying.
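A minimal sketch of the Time Travel feature, assuming the TESLA_DATA table created later in this guide and a placeholder query ID:

-- Query the table as it existed 30 minutes ago (offset is in seconds)
SELECT * FROM TESLA_DATA AT (OFFSET => -60*30);
-- Query the table as it existed just before a specific statement ran
SELECT * FROM TESLA_DATA BEFORE (STATEMENT => '<query_id>');
-- Restore a dropped table within the retention period
UNDROP TABLE TESLA_DATA;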
- Standard data extraction and cleansing should be performed before loading to get reliable results. Follow Snowflake's Data Preparation Considerations documentation before proceeding to the next step.
- A stage is an intermediate space to upload files from either local storage or the cloud, so the data can be stored and loaded into tables whenever needed.
- The data is ingested into the database using the COPY command with the help of several file format attributes.
- Create a virtual warehouse (compute) to run the queries. Snowflake offers multiple warehouse sizes depending on the data load, and a warehouse can be scaled at any time (see the sketch after this list).
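For example (a hedged sketch, using the my_warehouse warehouse created in the setup steps below), resizing or suspending a warehouse is a single statement:

ALTER WAREHOUSE my_warehouse SET WAREHOUSE_SIZE = 'SMALL';  -- scale up from X-Small on demand
ALTER WAREHOUSE my_warehouse SUSPEND;                       -- stop consuming credits while idle
ALTER WAREHOUSE my_warehouse RESUME;                        -- resume before the next query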
Load Tesla stock data (TSLA.csv) from cloud storage and perform analytics using queries run on Snowflake's warehouse
- AWS S3: To store processed data
- Snowflake: Data warehouse
- Snowpipe: To automate data loading
- Amazon QuickSight: For data analytics
- Basic knowledge of SQL
- An AWS account and basic knowledge of cloud concepts
1. Sign up for a Snowflake account.
Choose your cloud provider (AWS in this project). Here is a quick YouTube video to understand Snowflake's features.
Create a simple warehouse using one of the following:
Snowflake Web Interface: Navigate to Admin -> Warehouses, and create a warehouse using the default settings.
SnowSQL:
CREATE WAREHOUSE IF NOT EXISTS my_warehouse
WAREHOUSE_SIZE = 'X-SMALL'
WAREHOUSE_TYPE = 'STANDARD'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
USE WAREHOUSE my_warehouse;
CREATE DATABASE test_db;
USE DATABASE test_db;
CREATE TABLE TESLA_DATA (
Date date,
Open_value double,
High_value double,
Low_value double,
Close_value double,
Adj_Close double,
volume bigint
);
Store the provided TSLA.csv in an S3 bucket.
- Stage the files using the command below, replacing AWS_KEY_ID and AWS_SECRET_KEY with your credentials:
CREATE OR REPLACE STAGE BULK_COPY_TESLA_STAGE URL='s3://<bucketname>/TSLA.csv'
CREDENTIALS=(AWS_KEY_ID='awskeyid' AWS_SECRET_KEY='awssecretkey');
- Staging using a Storage Integration:
It is recommended to create an IAM role with S3 access policies and connect it to Snowflake through a Storage Integration that references the role's ARN, so sensitive AWS credentials (AWS_KEY_ID, AWS_SECRET_KEY) are not exposed in your scripts.
CREATE or replace STORAGE INTEGRATION S3_INTEGRATION
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
STORAGE_AWS_ROLE_ARN = '<AWS-ROLE-ARN>'
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('s3://<bucketname>/');
Run the command DESC INTEGRATION S3_INTEGRATION; to find the STORAGE_AWS_IAM_USER_ARN (and STORAGE_AWS_EXTERNAL_ID) values, and add them to the trust relationship of your AWS role.
Now stage the data using the command:
CREATE or REPLACE STAGE TESLA_DATA_STAGE
URL='s3://<bucketname>/TSLA.csv'
STORAGE_INTEGRATION = S3_INTEGRATION
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
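As an optional sanity check (a suggestion, not part of the original steps), you can list what the stage can see before copying:

LIST @TESLA_DATA_STAGE;  -- should show TSLA.csv with its size and MD5 hash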
Now that the data is staged in TESLA_DATA_STAGE, use the commands below to copy the data into the table and query it:
COPY INTO TESLA_DATA
FROM @TESLA_DATA_STAGE
PATTERN='.*.csv';
SELECT * FROM TESLA_DATA;
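If the load fails or the row counts look off, one optional way to debug (an extra suggestion, not part of the original flow) is to dry-run the copy and return parse errors without loading anything:

COPY INTO TESLA_DATA
FROM @TESLA_DATA_STAGE
PATTERN='.*.csv'
VALIDATION_MODE = RETURN_ERRORS;  -- reports problem rows instead of loading them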
To automate data loading with Snowpipe, wrap the same COPY command in a pipe definition:
create or replace pipe tesla_pipe auto_ingest=true as
COPY INTO TESLA_DATA
FROM @TESLA_DATA_STAGE
PATTERN='.*.csv';
show pipes;
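Two optional follow-up commands (a hedged aside, using the tesla_pipe created above): check the pipe's status, and backfill files that were already in the bucket before the pipe existed:

SELECT SYSTEM$PIPE_STATUS('tesla_pipe');  -- JSON with executionState and pendingFileCount
ALTER PIPE tesla_pipe REFRESH;            -- queue files already present in the stage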
- Copy the notification_channel ARN from the show pipes output into your bucket's event notification. You can do this using the AWS UI by navigating to S3 -> <bucket_name> -> Properties -> Event Notifications and entering it as the SQS queue ARN under the Destination section.
- Make sure to select the required event types (object create events) while setting up the notification.
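Once a new file lands in the bucket, you can verify that Snowpipe picked it up (a suggested check, assuming the TESLA_DATA table above) by querying the load history:

SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'TESLA_DATA',
    START_TIME => DATEADD(hours, -1, CURRENT_TIMESTAMP())));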
- Navigate to your QuickSight dashboard. Click on New dataset -> Snowflake.
- Enter your account URL in the Database server section. You can find it by clicking on the account name at the bottom left of Snowflake's Web Interface.
- Enter the database details in the remaining prompts and you should have the data ready for analytics on QuickSight.
Other Insightful Information:
- To use SnowSQL from the terminal without the hassle of logging in with credentials every time, you can add account parameters to the config file inside ~/.snowsql. Set the accountname parameter to the account identifier found at the bottom left of Snowflake's Web Interface; you'll have to replace '.' with '-' to separate the organization and account name. Modify and enter the username and password parameters as well.
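A minimal sketch of such a config (~/.snowsql/config), with placeholder values to replace with your own:

[connections]
# account identifier: organization and account name separated by '-'
accountname = <orgname>-<account_name>
username = <your_username>
password = <your_password>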
- A file format can be created separately in SnowSQL to enhance modularity and decrease code duplication. It can be created as follows:
CREATE OR REPLACE FILE FORMAT CSV_FORMAT
-- Add parameters such as:
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1;
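For instance (a sketch reusing the stage, integration, and table names from this guide), the named file format can then be referenced instead of repeating its attributes:

-- Reference the named file format when (re)creating the stage
CREATE OR REPLACE STAGE TESLA_DATA_STAGE
URL='s3://<bucketname>/TSLA.csv'
STORAGE_INTEGRATION = S3_INTEGRATION
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT');

-- Or reference it directly in the COPY command
COPY INTO TESLA_DATA
FROM @TESLA_DATA_STAGE
FILE_FORMAT = (FORMAT_NAME = 'CSV_FORMAT');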