This sample demonstrates how events can be processed in a serverless streaming pipeline, using observability and load testing to provide insight into the pipeline's performance.
- Solution Overview
- How to use the sample
- Well Architected Framework (WAF)
- Key Learnings
- 1. Observability is a critical prerequisite to achieving performance
- 2. Load testing needs representative ratios to test pipeline processing
- 3. Validate test data early
- 4. Have a CI/CD pipeline
- 5. Secure and centralize configuration
- 6. Make your streaming pipelines replayable and idempotent
- 7. Ensure data transformation code is testable
- Azure Function logic
- EventHub & Azure Function scaling
- Infrastructure as Code
- Observability
- Load testing
The solution receives a stream of readings from IoT devices within buildings. These IoT devices send values for different sensors and events (e.g. temperature readings, movement sensors, door triggers) to indicate the state of a room and building. This streaming pipeline demonstrates how serverless functions can be utilised to filter, process, and split the stream. As this is a stream processing platform, performance is critical to ensure that processing does not 'fall behind' the rate at which new events arrive. Observability is implemented to give visibility into how the system is performing, and load testing is combined with observability to indicate system capability under load. The entire infrastructure deployment is orchestrated via Terraform.
The following shows the overall architecture of the solution.
Message payload
Field name | Description |
---|---|
deviceId | Device ID |
temperature | Temperature sensor reading |
time | Timestamp of the reading |
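A payload following this schema might look like the following (the field values are illustrative):

```json
{
  "deviceId": 42,
  "temperature": 21.5,
  "time": "2021-06-01T12:00:00Z"
}
```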
It makes use of the following Azure services and tools:
- Azure Event Hubs
- Azure Functions
- Azure IoT Device Telemetry Simulator
- Azure DevOps
- Application Insights
- Terraform
IMPORTANT NOTE: As with all Azure deployments, this will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for the list of deployed resources.
There are 3 major steps to running the sample. Follow each sub-page in order:
- Deploy the infrastructure with Terraform
- Deploy the Azure Function logic
- Run the load testing script
- Accounts
- Github account [Optional]
- Azure Account
- Permissions needed: the ability to create and deploy to an Azure resource group, create a service principal, and grant the Contributor role to the service principal over the resource group.
- Azure DevOps Project
- Permissions needed: ability to create service connections, pipelines and variable groups.
- Software
After a successful deployment, you should have the following resources:
- Resource group 1 (Terraform state & secrets storage)
- Azure Key Vault
- Storage account
- Resource group 2 (Temperature Events sample)
- Application Insights
- Azure Key Vault
- Event Hub x4
- Azure Function x2
The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a workload. The framework consists of five pillars of architecture excellence: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security. This sample has been built while considering each of the pillars.
- Scalable costs + Pay for consumption:
- The technologies used in the sample are able to start small and then scale up as load increases, reducing costs when volumes are low.
- Event Hubs:
- They have been placed together into a single Event Hubs namespace to pool resources.
- At higher loads, when much higher TUs (throughput units) are required, they can be moved into individual namespaces to reduce TU contention, or moved to Azure Event Hubs Dedicated.
- Azure Functions:
- Defined as "Serverless" to reduce cost when developing and experimenting at low data volumes.
- When moving to higher volume "production loads", the App Service Plan can be changed to using dedicated instances.
- See Azure Docs - Azure App Service plan overview for more information.
Covers the operations processes that keep an application running in production.
- Observability:
- The sample connects monitoring tools to give insights such as distributed tracing and application maps, allowing insight into how the streaming pipeline is performing.
- See the section on Observability
- Infrastructure Provisioning:
- IaC (Infrastructure as Code) is used here to provision and deploy the infrastructure to Azure. A Terraform template is used to orchestrate the creation of resources.
- Code Deployment:
- Deployment pipelines are defined to push application logic to the Azure Function instances.
- Load testing:
- The load testing process here is defined as a script that can be run in an automated and repeatable way.
- See the section on Load Testing
- Performance Limitations
- Scaling
- Bottlenecks
- TODO: limits from partitions or TPUs?
- Optimising Azure Functions & Event Hubs batch sizes for throughput
- TODO: AMQP vs HTTPS on EventHubs?
- Peak Load Testing
- TODO: Peak loads for the testing tool
- TODO: Peak loads for the platform?
- Resiliency in Architecture
- The key resiliency aspects are created through a queue-based architecture. For queues, the architecture leverages Azure Event Hubs, which is highly reliable, with high availability and geo-redundancy.
- Reliability allowance for scalability and performance
- Azure Event Hubs and Azure Functions scale well. However, for optimal performance they can (and should) be tuned to your platform's behaviour.
- Event Hubs Scaling
- Secrets, connection strings, and other environmental variables are stored in Azure Key Vault. This prevents accidental leaking or committing secrets into a codebase. Azure docs - About Azure Key Vault secrets
- Access to Event Hubs is restricted via SAS tokens. Only services that hold the Shared Access Signature can interact with them. Azure docs - Authenticate access to Event Hubs resources using shared access signatures (SAS)
- The access security can be enhanced further by switching to using managed identities. Authenticate with a managed identity to access Event Hubs Resources
- Access to the Azure portal, Application Insights, etc. is restricted via native access policies built into Azure.
The following summarizes key learnings and best practices demonstrated by this sample solution:
- A system that cannot be measured, cannot be tuned for performance.
- A prerequisite for any system that prioritises performance is a working observability system that can show key performance indicators.
- In streaming pipelines, the number of messages at each stage will vary. As messages are filtered and split, the messages going to each Event Hub output will be different to the ingestion Event Hub.
- In order to put each section under the expected production load, the load testing data needs to be crafted with representative ratios to exercise each section at the correct weight.
- If load testing data is not representative, an expensive section may be under-tested, which would falsely indicate that it would perform at production loads.
- It is important to do capacity planning on each stage of the pipeline. In this sample, it is expected that 50% of the devices will be filtered out by the first Azure Function (as they are not in the subset that we are interested in). It is expected that 20% of all temperature sensors will have values that need investigating and will go to the "bad temperature" Event Hub.
- When initially testing, the crafted test data was not created with the correct ratios. It resulted in no filtering, and every stage being hit at 100% instead of the smaller ratios expected deeper in the pipeline. This led to an initial over-provisioning when testing and planning.
- Validating our test data earlier would have prevented this oversight.
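The documented ratios (50% of devices filtered out, 20% out-of-bounds temperatures) can be baked into the generated test data. A minimal sketch, assuming the ID and temperature ranges used here are illustrative:

```python
import random

def make_test_message(rng: random.Random) -> dict:
    """Generate one simulated reading matching the documented ratios:
    ~50% of device IDs are >= 1000 (filtered out by the first Function),
    and ~20% of temperatures are >= 100 (routed to the 'bad' Event Hub)."""
    device_id = rng.randrange(0, 2000)           # half the IDs fall below 1000
    if rng.random() < 0.2:
        temperature = rng.uniform(100.0, 150.0)  # "bad" out-of-bounds reading
    else:
        temperature = rng.uniform(15.0, 99.0)    # normal reading
    return {"deviceId": device_id, "temperature": temperature}

rng = random.Random(7)
sample = [make_test_message(rng) for _ in range(10_000)]
kept = sum(m["deviceId"] < 1000 for m in sample) / len(sample)
bad = sum(m["temperature"] >= 100 for m in sample) / len(sample)
print(f"kept: {kept:.0%}, bad: {bad:.0%}")  # roughly 50% and 20%
```

Asserting on these aggregate ratios before a test run is a cheap way to catch unrepresentative data early.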
- This means including all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc), reference/application data, data pipeline definitions, and data validation and transformation logic.
- There should also be a safe, repeatable process to move changes through dev, test and finally production.
- Maintain a central, secure location for sensitive configuration, such as database connection strings, that can be accessed by the appropriate services within the specific environment.
- An example of this is securing secrets in Key Vault per environment, then having the relevant services query Key Vault for the configuration.
- Messages could be processed more than once. This can be due to failures & retries, or multiple workers processing a message.
- Idempotency ensures side effects are mitigated when messages are replayed in your pipelines.
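A common way to achieve idempotency is to record processed message IDs and skip duplicates. A minimal in-memory sketch (a real pipeline would use a durable store such as a database or cache; the message shape is illustrative):

```python
class IdempotentProcessor:
    """Skips messages whose ID has already been processed, so replays
    and redeliveries do not repeat side effects."""

    def __init__(self):
        self._seen: set[str] = set()    # would be a durable store in production
        self.processed: list[dict] = []

    def handle(self, message: dict) -> bool:
        msg_id = message["id"]
        if msg_id in self._seen:
            return False                # duplicate: side effect suppressed
        self._seen.add(msg_id)
        self.processed.append(message)  # the "side effect"
        return True

proc = IdempotentProcessor()
proc.handle({"id": "a1", "temperature": 21.0})
proc.handle({"id": "a1", "temperature": 21.0})  # redelivered duplicate
print(len(proc.processed))  # 1
```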
- Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages.
- While it is possible to run tests against notebooks, by shifting tests left you increase developer productivity by increasing the speed of the feedback cycle.
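Separating transformation logic from data access can be as simple as keeping the transformation a pure function. A sketch, with illustrative field names:

```python
def flag_out_of_bounds(reading: dict, threshold: float = 100.0) -> dict:
    """Pure transformation: no Event Hub or database access, so it can be
    unit-tested in milliseconds with no infrastructure deployed."""
    return {**reading, "outOfBounds": reading["temperature"] >= threshold}

# Unit tests run locally, with no Azure resources.
assert flag_out_of_bounds({"deviceId": 1, "temperature": 120.0})["outOfBounds"]
assert not flag_out_of_bounds({"deviceId": 2, "temperature": 21.5})["outOfBounds"]
```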
The logic of the Azure Functions is kept simple to demonstrate the end-to-end pattern, rather than complex logic within a Function.
All temperature sensors send their data to the evh-ingest
Event Hub. Different pipelines can then consume and filter down to the subset that they want to focus on. In this pipeline we filter out all DeviceIds of 1,000 and above.
// pseudocode
if DeviceId < 1000
    forward to evh-filteredevices
else // DeviceId >= 1000
    ignore
Splits the feed based on the temperature value. Any value of 100°C and over is too hot and should be actioned.
// pseudocode
if Temperature < 100
    forward to evh-analytics
else // Temperature >= 100: it is too hot!
    forward to evh-outofboundstemperature
- Scaling Out Azure Functions With Event Hubs Effectively | by Masayuki Ota
- Scaling Out Azure Functions With Event Hubs Effectively 2 | by Masayuki Ota
We provision a Resource Group, Azure Key Vault, Azure Functions, Application Insights, and Azure Event Hubs using Terraform.
To decouple the components of the application, we modularized each component instead of keeping everything in main.tf. We referred to the official Terraform site and to Cobalt, an artifact from CSE. By modularizing, we can test each module as a unit. For example, to provision Azure Functions, we need to provision both a storage account and the function app.
We need different environments managed by Terraform, but they often have different permission scopes and possibly different resources to provision. For example, you might have TSI in dev but not in the staging environment. We therefore separated environments by file layout.
Additional articles:
- Observability on Event Hubs. Overview | by Akira Kakkar
- Distributed Tracing Deep Dive for Eventhub Triggered Azure Function in App Insights | by Shervyna Ruan
- Calculating End-to-End Latency for Eventhub Triggered Azure functions with App Insights | by Shervyna Ruan
Utilizing an Azure dashboard allows you to lay out the metrics of your resources, letting you quickly spot scaling and throughput issues.
To create your own:
- On the Azure portal, go to the dashboard.
- Add 4 "Metrics" tiles, one for each Event Hub.
  - For each, click "Edit in metrics". Select the Event Hub name, then add "Incoming messages", "Outgoing messages" and "Throttled requests". Set the aggregation for each to "Sum".
- Add 2 "Metrics" tiles, one for each Azure Function.
  - Add the metric, and select "Function Execution Count".
- Add an "Application Map" (Application Insights) tile to see an end-to-end map of messages flowing through your infrastructure.
- Add Markdown widgets to add explanations for others viewing the dashboard.
Azure Event Hubs and Azure Functions offer built-in integration with Application Insights. With Application Insights, we were able to gain insights to further improve our system using various features and metrics that are available without extra configuration, such as Application Map and End-to-End Transaction details.
Application Map shows how the components in a system interact with each other. In addition, it shows useful information such as the average duration of each transaction, the number of scaled instances, and so on. From the Application Map, we could easily tell that our calls to the database were having problems, which became a bottleneck in our system. By clicking directly on the arrows, you can drill into metrics and logs for those problematic transactions.
From Live Metrics, you can see how many instances Azure Functions has scaled to in real time.
However, the number of function instances is not available from any default metric at this point, besides checking the number in real time. If you are interested in the number of scaled instances over a past period, you can query the logs in Log Analytics (within Application Insights) using a Kusto query. For example:
traces
| where ... // your logic to filter the logs
| summarize dcount(cloud_RoleInstance) by bin(timestamp, 5m)
| render columnchart
Distributed Tracing is one of the key reasons that Application Insights is able to offer useful features such as the application map. In our system, events flow through several Azure Functions that are connected by Event Hubs. Even so, Application Insights is able to collect this correlation automatically. The correlation context is carried in the event payload and passed to downstream services. You can easily see this correlation visualized in either Application Maps or End-to-End Transaction Details:
Here is an article that explains in detail how distributed tracing works for Eventhub-triggered Azure Functions: Distributed Tracing Deep Dive for Event Hub Triggered Azure Function in Application Insights
End-to-End Transaction Details comes with a visualization of each component's order and duration. You can also check the telemetry (traces, exceptions, etc.) of each component from this view, which makes it easy to troubleshoot visually across components within the same transaction when an issue occurs.
As soon as we drilled into the problematic transactions, we realized that our outgoing calls (one per event) were waiting for one another to finish before making the next call, which is not very efficient. As an improvement, we changed the logic to make the outgoing calls in parallel, which resulted in better performance. Thanks to the visualization, we were able to easily troubleshoot and gain insight into how to further improve our system.
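The sequential-versus-parallel improvement described above can be sketched with asyncio; the outgoing call is simulated with a sleep, and the function names are illustrative rather than taken from the sample:

```python
import asyncio
import time

async def outgoing_call(event: int) -> int:
    await asyncio.sleep(0.1)  # stand-in for a network call per event
    return event

async def sequential(events):
    # Each call waits for the previous one: ~0.1s * len(events) total.
    return [await outgoing_call(e) for e in events]

async def parallel(events):
    # All calls run concurrently: ~0.1s total regardless of count.
    return await asyncio.gather(*(outgoing_call(e) for e in events))

start = time.perf_counter()
results = asyncio.run(parallel(range(10)))
elapsed = time.perf_counter() - start
print(f"{len(results)} calls in {elapsed:.2f}s")
```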
The load testing script allows you to quickly generate load against your infrastructure. You can then use the observability tooling to drill in and see how the architecture responds: Azure Functions scaling out, Event Hub queue lengths growing and then stabilising, etc. It will help you find bottlenecks in any business logic you have added to the Azure Functions, and confirm whether your Azure architecture is scaling as expected.
Getting started resources:
- Azure IoT Device Telemetry Simulator
- Azure Well Architected Framework - Performance testing
- Load test for real time data processing | by Masayuki Ota
The load testing can be invoked by the IoTSimulator.ps1 script. This can be run locally via an authenticated Azure CLI session, or in an Azure DevOps pipeline using the load testing azure-pipeline.yml pipeline definition.
The script will orchestrate the test by automating these steps:
- Create or use an existing Resource Group
- Create an Azure Container Instances resource.
- Deploy the Azure IoT Device Telemetry Simulator container, and scale it to the number of instances passed in as a parameter.
- Coordinate the instances to simulate the specified number of devices, sending a set number of messages.
- Wait for the test to complete.
- Remove the Azure Container Instances resource.
Test message payload
Field name | Description |
---|---|
deviceId | Simulated device ID |
temperature | Random temperature, in a ratio of good/bad readings |
time | timestamp of value |