Understanding your application performance with custom metrics on the new workload profiles in Azure Container Apps

Apr 5, 2023

Do you want to run a serverless container workload in Azure Container Apps but still need to understand what is happening inside the container, perhaps even in near real time, by viewing custom metrics from Azure Monitor in Grafana? Here is what you need to know:

The existing Azure Container Apps service offers no way to get runtime metrics, custom metrics or performance data out of the container and into a metric store like Azure Monitor (you can track the progress of that feature request here). However, you can run a telegraf sidecar inside a Container App, export any metrics you want and forward them to Azure Monitor so that you can view them in Azure Managed Grafana.

This is especially relevant for the new workload profile concept in Azure Container Apps, which effectively allows you to run multiple containers on a single host while sharing the total memory and CPU resources of that machine.

How to get runtime and custom metrics from inside your container into your dashboards

You have come to the right place to learn more👇

There are obviously lots of different ways to export metrics from your runtime, your code or your containers. For your own applications you can import an SDK, in Azure for example the Azure Application Insights SDK, to push runtime and custom metrics to the Azure service via instrumentation, but this will not work for existing containers that you might want to integrate as well. When you use containers as an abstraction layer and cannot or will not change them to match the underlying hosting platform, you probably want a different option.

There is an established cloud native standard of using the Prometheus data model to publish time series of timestamped values in a very general data format and then query them using PromQL (or in our case Kusto) to understand what is happening in your application. I want to show you how this approach can be leveraged for your applications in Azure Container Apps.

To avoid writing custom code I recommend telegraf, which can scrape or ingest metric data using the Prometheus input plugin. Luckily there is also an output plugin for Azure Monitor that can target the Azure Monitor managed service. In the following I will show you how to glue these together to get insightful performance data like the following from your Quarkus application into Azure Managed Grafana:

Here you can track the status of all jobs that have been submitted to the job engine

For my little sample application I am using Quarkus because it lets you build Java-based containers that start quickly and has an ecosystem that is really fun to work with. To keep it simple I am using Micrometer to publish metrics from my Quarkus application on a dedicated endpoint. Some simple guidance on how to do this for your own application can be found here. For Quarkus all you need to do is reference the dependency in your pom.xml:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-micrometer-registry-prometheus</artifactId>
</dependency>

The application is a scale-out job engine that processes requests from an EventHub: each replica works exclusively on one message for some time, then unlocks itself and processes the next message. Here I am using Dapr to avoid writing code for EventHub; instead Dapr posts each message it receives from its EventHub subscription to an endpoint of the application. From the outside world I would like to observe the CPU/memory performance metrics of the running process and the status of each job in Grafana.

    @Inject
    MeterRegistry registry;

    @POST
    @Produces(MediaType.TEXT_PLAIN)
    public Response receive(CloudEvent<HashMap> event) {
        logger.info(String.format("Consumed contenttype: %s", event.getDatacontenttype()));
        logger.info(String.format("Consumed event id: %s", event.getId()));

        // Count every accepted job; the "name" tag becomes a label on the metric
        registry.counter("messages_counter", Tags.of("name", "accepted")).increment();

        // Return 200 so that Dapr treats the event as processed
        return Response.ok().build();
    }

All that was needed was to increment a counter in the MeterRegistry for each job that I receive from Dapr. You can find the whole code here. Once that code is running you can navigate to the /q/metrics endpoint and see that a lot of metrics, including the custom metrics from the code, are printed in a plain text format like this:

# HELP http_server_connections_seconds_max  
# TYPE http_server_connections_seconds_max gauge
http_server_connections_seconds_max 5.917031496
# HELP http_server_connections_seconds  
# TYPE http_server_connections_seconds summary
http_server_connections_seconds_active_count 1.0
http_server_connections_seconds_duration_sum 5.917009747
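
To get a feel for what a scraper such as telegraf has to do with this output, here is a minimal Java sketch that reads the sample lines of the text format into a map. The class and method names are made up for illustration and this is not how telegraf itself is implemented; the sketch ignores timestamps and skips special values such as +Inf.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a tiny reader for Prometheus-style text output. */
final class MetricsTextSketch {

    /** Maps each sample key (metric name plus optional {labels}) to its value. */
    static Map<String, Double> parseSamples(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line or HELP/TYPE metadata
            }
            // The key ends after the label block if there is one, else at the first space.
            int labelEnd = trimmed.lastIndexOf('}');
            int keyEnd = labelEnd >= 0 ? labelEnd + 1 : trimmed.indexOf(' ');
            if (keyEnd <= 0) {
                continue; // no value on this line
            }
            String key = trimmed.substring(0, keyEnd);
            // The first token after the key is the value; an optional timestamp is ignored.
            String valueToken = trimmed.substring(keyEnd).trim().split("\\s+")[0];
            try {
                samples.put(key, Double.parseDouble(valueToken));
            } catch (NumberFormatException e) {
                // Skipped in this sketch, for example an empty token or "+Inf".
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        String body = String.join("\n",
            "# TYPE http_server_connections_seconds_max gauge",
            "http_server_connections_seconds_max 5.917031496",
            "http_server_connections_seconds_active_count 1.0");
        parseSamples(body).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The point of the sketch is only that every non-comment line is one sample of one time series, which is why the format is so easy to scrape from a sidecar.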

The next step is to create a telegraf config that lets the agent scrape this data, enrich it with custom metadata like instance name or region, and configure the input and output plugins accordingly. Since a normal telegraf configuration is a file with hard-coded values, I changed it to read its values from environment variables that I can set automatically at deployment time. An example of the configuration file can be found here; I included it in my custom Dockerfile for the telegraf agent.


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# https://github.com/influxdata/telegraf/blob/release-1.24/plugins/inputs/prometheus/README.md
# # Read metrics from one or many prometheus clients
[[inputs.prometheus]]
#   ## An array of urls to scrape metrics from.
  urls = ["${PROMETHEUS_URL}"]

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################


# # Send aggregate metrics to Azure Monitor
# https://github.com/influxdata/telegraf/blob/release-1.24/plugins/outputs/azure_monitor/README.md
[[outputs.azure_monitor]]
#   ## Timeout for HTTP writes.
  timeout = "20s"
#
#   ## Set the namespace prefix, defaults to "Telegraf/<input-name>".
  #namespace_prefix = "engine/"
#
#   ## Azure Monitor doesn't have a string value type, so convert string
#   ## fields to dimensions (a.k.a. tags) if enabled. Azure Monitor allows
#   ## a maximum of 10 dimensions so Telegraf will only send the first 10
#   ## alphanumeric dimensions.
  strings_as_dimensions = false
#
#   ## Both region and resource_id must be set or be available via the
#   ## Instance Metadata service on Azure Virtual Machines.
#   #
#   ## Azure Region to publish metrics against.
#   ##   ex: region = "southcentralus"
  region = "${LOCATION}"
#   #
#   ## The Azure Resource ID against which metric will be logged, e.g.
#   ##   ex: resource_id = "/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/<vm_name>"
  resource_id = "${RESOURCE_ID}"
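
Telegraf resolves the ${...} placeholders itself when it starts, so nothing more is needed at runtime. If you want to catch a forgotten variable before deploying, a small pre-flight check along the following lines could help. This is a hypothetical helper of my own naming, not part of telegraf, and it only looks at the ${NAME} form used in the config above.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-flight check: which ${NAME} placeholders have no value? */
final class TelegrafEnvCheck {

    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Returns the placeholder names used in active config lines that are unset or empty in env. */
    static Set<String> missingVariables(String config, Map<String, String> env) {
        Set<String> missing = new LinkedHashSet<>();
        for (String line : config.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("#")) {
                continue; // commented-out settings are not evaluated here
            }
            Matcher matcher = PLACEHOLDER.matcher(trimmed);
            while (matcher.find()) {
                String name = matcher.group(1);
                String value = env.get(name);
                if (value == null || value.isEmpty()) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String config = String.join("\n",
            "[[inputs.prometheus]]",
            "  urls = [\"${PROMETHEUS_URL}\"]",
            "[[outputs.azure_monitor]]",
            "  region = \"${LOCATION}\"",
            "  resource_id = \"${RESOURCE_ID}\"");
        // Checks against the environment of the current process.
        System.out.println("Missing: " + missingVariables(config, System.getenv()));
    }
}
```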

Once the telegraf agent is built into a container image we need to configure it as a sidecar of our Azure Container App. A nice aspect of the telegraf Azure Monitor plugin is that it supports authentication with a Managed Identity. We will use one in the deployment manifest and grant it the permission it needs to publish metrics to our Azure Monitor workspace.

resource uami 'Microsoft.ManagedIdentity/userAssignedIdentities@2018-11-30' = {
  name: 'engine-msi'
  location: location
}

// Built-in role "Monitoring Metrics Publisher"
var metricsPublisherRoleDefinitionId = '/providers/Microsoft.Authorization/roleDefinitions/3913510d-42f4-4e42-8a64-420c390055eb'

resource metricsRoleAssignment 'Microsoft.Authorization/roleAssignments@2020-08-01-preview' = {
  name: guid(subscription().subscriptionId, uami.id)
  scope: resourceGroup()
  properties: {
    roleDefinitionId: metricsPublisherRoleDefinitionId
    principalId: uami.properties.principalId
    principalType: 'ServicePrincipal'
  }
}

resource containerApp 'Microsoft.App/containerapps@2022-01-01-preview' = {
  name: 'engine'
  kind: 'containerapp'
  location: location
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${uami.id}': {}
    }
  }
  properties: {
    managedEnvironmentId: resourceId('Microsoft.App/managedEnvironments', environmentName)
    configuration: {
      activeRevisionsMode: 'single'
      workloadProfileName: 'f4-compute'
...
  template: {
      containers: [
      {
          image: 'denniszielke/telegraf:opt'
          terminationGracePeriodSeconds: 5
          name: 'telegraf'
          resources: {
            cpu: '0.5'
            memory: '1Gi'
          }          
          env:[
            {
              name: 'AZURE_TENANT_ID'
              value: '${subscription().tenantId}'
            }
            {
              name: 'AZURE_CLIENT_ID'
              value: uami.properties.clientId
            }
            {
              name: 'RESOURCE_ID'
              value: '/subscriptions/${subscription().subscriptionId}/resourceGroups/${resourceGroup().name}/providers/Microsoft.App/containerapps/engine'
            }
            {
              name: 'LOCATION'
              value: location
            }
            {
              name: 'INSTANCE'
              value: 'engine'
            }
            {
              name: 'PROMETHEUS_URL'
              value: 'http://localhost:8080/q/metrics'
            }
          ]
        }
      }
    }
  }

In the environment variables of the telegraf container we set the AZURE_TENANT_ID and AZURE_CLIENT_ID of the Managed Identity, and the RESOURCE_ID so that we can differentiate between the metrics of different applications. We also set our desired metadata for LOCATION and INSTANCE, and the PROMETHEUS_URL from which metrics will be scraped. Once all that is done and our application has been deployed we should see metrics being ingested into our workspace.

To make it a bit more interesting I have also activated the new workload profiles for the Azure Container App Environment, which allow you to provision multiple Azure Container App replicas on a set of managed virtual machines chosen from a predefined set of CPU/memory configurations. The existing runtime experience has been renamed to Consumption and will, going forward, be optimised for fast-scaling, event-driven workloads with per-second execution pricing.

The new workload profiles offer a set of CPU/memory sizings that go beyond the existing 1:2 CPU/memory ratio, which is not very suitable for Java applications or other workloads that need more memory relative to CPU. Here is a sample of some of the possible profiles:

Name         Cores    MemoryGiB    Category
-----------  -------  -----------  ---------------
D4           4        16           GeneralPurpose
D8           8        32           GeneralPurpose
D16          16       64           GeneralPurpose
E4           4        32           MemoryOptimized
E8           8        64           MemoryOptimized
E16          16       128          MemoryOptimized
Consumption  4        8            Consumption

During the deployment of your Azure Container App Environment you can reference the profiles that you want to be available for your applications. The consumption profile is still there and can be used side by side with them in your environment. Once that is done you can decide per app which profile your replicas should be distributed on by using the new workloadProfileName parameter in your spec.

Just as you know from Kubernetes, you cannot allocate all of the available memory on the host to your app, because the underlying container runtime and daemonsets also need some resources. Overall, however, you get a better price/performance ratio for long-running containers on a suitable workload profile compared to deploying on the consumption plan.

    workloadProfiles: [
      {
        name: 'consumption'
        workloadProfileType: 'Consumption'
      }
      {
        name: 'f4-compute'
        workloadProfileType: 'F4'
        minimumCount: 1
        maximumCount: 3
      }
    ]
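
To make the earlier point about host overhead a bit more concrete, here is a rough back-of-the-envelope sketch in Java for how many replicas fit on one node of a profile. The reserved CPU and memory figures are placeholders that I made up for illustration; the real overhead depends on the platform, so measure it in your own environment. The class and method names are hypothetical as well.

```java
/** Rough capacity estimate; the reserved amounts are assumptions, not platform values. */
final class ProfileCapacitySketch {

    /** Returns how many replicas fit on one node once the reserved share is subtracted. */
    static int replicasPerNode(double nodeCores, double nodeMemoryGiB,
                               double reservedCores, double reservedMemoryGiB,
                               double replicaCores, double replicaMemoryGiB) {
        if (replicaCores <= 0 || replicaMemoryGiB <= 0) {
            throw new IllegalArgumentException("replica size must be positive");
        }
        double usableCores = nodeCores - reservedCores;
        double usableMemoryGiB = nodeMemoryGiB - reservedMemoryGiB;
        if (usableCores <= 0 || usableMemoryGiB <= 0) {
            return 0; // nothing left for application containers
        }
        int byCpu = (int) Math.floor(usableCores / replicaCores);
        int byMemory = (int) Math.floor(usableMemoryGiB / replicaMemoryGiB);
        // The scarcer resource decides how many replicas fit.
        return Math.min(byCpu, byMemory);
    }

    public static void main(String[] args) {
        // D4 from the table above: 4 cores, 16 GiB.
        // Assumed reserve: 0.5 cores and 1 GiB. Assumed replica (app plus sidecar): 1 core, 4 GiB.
        int fit = replicasPerNode(4, 16, 0.5, 1, 1, 4);
        System.out.println("Replicas per D4 node: " + fit);
    }
}
```

Remember to count the telegraf sidecar into the size of each replica, since its CPU and memory requests come out of the same node.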

Getting back to the original concern of getting metrics into a dashboard: you can now open Azure Managed Grafana and pick your metrics using the advanced dialog (I am unsure why the resources do not show up in the treeview) by specifying your subscription, resource group, resource type and resource name as seen below:

The treeview will not show your metrics source so you have to add it manually via the dialog

Once that is done you can query the data that has been ingested and create nice graphs on the full set of runtime metrics as well as your custom metrics.

Here you can see the overall diagram of the demo application architecture

I hope this was helpful and keeps you productive while we wait for native custom metric support to become available in Azure Container Apps. The demo project along with deployment instructions can be found in my GitHub repo.


Written by Dennis Zielke