
A guide to measuring your Elixir app

TL;DR: This post is a more thorough follow-up to a talk I gave at the last “São Paulo’s Elixir Users Group” meetup, which unfortunately I forgot to record. In this post I will describe how we collect and visualize metrics at Xerpa using InfluxDB, Elixometer, and Grafana.

Word of warning: This will be a long post. Brace yourselves.

0th step: Why should I care?

Metrics are very important once you get your software out into the terrible and evil wasteland called production. They are an essential part of the observability dimension of your application.

If you’re not convinced of this, I urge you to watch the classic “Metrics, Metrics Everywhere” talk by Coda Hale and to read Chapter 8 of Building Microservices by Sam Newman.

Using and understanding metrics completely changed the way I think about systems in production, and I don’t say this lightly.

The “main idea” presented by both resources boils down to this: you can’t reason about, improve, or make informed decisions about a system in production unless you measure it.

1st step: Collecting metrics

There are various ways of collecting metrics, and in this post I will focus on application metrics (i.e., metrics reported from inside the application). Collecting machine metrics is a boring and solved problem; here at Xerpa we use collectd to gather machine-level metrics like load average, memory, and disk consumption.

As Erlang is a battle-proven platform, it is no surprise that there are many available solutions to this problem. To name a few, there are Exometer, VMStats, Folsom, and Wombat. In this post I will focus on Pinterest’s Elixometer, a thin Elixir wrapper around Exometer.

How metric reporting works under the covers

Exometer buffers and aggregates metrics before sending them over the wire to some backend. A reporter is a module that translates the Exometer data into something the backend understands. If you ever change your storage backend, all you need to do is update the reporter configuration and you’re good to go. This design was popularized by the Metrics Java library.

Writing metrics to this buffer is very fast, and all actual reporting happens asynchronously in the background. Exometer handles retries and disconnects in the way you would expect from a library extracted from Riak.
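
Under the hood, reporting is driven by subscriptions: a reporter is told which datapoints of which metric to ship, and how often. Elixometer creates these subscriptions for you on first update (using the update_frequency you configure), but a manual sketch makes the pipeline concrete. The reporter and metric names below are illustrative:

# Ask Exometer to ship the :value datapoint of a counter to the
# InfluxDB reporter every 5 seconds. Elixometer does the equivalent
# of this for you automatically when you first update a metric.
:exometer_report.subscribe(
  :exometer_report_influxdb,                      # reporter
  [:myapp, :dev, :counters, :signup_user_count],  # metric name (illustrative)
  :value,                                         # datapoint to report
  5_000                                           # report interval in ms
)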

I will discuss the configuration of Exometer and the reporters later, in the “Configure your reporters” section.

Manually reporting individual metrics

Elixometer makes it easy to report metrics by simply calling the correct update function for your metric type: update_counter, update_histogram, update_gauge and update_spiral.

update_counter("signup_user_count", 1)
update_histogram("histogram_for_time_to_fill_form", 2)
update_spiral("spiral_time_to_notify", 3)
update_gauge("total_jobs_in_queue_gauge", 4)

You can add these calls pretty much anywhere in your system. There is absolutely nothing special about them; they are simple function calls.

To understand exactly what each metric type does, check out Exometer’s documentation.

If you’re interested in timing the execution of a function, Elixometer provides you with a very convenient Python-esque annotation, @timed:

# Timing a function. The full metric name will be
# <prefix>.<env>.timers.timed.function, e.g. myapp.dev.timers.timed.function
@timed(key: "timed.function")
def function_that_is_timed do
  OtherModule.slow_method
end

The timer metric is actually a histogram, so you will have access to things like percentiles, mean, median, count, and min/max values.
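
If you want to peek at those datapoints yourself, you can do it from an IEx session. A minimal sketch, assuming Elixometer registered the timer above under the fully-qualified name shown (use :exometer.find_entries/1 to see the names that actually got created):

# List every metric registered under the :myapp prefix.
:exometer.find_entries([:myapp])

# Fetch the datapoints of the timer's histogram (name is illustrative).
{:ok, datapoints} =
  :exometer.get_value([:myapp, :dev, :timers, :timed, :function])
# => {:ok, [n: 42, mean: 13, min: 2, max: 120, median: 9, ...]}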

Measuring every single request to Phoenix

Phoenix makes it very easy to measure every HTTP request ^1. All we need to do is create a Plug that starts a timer and registers a callback to stop it right before the HTTP response is sent. Note that the timer is based on :erlang.monotonic_time/1, so wall-clock adjustments can’t skew the measurement.

The MyApp.Plug.Metrics module is almost exactly what I have running in production:

defmodule MyApp.Plug.Metrics do
  @behaviour Plug

  use Elixometer

  @unit :milli_seconds

  def init(opts), do: opts
  def call(conn, _config) do
    # Incrementing a total http request count metric.
    update_counter("request_count", 1)

    # Here we start the timer for this one request.
    req_start_time = :erlang.monotonic_time(@unit)

    Plug.Conn.register_before_send conn, fn conn ->
      # This will run right before sending the HTTP response
      # giving us a pretty good measurement of how long it took
      # to generate the response.
      request_duration =
        :erlang.monotonic_time(@unit) - req_start_time

      conn |> metric_name |> update_histogram(request_duration)

      conn
    end
  end

  # Build the metric name based on the controller name and action
  defp metric_name(conn) do
    action_name = Phoenix.Controller.action_name(conn)
    controller  = Phoenix.Controller.controller_module(conn)
    "#{controller}\##{action_name}"
  end
end

Now, we need to attach this plug to the Phoenix controller definition. In the web.ex file, just add the plug to all controllers:

defmodule MyApp.Web do
  # ...
  def controller do
    quote do
      alias MyApp.Repo
      use Phoenix.Controller

      # ...

      plug MyApp.Plug.Metrics
    end
  end

  # ...
end

Voilà. With just that, we are now measuring every single request to our app. (See? If you have macros, you don’t need inheritance.)

Channels can be measured just as easily. Check out this post if you’re interested in that.

In the section about Grafana, I will demonstrate how these metrics can be visualized.

2nd step: Storing the metrics somewhere

Now that we’ve set up basic metrics collection, we need to store the data somewhere for further analysis and visualization. At Xerpa, we are using InfluxDB for this task.

InfluxDB is an open source database written in Go specifically to handle time series data with high availability and high performance requirements. InfluxDB installs in minutes without external dependencies, yet is flexible and scalable enough for complex deployments.

InfluxDB has a very simple SQL-like query language and many nice features like continuous queries and automatic data purge via retention policies. InfluxDB (unlike Graphite) is also optimized for very sparse series. There is absolutely no problem creating a series, adding some data to it and then never touching it again. Check out their docs for more info.
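
To give you a taste of the query language, here are two hedged examples against the HTTP API (the series names are illustrative; yours will depend on how the reporter names things):

# Keep only the last 30 days of data in the dev database.
$ curl -G "http://localhost:8086/query" --data-urlencode "q=CREATE RETENTION POLICY a_month ON dev DURATION 30d REPLICATION 1 DEFAULT"

# Mean of a series over the last hour, in one-minute buckets.
$ curl -G "http://localhost:8086/query" --data-urlencode "db=dev" --data-urlencode "q=SELECT MEAN(value) FROM request_count WHERE time > now() - 1h GROUP BY time(1m)"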

Though it is still in its early days (still v0.12 at the time of this writing), we haven’t had any problems running it in production in the past 6 months.

InfluxDB is also part of a family of products called InfluxData, which aims to provide a full-fledged platform for dealing with time-series data. Other members of the family are Chronograf (for time-series visualization), Kapacitor (for time-series processing, alerting, and anomaly detection), and Telegraf (for time-series data collection).

Getting InfluxDB running

It is very easy to set up an InfluxDB instance. In this post, we will use Docker for demonstration purposes. To run an InfluxDB node locally, just run:

$ docker run -d -p 8083:8083 -p 8086:8086 -t "tutum/influxdb:0.12"
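
You can check that the node is actually up by hitting its /ping endpoint, which should answer with an empty 204 response:

$ curl -sI "http://localhost:8086/ping"
# => HTTP/1.1 204 No Content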

Now, create a database for our tests:

$ curl -G "http://localhost:8086/query" --data-urlencode "q=CREATE DATABASE dev"
# => {"results":[{}]}

And we’re now set to write our application metrics.

We don’t use InfluxDB with Docker in production since we are Debian die-hards at Xerpa. The Influx folks maintain a Debian package and our installation in prod is pretty much a single dpkg -i influxdb.deb.

3rd step: Configure your reporters

Now that we have our storage up and running, we need to tell Exometer how to send it metrics.

First, we need to configure the package dependencies at mix.exs:

defp deps do
  [
    ######### Exometer dependency fixup
    {:elixometer, github: "pinterest/elixometer"},
    {:exometer_influxdb, github: "travelping/exometer_influxdb"},
    {:exometer_core, "~> 1.0", override: true},
    {:lager, "3.0.2", override: true},
    {:hackney, "~> 1.4.4", override: true}
  ]
end

Here we need to use override: true for lager, hackney, and exometer_core because elixometer and exometer_influxdb don’t agree on their required versions.

After your usual mix deps.get; mix deps.compile, we need to configure the elixometer and exometer_core OTP applications. In your config.exs file, add the following code:

config :elixometer, reporter: :exometer_report_influxdb,
  update_frequency: 5_000,
  env: Mix.env,
  metric_prefix: "myapp"

config :exometer_core, report: [
  reporters: [
    exometer_report_influxdb: [
      protocol: :http,
      host: "localhost",
      port: 8086,
      db: "dev"
    ]
  ]
]
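
One caveat: config.exs is shared by all Mix environments, so in practice you will want per-environment connection settings. A minimal sketch, assuming the conventional import_config "#{Mix.env}.exs" line at the bottom of config.exs (the hostname and database below are illustrative):

# config/prod.exs -- point the reporter at the production InfluxDB node.
config :exometer_core, report: [
  reporters: [
    exometer_report_influxdb: [
      protocol: :http,
      host: "influxdb.internal",  # illustrative production host
      port: 8086,
      db: "myapp_prod"
    ]
  ]
]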

With this, when starting your application you should see messages like this:

16:19:14.109 [info] Application lager started on node nonode@nohost
16:19:14.196 [info] Starting reporters with [{reporters,[{exometer_report_influxdb,[{protocol,http},{host,<<"localhost">>},{port,8086},{db,<<"dev">>},{tags,[{started_at,63629954320}]}]}]}]
16:19:14.197 [info] Application exometer_core started on node nonode@nohost
16:19:14.217 [info] Application elixometer started on node nonode@nohost
16:19:14.290 [info] InfluxDB reporter connecting success: [{protocol,http},{host,<<"localhost">>},{port,8086},{db,<<"dev">>},{tags,[{started_at,63629954320}]}]
16:19:14.328 [info] Running MyApp.Endpoint with Cowboy using http on port 4000
16:19:16.976 [debug] Tzdata polling for update.
16:19:17.006 [warning] lager_error_logger_h dropped 84 messages in the last second that exceeded the limit of 50 messages/sec
16:19:18.569 [debug] Tzdata polling shows the loaded tz database is up to date.
08 May 16:19:21 - info: compiled 20 files into 2 files, copied 155 in 6852ms

4th step: Visualizing the metrics

Now, all we need to do is figure out how to visualize our metrics.

Grafana is an open-source, general-purpose dashboard and graph composer, which runs as a web application. It supports Graphite, InfluxDB or OpenTSDB as backends. Grafana is arguably the most beautiful dashboarding solution out there.

Setting up Grafana is just as easy as InfluxDB. We will use Docker to do so:

$ docker run -d -p 3000:3000 grafana/grafana:2.6.0

We can now log in using the always-so-secure admin:admin credentials at http://localhost:3000.

We now need to add our InfluxDB database as a data-source for Grafana. To do so, we click “Data Sources” and then “Add New”. Fill the form based on the picture below:

[Image: Grafana’s “Add data source” form: set Type to InfluxDB, Url to http://localhost:8086, and Database to dev]

(The InfluxDB credentials are root:root)
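
Grafana’s query editor will list the measurements InfluxDB knows about, but it helps to explore them by hand first. A hedged example (measurement and field names depend on the reporter and on which datapoints were subscribed):

# List every measurement the reporter has created so far.
$ curl -G "http://localhost:8086/query" --data-urlencode "db=dev" --data-urlencode "q=SHOW MEASUREMENTS"

# The kind of query you would paste into a graph panel: 95th percentile
# of a request histogram in one-minute buckets (names are illustrative).
SELECT MEAN("95") FROM myapp_dev_timers_request_time WHERE $timeFilter GROUP BY time(1m)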

With this, you’re set to explore Grafana and create new dashboards. Below are some examples of our production Dashboards:

[Image: examples of our production dashboards]

N’th step: Where to go from here

Having followed this post, you now have a complete time-series storage and analysis suite at your disposal. Leverage it to create meaningful indicators for your business and make more informed decisions. (Suited-up bosses will love to share your graphs in their shiny Prezi presentations.)

There is a lot of ground we haven’t covered in the so-called observability field of software engineering. Things like alerting, tracing, log aggregation, and error tracking are just as important as application metrics, and you should pursue them too.

Here at Xerpa, we use honeybadger.io and sensu to cover some of that ground. I will probably blog about this in the future.

That’s it.

(Special thanks go to Guilherme Nogueira (@nirev), Hugo Bessa (@hugoBessaa) and George Guimarães (@georgeguimaraes) for their comments and helpful insights)

^1 : This idea is adapted from this post by Alex Garibay
