
Zuzur’s wobleg

technology, Internet, and a bit of this and that (with some dyslexia inside)

Metrics + Grails = Awesomeness


I have spent the last 5 years working as a devops engineer/operations manager, and I have often been very frustrated by Java applications that didn’t provide an easy way for operators to plug them into their monitoring infrastructure.

I mean, have you ever tried to monitor the number of requests per second handled by an application running in a JBoss/Jetty/Tomcat container? On one side, you have simple tools that can easily parse JSON documents (for instance) or send HTTP requests, and on the other side, you enter the realm of JMX, which is as “simple” as the “Simple” Network Management Protocol is…

There is simply no way you can collect statistics/healthchecks from a Java application without installing at least a Java VM and deciphering the very complex settings of the JMX console.

There are very nice solutions to that (yes, I’m talking about you, New Relic’s RPM) but they aren’t cheap. I know there are good command-line alternatives, but they all require a JVM to be installed, and they are all based on that damn JMX protocol…

The consequences are pretty damning. I have yet to meet a Java developer who will proactively implement that kind of monitoring tool inside her application, or even plan it for the future…

When I’m looking at an application as an operator, my ultimate goal is usually to integrate it into some Chef recipes or deployment tool scripts (Capistrano, etc.), within a complex platform with dozens of servers, load balancers and so on.

So I would like this application to behave properly, and in a more than ideal world, what I would find would be:

  1. operation manuals explaining how to set up and configure the application
  2. start and stop scripts
  3. monitoring entry points
  4. healthchecks for load balancers and monitoring tools
  5. latches for maintenance: an easy way to make a web app return 503 errors on purpose (maintenance mode), for instance to remove it from a load balancer and update its configuration while the rest of the servers are happily serving requests

You often find 1 & 2, but I’ve rarely found any of the last 3 items in that list… Have you ever seen all of that in your standard, out-of-the-box Java application?

To my knowledge, only one Java framework has most of that embedded right in its core: Dropwizard. Check it out. The fat jar idea (packaging everything the application needs, including the servlet container, into one big jar file, and using an external YAML configuration file) is a dream come true for an operator…

Enter metrics

From the same nice guys who wrote Dropwizard comes a very fine library, metrics, which plugs easily into any Java application and provides annotations to add meters, gauges, histograms, etc. to any method in your application. You can also easily write simple health checks.

The library also comes with a servlet that allows anyone (so you have to take care of controlling access to that servlet in the container) to access the recorded metrics as simple JSON documents, and a simple page which returns an HTTP 500 status code if one of the registered health checks doesn’t pass.

Grails + metrics

When I set out to add metrics to the prototype of the “perfect Java application” I am currently working on, I found out there is a plugin for that (yes! Grails is the iPhone of web frameworks: “There’s an app for that” :-))

install the yammer-metrics plugin

Add the yammer-metrics plugin to the project’s BuildConfig.groovy

BuildConfig.groovy
    plugins {
        //...
        compile ":yammer-metrics:3.0.1-2"
        //...
    }

After that, you can add @Metered and @Timed annotations to any method in your project, and metrics will start collecting profiling information about the methods you instrument…

Controller.groovy
//...
    @Metered
    @Timed
    def index(Integer max) {
      params.max = Math.min(max ?: 15, 100)
      respond DataSet.list(params), model: [dataSetInstanceCount: DataSet.count()]
    }

    def show(DataSet dataSetInstance, Integer max) {
      params.max = Math.min(max ?: 15, 100)
//...

monitoring the JVM

Monitoring the VM (particularly the garbage collection process) can help you understand performance issues (I’ve seen applications get a 10x performance boost just by reducing the number of new String instances created, and using StringBuffer instances instead…)

The metrics plugin doesn’t provide this by default, but it’s quite easy to add one of the existing metrics modules or write your own…

Add a dependency in BuildConfig.groovy

BuildConfig.groovy
    dependencies {
      //...
      compile 'com.codahale.metrics:metrics-jvm:3.0.1'
      //...
    }

Then you just need to update BootStrap.groovy:

BootStrap.groovy
      // Imports needed at the top of BootStrap.groovy
      import java.lang.management.ManagementFactory
      import com.codahale.metrics.jvm.*

      // Instrument the JVM (inside the init closure)
      Metrics.getRegistry().register("jvm.buffers", new BufferPoolMetricSet(ManagementFactory.getPlatformMBeanServer()))
      Metrics.getRegistry().register("jvm.gc", new GarbageCollectorMetricSet())
      Metrics.getRegistry().register("jvm.memory", new MemoryUsageGaugeSet())
      Metrics.getRegistry().register("jvm.threads", new ThreadStatesGaugeSet())

adding healthchecks

Say you’d like to set up an alert if the application’s free storage space drops below a certain limit (I know there are many different ways to do that, but I really like this one because it’s integrated right in the application that needs that storage!)

StorageHealthCheck.groovy
    import com.codahale.metrics.health.HealthCheck

    class StorageHealthCheck extends HealthCheck {

      // 1 GB, expressed in bytes
      public static final long DEFAULT_FREE_SPACE = 1000000000L

      private final File storagePath
      private final long minimumSpace // minimum free space in bytes, default is 1 GB

      StorageHealthCheck(String storagePath, long minimumSpace = DEFAULT_FREE_SPACE) {
        this.storagePath = new File(storagePath)
        this.minimumSpace = minimumSpace
      }

      @Override
      public HealthCheck.Result check() throws Exception {
        if (storagePath.getFreeSpace() > minimumSpace) {
          return HealthCheck.Result.healthy()
        } else {
          // report the threshold in GB (1 GB = 1e9 bytes)
          return HealthCheck.Result.unhealthy("Free space below ${minimumSpace / 1e9} GB !")
        }
      }
    }

Then you just need to register that new health check in the metrics registry. Update BootStrap.groovy (or somewhere down the chain of your application initialization code):

BootStrap.groovy
    log.info("Setting up StorageService healthcheck")
    def minimumFreeSpace = grailsApplication.config.storage?.minimumSpace ?: StorageHealthCheck.DEFAULT_FREE_SPACE
    HealthChecks.register(StorageService.name,new StorageHealthCheck(storageBase.getPath(),minimumFreeSpace))

You’ll want to add all sorts of health checks for every service your application depends on to work properly: database, work queue server (what about regularly checking that a message is properly received after having been sent to a specific work queue?).
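As an illustration of such a dependency check, here is a minimal sketch of a database health check. It is not part of the plugin or of the application described here: the class name DatabaseHealthCheck and the 2-second timeout are assumptions made for the example, and it relies only on the standard JDBC Connection.isValid() call. If obtaining a connection throws, HealthCheck.execute() reports the exception as an unhealthy result.

```groovy
import com.codahale.metrics.health.HealthCheck
import javax.sql.DataSource
import java.sql.Connection

// Hypothetical example: the class name and default timeout are illustrative only.
class DatabaseHealthCheck extends HealthCheck {

  private final DataSource dataSource
  private final int timeoutSeconds

  DatabaseHealthCheck(DataSource dataSource, int timeoutSeconds = 2) {
    this.dataSource = dataSource
    this.timeoutSeconds = timeoutSeconds
  }

  @Override
  protected HealthCheck.Result check() throws Exception {
    Connection connection = dataSource.getConnection()
    try {
      // isValid() asks the JDBC driver whether the connection is still usable
      return connection.isValid(timeoutSeconds) ?
          HealthCheck.Result.healthy() :
          HealthCheck.Result.unhealthy('Database connection is not valid')
    } finally {
      // always hand the connection back, whatever the outcome
      connection.close()
    }
  }
}
```

It could be registered the same way as the storage check, passing the application’s dataSource bean to the constructor.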

maintenance latches

One health check that I find especially important is a very simple one.

It just checks for the presence of a file somewhere on the file system and starts to fail if the file is present…

The application is put in maintenance mode the second an operator runs touch /tmp/myApplicationMaintenance. Very handy!
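A minimal sketch of such a latch, assuming the same HealthCheck base class as the storage check. The class name MaintenanceHealthCheck is hypothetical, and the default path is simply the one used in the touch command.

```groovy
import com.codahale.metrics.health.HealthCheck

// Hypothetical sketch of the maintenance latch; the class name is illustrative only.
class MaintenanceHealthCheck extends HealthCheck {

  private final File latchFile

  MaintenanceHealthCheck(String latchPath = '/tmp/myApplicationMaintenance') {
    this.latchFile = new File(latchPath)
  }

  @Override
  protected HealthCheck.Result check() throws Exception {
    if (latchFile.exists()) {
      // the latch file is present: report the application as unhealthy on purpose
      String message = 'Maintenance mode: ' + latchFile.path + ' is present'
      return HealthCheck.Result.unhealthy(message)
    }
    return HealthCheck.Result.healthy()
  }
}
```

Removing the file makes the check pass again on its next execution, so leaving maintenance mode is just an rm away.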

self test

We can imagine that before performing anything, an application would check its own health. But this has implications… You don’t want to run a full-blown list of health checks on every request your web application receives, and this could lead to re-entrancy issues.
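One possible way around both problems, sketched here as an assumption rather than something the plugin provides, is to run the checks on a background schedule and let request handling read only a cached flag. The CachedSelfTest class below is hypothetical and uses nothing but java.util.concurrent; the closure passed to it stands in for whatever runs your registered health checks.

```groovy
import java.util.concurrent.Executors
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper: runs the supplied checks in the background and caches the outcome.
class CachedSelfTest {

  private final AtomicBoolean healthy = new AtomicBoolean(true)
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor()
  private final Closure<Boolean> checks

  CachedSelfTest(Closure<Boolean> checks) {
    this.checks = checks
  }

  // Runs the checks once and stores the result; any exception counts as unhealthy.
  void refresh() {
    try {
      healthy.set(checks.call() as boolean)
    } catch (Exception ignored) {
      healthy.set(false)
    }
  }

  // Re-runs the checks every periodSeconds on a single background thread.
  void start(long periodSeconds) {
    scheduler.scheduleWithFixedDelay({ refresh() } as Runnable, 0, periodSeconds, TimeUnit.SECONDS)
  }

  // Cheap enough to call on every request: it only reads the cached flag.
  boolean isHealthy() { healthy.get() }

  void stop() { scheduler.shutdownNow() }
}
```

Because requests never trigger the checks themselves, a check that calls back into the application cannot re-enter the self test, at the cost of a result that may be up to one period old.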

metrics servlet

The plugin sets up a /metrics endpoint which displays a very simple web page (it’s not meant for lusers!) with a few interesting items:

  • JSON dump of every meter/timer collected
  • thread dump of your application (kill -QUIT without having to access the server)
  • healthchecks: calls every registered health check and returns 200 if everything’s OK, and 500 if not

Results, at last!

Now I can just monitor servers with curl or any script-based monitoring tool (Nagios, Icinga, …) without having to mess with Java on my monitoring infrastructure!

temp.json
curl -v http://localhost:8090/app/metrics/metrics?pretty=true
* Adding handle: conn: 0x7fc059803a00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fc059803a00) send_pipe: 1, recv_pipe: 0
* About to connect() to localhost port 8090 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 8090 (#0)
> GET /app/metrics/metrics?pretty=true HTTP/1.1
> User-Agent: curl/7.30.0
> Host: localhost:8090
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: application/json
< Transfer-Encoding: chunked
< Date: Sun, 28 Dec 2014 17:18:48 GMT
<
...
"meters" : {
  "RabbitConsumer.handleMessageMeter" : {
    "count" : 85,
    "m15_rate" : 0.7732454500737989,
    "m1_rate" : 0.08930519529809891,
    "m5_rate" : 0.24577069821886505,
    "mean_rate" : 0.6213198221855765,
    "units" : "events/second"
    },
    ...
    "timers" : {
      "RabbitConsumer.handleMessageTimer" : {
        "count" : 84,
        "max" : 5.312139,
        "mean" : 0.9895226904761905,
        "min" : 0.156007,
        "p50" : 0.731279,
        "p75" : 1.250077,
        "p95" : 2.82703675,
        "p98" : 5.224355500000001,
        "p99" : 5.312139,
        "p999" : 5.312139,
        "stddev" : 0.9116050657129663,
        "m15_rate" : 0.7690302373777833,
        "m1_rate" : 0.08829140595516632,
        "m5_rate" : 0.2432387618368667,
        "mean_rate" : 0.6140014208723833,
        "duration_units" : "seconds",
        "rate_units" : "calls/second"
        },
      }
...

And most important of all: I can easily monitor the business-critical aspects of my application and detect when a change or deployment caused a problem in those figures.

Clearer information leads to better, informed decisions…

I think this plugin should be part of the Grails distribution. Maybe there would be licensing issues between SpringSource and Codahale, but having that right in the framework would be just plain awesome!
