Counting bytes and measuring latencies

Units of elasticity, thinking beyond autoscaling

Most of the infrastructure elasticity related resources cite autoscaling as an example. While this is good to begin with, sometime i fin folks think of individual node or server as the basic unit of infrastructure elasticity, which can be increased or decreased to address scaling issues or better resource usage. But this is not true. Depending upon the technology you use, you can actually control resources at even lower level. You can control the memory, cpu usage, number of processes and many other kernel parameter in openvz or lxc. In lxc world its done via cgroups (a recent feature in kernel) while in openvz world you do it using user bean counters. You can change these paramteres without a a system restart. In host machines its done via sysctl.

What it means that you can actually profile your app  in staging environments against certain stress as the CI goes on, without any dedicated performance testing step; and then feedback it in the IaaS solution to determine the apropriate values for those parameters, and use some pessimistic settings as upper limits. Now, any performance bug (spanning across your app, your webserver or any other component thats deployed in the container) will most likely surface as a leak, and you can catch it, alert it , or might even break the build.

A more detailed example would be, monitoring number of processes via nagios, and feeding it to graphite (with graphios sitting in between), and then applying a moving average function (using graphite) on the values. After first 10 run you invoke system that will set the total allowed processes to the moving average plus some tolerance values. Now, if the process counts goes above the moving average  you'll get alert (like in openvz check for failcount), and then another nagios event handler resets the  threshold to another suitable value.


Since, you can do this for memory, openfile descriptors , jvm based system(using jmx and mbean counters) you can actually do preemptive performance testing. From what i have seeen, performance testing is always a last to do thing. Though in principle in preaching we say it should go hand in hand, but business demands the features more than performance unless its a show stopper. But now, that need not be the case. Even if we can inhibit a few of the bugs,, its worth it... 

let the systems emerge