Grideye on NorduNET and SUNET

April 2015

Cloud monitoring experiment on NorduNET and SUNET in 2015. This report contains a number of graphs showing some aspects of performance monitoring on NorduNET VMware clusters in Stockholm and Copenhagen, and of the same service running in a Docker container on Microsoft Azure.

The graphs can be seen on-line at http://grideye.nordu.net/grideye/.

Note that the Azure service is essentially a black box: it is just a free evaluation service, with no special SLA and no contact taken with Microsoft. It is still interesting to see what you get from a public cloud.

The dashboard is Grafana and the monitoring application is Grideye, which is based on clicon. The graphs in this report are all from Grafana, but Grideye itself can present results in other ways as well.

The cloud monitoring system consists of agents, which run the same set of operations within remote VMs, and a controller that communicates with the agents. The agents perform the following operations: I/O read, I/O write and compute. The monitoring also measures the process-to-process (user-space) round-trip time, and reads a couple of system counters.

The Grideye configuration for setting up this monitoring session is available here, together with the configuration manual.

1. Input/Output

The first set of graphs shows I/O latencies and compute for four VMware clusters: two in Stockholm (se-fre and se-tug) and two in Copenhagen (dk-uni and dk-ore).

The operation is to read and write 3000 bytes to and from a file. The write operation is made on a new file, and the read operation is made at a random offset in a very large file, to avoid cache effects.
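To make the test concrete, here is a minimal sketch in C of this kind of timed I/O operation. It is not the actual Grideye agent code: the file names, the use of fsync, and the assumed 1 GB size of a pre-created large file are illustrative assumptions.

    /* Hypothetical sketch of the timed I/O test: write 3000 bytes to a new
     * file, and read 3000 bytes at a random offset in a large pre-existing
     * file to avoid cache effects. Not the actual Grideye code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define IOSIZE 3000

    /* Elapsed time in microseconds between two timespecs */
    static double us_diff(struct timespec t0, struct timespec t1)
    {
        return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }

    int main(void)
    {
        char            buf[IOSIZE] = {0};
        struct timespec t0, t1;
        int             fd;
        off_t           large = 1024L*1024L*1024L;  /* assumed size of the large file */

        /* Timed write of 3000 bytes to a new file */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fd = open("/tmp/grideye_write.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        write(fd, buf, IOSIZE);
        fsync(fd);
        close(fd);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("write latency: %.0f us\n", us_diff(t0, t1));

        /* Timed read of 3000 bytes at a random offset in a large file */
        srandom(time(NULL));
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fd = open("/tmp/grideye_large.dat", O_RDONLY);  /* assumed pre-created file */
        pread(fd, buf, IOSIZE, random() % (large - IOSIZE));
        close(fd);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("read latency:  %.0f us\n", us_diff(t0, t1));
        return 0;
    }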

The graphs show some differences between the sites. In particular, the dk sites show a step function at 07:00, when I/O reads take longer (more on this below).

2. Compute

The second set of graphs shows compute time and CPU load. The compute measurement uses an (old) benchmark called Dhrystone; the agent runs 30000 Dhrystone iterations in each round.
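As an illustration, the sketch below shows how a fixed 30000-iteration compute round can be timed. The dhry_iteration() function is a placeholder stand-in; the real agent runs the actual Dhrystone benchmark kernel.

    /* Sketch of timing a fixed 30000-iteration compute round. dhry_iteration()
     * is a stand-in for one Dhrystone loop iteration. */
    #include <stdio.h>
    #include <time.h>

    static volatile int sink;       /* prevents the compiler from removing the loop */

    static void dhry_iteration(int i)
    {
        sink = i * i % 97;          /* placeholder work, not real Dhrystone */
    }

    int main(void)
    {
        struct timespec t0, t1;
        int             i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < 30000; i++)
            dhry_iteration(i);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("compute: %.0f us\n",
               (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);
        return 0;
    }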

The graphs show that 30K Dhrystone iterations take slightly longer on the dk sites than on the se sites, probably because of minor differences in CPU.

The CPU load graph shows a low load at this point in time.

3. Memory


The graphs show some memory counters as documented in sysinfo(2). Typically, these graphs are useful in combination with others when debugging.
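For reference, counters of this kind can be read on Linux with the sysinfo(2) system call, as in the sketch below. Which subset of these fields the agent actually reports is an assumption here.

    /* Sketch of reading memory counters via sysinfo(2). */
    #include <stdio.h>
    #include <sys/sysinfo.h>

    int main(void)
    {
        struct sysinfo si;

        if (sysinfo(&si) < 0) {
            perror("sysinfo");
            return 1;
        }
        /* mem_unit is the size in bytes of each memory field */
        printf("totalram:  %lu kB\n", si.totalram  * si.mem_unit / 1024);
        printf("freeram:   %lu kB\n", si.freeram   * si.mem_unit / 1024);
        printf("sharedram: %lu kB\n", si.sharedram * si.mem_unit / 1024);
        printf("bufferram: %lu kB\n", si.bufferram * si.mem_unit / 1024);
        printf("freeswap:  %lu kB\n", si.freeswap  * si.mem_unit / 1024);
        return 0;
    }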

4. Round-trip time


The round-trip graphs show how long it takes for packets sent by the controller to be returned, minus the processing time in the agent. This RTT is measured to the user-space agent process, which means that if the CPU is loaded, the RTT will increase.

It is not a network-level RTT. However, on lightly loaded systems the two are similar, so it can be used in combination with other graphs (e.g. CPU load) to detect network problems.
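A sketch of the RTT calculation as described above is shown below: the controller timestamps the request and the reply, and subtracts the processing time reported by the agent. The function and field names are illustrative, not Grideye's actual protocol.

    /* Sketch: process-to-process RTT, excluding the agent's own processing time. */
    #include <stdio.h>
    #include <time.h>

    /* Convert a timespec to microseconds */
    static double ts_us(struct timespec t)
    {
        return t.tv_sec * 1e6 + t.tv_nsec / 1e3;
    }

    /* RTT in microseconds, minus the processing time reported by the agent */
    static double rtt_us(struct timespec t_send, struct timespec t_recv,
                         double agent_proc_us)
    {
        return ts_us(t_recv) - ts_us(t_send) - agent_proc_us;
    }

    int main(void)
    {
        struct timespec t_send = {100, 0};          /* example timestamps      */
        struct timespec t_recv = {100, 12700000};   /* reply seen 12.7 ms later */

        /* Agent reported 0.5 ms processing time -> RTT = 12.2 ms */
        printf("rtt: %.1f us\n", rtt_us(t_send, t_recv, 500.0));
        return 0;
    }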

The graphs show that the Stockholm sites are close to the controller (less than 1 ms). A magnification shows that se-tug is closest (ca 500 us), which is because the controller is also located at se-tug.

The Copenhagen sites are around 12 ms away, while a fifth site, docker-azure, has an RTT of around 40 ms.

The docker-azure site is a grideye-agent compiled into an Ubuntu Docker image and deployed on Microsoft Azure at a 'north european' site.

5. Docker I/O


The graphs above show the performance of the Docker agent on Microsoft Azure running the same load as the other agents. It is quite clear that the Docker image performs worse, especially on disk writes. The variance in the graphs is also noticeable.

Again, this comparison is not directly fair since we do not know the details of the Azure server. On the other hand, it is very interesting to see how a well-known service performs on a public cloud.

6. Docker compute


The Dhrystone test is even more damaging for the Docker agent. The VMware sites take slightly more than 1 ms to complete the task, while the Docker agent needs around 25 ms.

The CPU load is also higher, but not by as much.

7. Clock diffs


Grideye also measures clock offset. This is not really a performance indicator, but it shows how well time synchronization works between the controller and the agents.
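One common way to estimate such an offset from request/reply timestamps is the NTP-style formula sketched below; this is shown for illustration and is not necessarily the exact method Grideye uses.

    /* Sketch of an NTP-style clock offset estimate between controller and agent.
     * t1/t4 are the controller's send/receive times, t2/t3 the agent's
     * receive/send times (agent clock), all in microseconds. */
    #include <stdio.h>

    /* Estimated offset of the agent clock relative to the controller clock */
    static double clock_offset_us(double t1, double t2, double t3, double t4)
    {
        return ((t2 - t1) + (t3 - t4)) / 2.0;
    }

    int main(void)
    {
        /* Example: agent clock ~10 s ahead, 12 ms network RTT */
        double t1 = 0.0;                  /* controller sends request    */
        double t2 = 10000000.0 + 6000.0;  /* agent receives (agent time)  */
        double t3 = 10000000.0 + 6500.0;  /* agent replies (agent time)   */
        double t4 = 12500.0;              /* controller receives reply    */

        printf("offset: %.0f us\n", clock_offset_us(t1, t2, t3, t4));
        return 0;
    }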

The left graph shows how well NTP works between the controller and the four VMware clusters. It also shows why NTP as-is usually cannot be used to measure sub-millisecond latencies.

The right graph shows the Docker Azure offset (around 10 s). The reason for this clock difference is that the Docker image containing the agent does not run NTP, and therefore its clock drifts. It is an error in the Docker build, not in Microsoft Azure.

8. Anomalies

This section shows some anomalies in the monitoring graphs.

The first example shows an event that occurs on the 'se' sites every Sunday morning at 3 a.m., and to a lesser extent just after 4 a.m. It is not visible on any other site.

A dramatic increase in I/O latency can be seen on both sites. One can also see an increase in RTT and CPU load at the same time. The CPU load exceeds 100% on both sites (which is fine on a multi-core machine).

Most probably a cron job (e.g. a backup) is scheduled at this time, which affects the I/O and CPU load. High CPU load also causes higher RTT, since the agent process is not scheduled and activated promptly.

Note that the RTT is very high, on the order of tens of seconds, and this is the average RTT during the interval.


Another example is the periodic increase in I/O read latency on the Danish sites, especially 'dk-uni'. The increase takes place during daytime, between 7:00 and 22:00, in a regular fashion. The increase is around 100%.

A third anomaly is the sudden increase in RTT to the Danish sites on March 30 between 10:00 and 11:00. For one hour, both sites experienced an increase in RTT from ca 12 ms up to 150 ms.

Copyright Olof Hagsand 2015