Cloud monitoring experiment on NorduNET and SUNET in 2015. This report shows some aspects of performance monitoring on NorduNET VMware clusters in Stockholm and Copenhagen, and the same service running in a Docker container on Microsoft Azure.
The graphs can be seen on-line at http://grideye.nordu.net/grideye/.
Note that the Azure service is largely unknown: it is just a free evaluation service, with no special SLA and no contact taken with Microsoft. It is nevertheless interesting to see what you get from a public cloud.
The dashboard is Grafana and the monitoring application is Grideye, based on clicon. The graphs in this report are all from Grafana, but Grideye itself can present results in other ways as well.
The cloud monitoring system consists of an agent which runs the same set of operations within remote VMs, and a controller that communicates with the agents. The agents perform the following operations: I/O read, write and compute. The monitoring also measures the round-trip time from process-to-process (user-space), and reads a couple of counters.
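As a rough sketch of one agent measurement cycle (not the actual Grideye agent, which is built on clicon; the function names and the hashing "compute" stand-in here are illustrative only):

```python
import os
import tempfile
import time

def timed(op):
    """Run op() and return its wall-clock latency in seconds."""
    t0 = time.perf_counter()
    op()
    return time.perf_counter() - t0

def io_write(path, nbytes=3000):
    # Write nbytes to a fresh file and force it to disk.
    with open(path, "wb") as f:
        f.write(os.urandom(nbytes))
        f.flush()
        os.fsync(f.fileno())

def compute(loops=30000):
    # Illustrative stand-in for the dhrystone-style compute operation.
    x = 0
    for i in range(loops):
        x = (x * 31 + i) & 0xFFFFFFFF
    return x

# One cycle: time each operation and collect the latencies,
# which the agent would then report back to the controller.
path = os.path.join(tempfile.gettempdir(), "grideye-demo.bin")
metrics = {
    "io_write_s": timed(lambda: io_write(path)),
    "compute_s": timed(compute),
}
os.unlink(path)
```

The controller then aggregates these per-cycle latencies into the time series plotted in Grafana.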
The Grideye configuration for setting up this monitoring session is here, with this configuration manual.
The first set of graphs shows I/O latencies and compute times of four VMware clusters: two in Stockholm (se-fre and se-tug) and two in Copenhagen (dk-uni and dk-ore).
The operation is to read and write 3000 bytes to and from a file. The write operation is made on a new file, and the read operation is made at random offsets in a very large file, to avoid cache effects.
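The random-offset read can be sketched as follows (a simplified illustration, not the Grideye code; the file size is scaled down here, and the demo file is sparse):

```python
import os
import random
import tempfile

BLOCK = 3000                   # bytes per read, matching the 3000-byte operation
FILESIZE = 50 * 1024 * 1024    # the "very large" file, scaled down for the demo

# Create a file considerably larger than the read size.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.truncate(FILESIZE)

# Read BLOCK bytes from a random offset, so consecutive probes are
# unlikely to hit the same cached pages and the measurement reflects
# actual I/O rather than the page cache.
with open(path, "rb") as f:
    offset = random.randrange(0, FILESIZE - BLOCK)
    f.seek(offset)
    data = f.read(BLOCK)

os.unlink(path)
```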
The graphs show some differences between the sites. In particular, the dk sites show a step function at 07:00 where I/O reads take longer (more about this below).
The graphs show that 30K dhrystones take a little more effort on the dk sites than on the se sites, probably because of minor differences in CPU.
The CPU load graph shows a low load at this point in time.
This is not network-level RTT. However, on lightly loaded systems the two are similar, so it can be used in combination with other graphs (e.g. CPU load) to detect network problems.
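A process-to-process (user-space) RTT measurement can be sketched with a UDP echo, here against a local thread standing in for the remote agent (an illustration of the principle, not the Grideye protocol). The measured time includes socket-stack traversal and process scheduling on both ends, which is exactly why it diverges from network-level RTT under load:

```python
import socket
import threading
import time

def echo_once(sock):
    # Echo one datagram back to the sender (plays the role of the agent).
    data, addr = sock.recvfrom(2048)
    sock.sendto(data, addr)

# "Agent" side: a UDP socket on an ephemeral local port.
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))
threading.Thread(target=echo_once, args=(srv,), daemon=True).start()

# "Controller" side: time the send/receive round trip in user space.
cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.settimeout(5)
t0 = time.perf_counter()
cli.sendto(b"ping", srv.getsockname())
reply, _ = cli.recvfrom(2048)
rtt = time.perf_counter() - t0
```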
The graphs show that the Stockholm sites are close (less than 1 ms) to the controller. A magnification shows that se-tug is closest (ca 500 us), which is expected since the controller is also located at se-tug.
The Copenhagen sites are around 12 ms away, while a fifth site, docker-azure, has around 40 ms RTT.
The docker-azure site is a grideye-agent compiled into an Ubuntu Docker image and deployed on Microsoft Azure at a 'North European' site.
Again, this comparison is not entirely fair, since we do not know the details of the Azure server. On the other hand, it is very interesting to see how a well-known service performs on a public cloud.
The CPU load is higher as well, but not by as much.
The left graph shows how well NTP works between the controller and the four VMware clusters. It illustrates why NTP as-is usually cannot be used to measure sub-ms latencies.
The right graph shows the Docker Azure offset (around 10 s). The reason for this clock difference is that the Docker image containing the agent does not run NTP, and its clock therefore drifts. This is an error in the Docker build, not in Microsoft Azure.
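For reference, the standard NTP clock-offset estimate from the four packet timestamps can be written as follows. The error of the offset estimate is bounded by half the path asymmetry, which is why NTP alone cannot resolve sub-ms latency differences between sites (the example numbers below are illustrative, chosen to mimic a clock 10 s ahead on a 12 ms path):

```python
def ntp_offset(t0, t1, t2, t3):
    """Clock offset estimate from the four NTP timestamps:
    t0: client send, t1: server receive,
    t2: server send,  t3: client receive.
    theta = ((t1 - t0) + (t2 - t3)) / 2
    The estimate is off by half of any forward/return path asymmetry."""
    return ((t1 - t0) + (t2 - t3)) / 2

def ntp_delay(t0, t1, t2, t3):
    """Round-trip delay excluding server processing time:
    delta = (t3 - t0) - (t2 - t1)"""
    return (t3 - t0) - (t2 - t1)

# A server clock 10 s ahead (like the Docker image without NTP running),
# observed over a symmetric path with 12 ms RTT:
off = ntp_offset(0.000, 10.006, 10.007, 0.013)
delay = ntp_delay(0.000, 10.006, 10.007, 0.013)
```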
A dramatic increase in I/O latency can be seen on both sites. One can also see an increase in RTT and CPU load at the same time. The CPU load exceeds 100% on both sites (which is fine on a multi-core machine).
Most probably a cron job (e.g. a backup) is scheduled at this time, which affects the I/O and CPU load. High CPU load also causes higher RTT, since the agent process is not scheduled and activated promptly.
Note that the RTT is very high, on the order of tens of seconds, and that this is the average RTT during the interval.
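The "over 100%" reading follows from how Unix load is conventionally reported relative to a single core. A minimal sketch, assuming a Unix-like host (the function names are illustrative):

```python
import os

def load_percent():
    """1-minute load average as a percentage of one core.
    On a multi-core machine this can legitimately exceed 100%:
    e.g. load 4.0 on 8 cores is 400% of one core, but only half
    of the machine's total capacity."""
    load1, _, _ = os.getloadavg()
    return 100.0 * load1

def machine_utilization():
    """Load normalized by core count. Values above 100% here mean
    runnable processes are queuing for CPU, which delays scheduling
    of the agent process and inflates the measured RTT."""
    load1, _, _ = os.getloadavg()
    return 100.0 * load1 / os.cpu_count()

pct = load_percent()
util = machine_utilization()
```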