vSphere Monitoring with Grafana

vSphere Monitoring with TIG (Telegraf, InfluxDB, Grafana)

This blog post describes a solution for monitoring SDDC infrastructure using Telegraf, InfluxDB, and Grafana. The solution runs in Docker containers: Telegraf collects metrics from the vSphere API, InfluxDB stores them, and Grafana displays them as graphs. The metrics to collect are defined in the telegraf.conf file.

Grafana - vSphere Overview

TL;DR: https://github.com/varmox/vsphere-monitoring.git

Prerequisites

  • RHEL based Linux
  • Docker and Docker Compose installed (Podman with Docker Compose also works; this tutorial uses Docker and Docker Compose)
  • 3 internal IPs for the Telegraf, Grafana, and InfluxDB containers
  • Access to vSphere API (read-only is sufficient)

Environment Variables

In this tutorial we're using the following variables; change them according to your setup:

  • subnet 172.29.30.0/23
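Before going further, it can help to confirm that the container IPs used later in this post (172.29.31.2 to 172.29.31.4) really fall inside the subnet. The following is a minimal POSIX shell sketch; the helper names ip_to_int and in_subnet are made up for this example and are not part of any tool.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return 0 if address $1 lies inside CIDR block $2 (e.g. 172.29.30.0/23).
in_subnet() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Check the three container addresses from this tutorial.
for addr in 172.29.31.2 172.29.31.3 172.29.31.4; do
  if in_subnet "$addr" 172.29.30.0/23; then
    echo "$addr is inside 172.29.30.0/23"
  else
    echo "$addr is OUTSIDE 172.29.30.0/23"
  fi
done
```

If you use a different subnet, replace the CIDR block and the addresses in the loop with your own values.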

Filesystem structure:

/srv/tig
├── docker-compose.yml
└── telegraf.conf
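One way to create this layout from the shell is sketched below. The function name create_tig_layout is hypothetical; it only creates the directory and two empty files, which you then fill with the contents shown in this post.

```shell
# Create the project directory with empty placeholder files.
# $1 is the target directory; it defaults to /srv/tig as used in this post.
create_tig_layout() {
  dir=${1:-/srv/tig}
  mkdir -p "$dir" || return 1
  # touch does not alter the contents of files that already exist
  touch "$dir/docker-compose.yml" "$dir/telegraf.conf"
}

# Example call (writing below /srv usually requires root):
# create_tig_layout /srv/tig
```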

Create Telegraf Config

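Open telegraf.conf in an editor of your choice and replace its contents with the configuration below. Adjust vcenters, username, and password, as well as the InfluxDB urls and token, to your environment. This config skips TLS certificate verification for vCenter (insecure_skip_verify = true), so it also works with self-signed certificates.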

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false
[[outputs.influxdb_v2]]
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ##   ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  urls = ["http://172.29.31.3:8086"]

  ## Token for authentication.
  token = "sFYmFqizqj8oJyJxuMc7l4PEztLIPOWfe0Aae_QwQiJXA-obkKo_AHCEgXRwMCEXXyYsq6mqRayan3Ylh_Dy0g=="

  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "sddc"

  ## Destination bucket to write into.
  bucket = "vsphere"

  ## The value of this tag will be used to determine the bucket.  If this
  ## tag is not set the 'bucket' option is used as the default.
  # bucket_tag = ""

  ## If true, the bucket tag will not be added to the metric.
  # exclude_bucket_tag = false

  ## Timeout for HTTP messages.
  # timeout = "5s"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Proxy override, if unset values the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

  ## HTTP User-Agent
  # user_agent = "telegraf"

  ## Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  # content_encoding = "gzip"

  ## Enable or disable uint support for writing uints influxdb 2.0.
  # influx_uint_support = false

  ## Optional TLS Config for use on HTTP connections.
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false
# Read metrics from VMware vCenter
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://vcenter-fqdn/sdk" ]
  username = "[email protected]"
  password = "secret"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  # vm_include = [ "/*/vm/**"] # Inventory path to VMs to collect (by default all are collected)
  # vm_exclude = [] # Inventory paths to exclude
  vm_metric_include = [
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.run.summation",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.wait.summation",
    "mem.active.average",
    "mem.granted.average",
    "mem.latency.average",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.usage.average",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.usage.average",
    "power.power.average",
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "virtualDisk.read.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.throughput.usage.average",
    "virtualDisk.totalReadLatency.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.write.average",
    "virtualDisk.writeOIO.latest",
    "sys.uptime.latest",
  ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

  ## Hosts
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  # host_include = [ "/*/host/**"] # Inventory path to hosts to collect (by default all are collected)
  # host_exclude = [] # Inventory paths to exclude
  host_metric_include = [
    "cpu.coreUtilization.average",
    "cpu.costop.summation",
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.swapwait.summation",
    "cpu.usage.average",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.utilization.average",
    "cpu.wait.summation",
    "disk.deviceReadLatency.average",
    "disk.deviceWriteLatency.average",
    "disk.kernelReadLatency.average",
    "disk.kernelWriteLatency.average",
    "disk.numberReadAveraged.average",
    "disk.numberWriteAveraged.average",
    "disk.read.average",
    "disk.totalReadLatency.average",
    "disk.totalWriteLatency.average",
    "disk.write.average",
    "mem.active.average",
    "mem.latency.average",
    "mem.state.latest",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.totalCapacity.average",
    "mem.usage.average",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.errorsRx.summation",
    "net.errorsTx.summation",
    "net.usage.average",
    "power.power.average",
    "storageAdapter.numberReadAveraged.average",
    "storageAdapter.numberWriteAveraged.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average",
    "sys.uptime.latest",
  ]
  ## Collect IP addresses? Valid values are "ipv4" and "ipv6"
  # ip_addresses = ["ipv6", "ipv4" ]

  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default


  ## Clusters
  # cluster_include = [ "/*/host/**"] # Inventory path to clusters to collect (by default all are collected)
  # cluster_exclude = [] # Inventory paths to exclude
  # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  # cluster_metric_exclude = [] ## Nothing excluded by default
  # cluster_instances = false ## false by default

  ## Datastores
  # datastore_include = [ "/*/datastore/**"] # Inventory path to datastores to collect (by default all are collected)
  # datastore_exclude = [] # Inventory paths to exclude
  # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default

  ## Datacenters
  # datacenter_include = [ "/*/host/**"] # Inventory path to clusters to collect (by default all are collected)
  # datacenter_exclude = [] # Inventory paths to exclude
  datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
  # datacenter_instances = false ## false by default

  ## Plugin Settings
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of go routines to use for collection and discovery of objects and metrics
  # collect_concurrency = 1
  # discover_concurrency = 1

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  # object_discovery_interval = "300s"

  ## timeout applies to any of the api request made to vcenter
  # timeout = "60s"

  ## When set to true, all samples are sent as integers. This makes the output
  ## data types backwards compatible with Telegraf 1.9 or lower. Normally all
  ## samples from vCenter, with the exception of percentages, are integer
  ## values, but under some conditions, some averaging takes place internally in
  ## the plugin. Setting this flag to "false" will send values as floats to
  ## preserve the full precision when averaging takes place.
  # use_int_samples = true

  ## Custom attributes from vCenter can be very useful for queries in order to slice the
  ## metrics along different dimension and for forming ad-hoc relationships. They are disabled
  ## by default, since they can add a considerable amount of tags to the resulting metrics. To
  ## enable, simply set custom_attribute_exclude to [] (empty set) and use custom_attribute_include
  ## to select the attributes you want to include.
  # custom_attribute_include = []
  # custom_attribute_exclude = ["*"]

  ## The number of vSphere 5 minute metric collection cycles to look back for non-realtime metrics. In
  ## some versions (6.7, 7.0 and possible more), certain metrics, such as cluster metrics, may be reported
  ## with a significant delay (>30min). If this happens, try increasing this number. Please note that increasing
  ## it too much may cause performance issues.
  # metric_lookback = 3

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true

  ## The Historical Interval value must match EXACTLY the interval in the daily
  ## "Interval Duration" found on the VCenter server under Configure > General > Statistics > Statistic intervals
  # historical_interval = "5m"
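Since the config above ships with placeholder values for the vCenter URL and password, a quick check before starting the containers can catch a forgotten edit. This is a minimal sketch; the function name check_placeholders is made up for this example, and the two patterns are simply the placeholder strings from the listing above.

```shell
# Report placeholder values that are still present in a telegraf.conf.
# Returns 0 if none are found, 1 otherwise.
check_placeholders() {
  conf=$1
  rc=0
  for pattern in 'vcenter-fqdn' 'password = "secret"'; do
    # -F treats the pattern as a fixed string, not a regular expression
    if grep -qF -- "$pattern" "$conf"; then
      echo "placeholder still present in $conf: $pattern" >&2
      rc=1
    fi
  done
  return $rc
}

# Example call:
# check_placeholders /srv/tig/telegraf.conf && echo "no placeholders left"
```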

Container Setup

Use the following docker compose file to start your containers. This solution uses ipvlan to assign each container its own IPv4 address. You do not need to assign these IPs to the host's NIC.

Change the passwords and secrets to suit your environment.

version: '3.6'

networks:
  ipvlan0:
    driver: ipvlan
    driver_opts:
      parent: ens192   # host NIC the containers attach to
    ipam:
      config:
        # Static ipv4_address entries need a user-defined subnet.
        # Without an explicit gateway, Docker uses the first usable address.
        - subnet: 172.29.30.0/23

services:
  telegraf:
    image: telegraf
    container_name: telegraf
    restart: always
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      - influxdb
    links:
      - influxdb
    ports:
      - '8125:8125'
    networks:
      ipvlan0:
        ipv4_address: 172.29.31.2

  influxdb:
    image: influxdb:2.6-alpine
    container_name: influxdb
    restart: always
    environment:
      # Assumption: these 1.x-style variables are not read by the 2.x image;
      # the initial user, organization, and bucket are created in the web UI.
      - INFLUXDB_DB=influx
      - INFLUXDB_ADMIN_USER=admin
      - INFLUXDB_ADMIN_PASSWORD=admin
    networks:
      ipvlan0:
        ipv4_address: 172.29.31.3
    ports:
      - '8086:8086'
    volumes:
      # InfluxDB 2.x stores its data in /var/lib/influxdb2
      - influxdb_data:/var/lib/influxdb2

  grafana:
    image: grafana/grafana
    container_name: grafana-server
    restart: always
    depends_on:
      - influxdb
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_INSTALL_PLUGINS=
    links:
      - influxdb
    networks:
      ipvlan0:
        ipv4_address: 172.29.31.4
    ports:
      - '3000:3000'
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  grafana_data: {}
  influxdb_data: {}
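The parent setting in the compose file must name a network interface that exists on your host, and ens192 is only the name on my system. As a rough sketch, the interface that carries the default route can be read from the output of ip route; the helper name default_iface is made up for this example, and on hosts with several NICs the default-route interface is not necessarily the one you want for ipvlan.

```shell
# Print the interface name from a "default via ... dev <iface> ..." route line
# read on standard input.
default_iface() {
  awk '$1 == "default" {
         for (i = 2; i < NF; i++)
           if ($i == "dev") { print $(i + 1); exit }
       }'
}

# Example call on the Docker host:
# ip route show default | default_iface
```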

Start Containers

In the directory where the docker compose file is located:

docker compose up -d

Check Containers

docker ps -a

The containers should all have a running state.
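If you prefer a scripted check over reading the table, the following sketch compares the running container names against the three container_name values from the compose file. The function name missing_containers is made up for this example.

```shell
# Read running container names (one per line) on standard input and print
# every expected container that is not among them.
missing_containers() {
  running=$(cat)
  for name in telegraf influxdb grafana-server; do
    # -x matches whole lines only, -F treats the name as a fixed string
    printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
  done
}

# Example call on the Docker host (empty output means all three are running):
# docker ps --filter status=running --format '{{.Names}}' | missing_containers
```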

Create TIG Config

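Open the InfluxDB web interface at your container's IP on port 8086 (in my case http://172.29.31.3:8086). Use the web interface to create the organization, bucket, and API token so that they match the values in your telegraf.conf file (in this tutorial the organization is "sddc" and the bucket is "vsphere"). If you change the token in telegraf.conf afterwards, restart the Telegraf container so that it picks up the new value.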
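To check from the command line that InfluxDB is up before opening the browser, you can query its /health endpoint, which returns a small JSON document with a status field. The sketch below only inspects that field; the function name influx_healthy is made up for this example, and the IP is the one used in this tutorial.

```shell
# Succeed if the JSON on standard input reports "status": "pass".
influx_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"pass"'
}

# Example call:
# curl -fsS http://172.29.31.3:8086/health | influx_healthy && echo "InfluxDB is ready"
```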

Import Grafana Dashboards

Access Grafana at your container's IP on port 3000 (in my case http://172.29.31.4:3000).

The Grafana dashboards are available on GitHub: https://github.com/jorgedlcruz/vmware-grafana/tree/master (Kudos to Jorge de la Cruz)

Import the Grafana dashboards:

  • Click Dashboards in the left-side menu.
  • Click New and select Import in the dropdown menu.
  • Paste the dashboard JSON text directly into the text area.

https://jorgedelacruz.uk/