Monitoring Discussion. This meeting was held on 2020-04-11.

Scope

Started by discussing which metrics we need, and whether metrics alone are enough
jeh: prometheus node-exporter is installed on all VMs today, which also reports the same data we have in shinken today
andrew: leveraging prod architecture, pros and cons
arturo: we may not want to follow prod too closely
brooke: retention. can we decide in prometheus what to retain for longer or shorter?
jeh: can use time or size based storage retention policy
arturo: retention is directly related to storage capacity

modules/prometheus/manifests/server.pp: $storage_retention = '730h', <--- default retention in the prometheus puppet module, what we are using in tools-prometheus (1 month)
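
To make the retention/storage link concrete, here is a minimal sketch that converts a retention string such as '730h' to days and applies the rough sizing rule from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample). The function names and the sample rate in the usage note are hypothetical, and the default of 2 bytes per sample is an assumption based on the documented 1-2 byte range, not a measurement from tools-prometheus.

```python
import re

# Seconds per unit for simple single-unit duration strings such as '730h' or '30d'.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def retention_to_seconds(value: str) -> int:
    """Parse a single-unit duration like '730h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def retention_to_days(value: str) -> float:
    """Express a retention duration in days."""
    return retention_to_seconds(value) / 86400


def disk_needed_bytes(retention: str, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Rough disk estimate: retention seconds * samples/s * bytes/sample.

    bytes_per_sample is an assumption; measure it on a real server.
    """
    return retention_to_seconds(retention) * samples_per_second * bytes_per_sample
```

For example, `retention_to_days('730h')` comes out at roughly 30.4 days, which matches the "1 month" note above, and a hypothetical ingest rate of 1000 samples per second would need about 5.3 GB at 2 bytes per sample.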

bd808: multitenancy, a prometheus instance per project
andrew: does prometheus even support multitenancy?
brooke: somehow yes, by using labels
andrew: security concerns with multitenancy? or only organizational concerns?
brooke: not today, but we need to keep security in mind
bd808: log aggregation is scarier
andrew: central prometheus server vs per project prometheus server
jeh: network scoping, security groups, etc
arturo: prometheus proxy
jeh: push gateway from prometheus server: not very smart for dynamic environments like VMs being created and destroyed
brooke:
- scope1: immediate need to shut down shinken. We can shut it down today and not lose much
- scope2: centralized & multi-tenant service
andrew: imagine a cloud project admin wanting a simple grafana dashboard with prometheus metrics.
jeh: a local prometheus server allows for custom, per-project alerts. And then a central grafana
brooke: we apparently are leaning towards prometheus
arturo: replacing shinken with prometheus+alertmanager could be a good experiment before introducing any cloud-wide solution
brooke: alertmanager outgoing alerts? smtp server?
jeh: yes, email + [..] How do we do it with shinken today?
brooke: let's make a task to replace shinken with prometheus+alertmanager. Alert: only for us for now.
brooke: Jason, would you be willing to handle the initial shinken replacement task?
jeh: sure. What about security groups?
andrew: probably update every security group out there. Few projects use shinken. Initial change only in the tools project.
andrew: share info with krenair
jeh: we already have a prometheus server in the tools project. Would it make sense to just extend it with alertmanager?
andrew: yeah, why not.
arturo: make this an OKR for proper credits for jason
brooke: maybe search/create an objective to relate all things together. Also, epic phab task https://phabricator.wikimedia.org/T194333
jeh: what do we call it? Use prometheus openstack integration to auto discover VMs (node exporter, puppet alerts on day 1, expand from there).
jeh: initial notifications by email + IRC bots?
brooke: legit!
brooke: next step, replace toolschecker, because shinken couldn't generate pages.
andrew: prometheus means "forethinker" in english
arturo: what about monitoring-infra
jeh: create new openstack project, add new prometheus server, update existing server groups (and new project template), configure prometheus openstack-sd-config to scrape vms, configure alert manager to email wmcs-team and notify cloud-feed IRC
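
As a rough illustration of the scrape side of jeh's plan, the sketch below builds a Prometheus configuration as a Python dict using the `openstack_sd_configs` service discovery block and an `alerting` section that points at an Alertmanager. The function name, region, identity endpoint and Alertmanager address are hypothetical placeholders, and authentication settings are left out entirely; a real config would need credentials. Port 9100 is the node-exporter default. Email to wmcs-team and IRC notifications would be set up in Alertmanager's own configuration, which is not shown here.

```python
import json


def build_prometheus_config(region: str, identity_endpoint: str,
                            alertmanager_target: str) -> dict:
    """Build a minimal Prometheus config dict (all arguments are placeholders)."""
    return {
        "scrape_configs": [
            {
                "job_name": "node",
                "openstack_sd_configs": [
                    {
                        "role": "instance",       # discover Nova instances (VMs)
                        "region": region,
                        "identity_endpoint": identity_endpoint,
                        "port": 9100,             # node-exporter default port
                        "all_tenants": True,      # look across projects
                    }
                ],
                "relabel_configs": [
                    # Use the VM name rather than IP:port as the instance label.
                    {
                        "source_labels": ["__meta_openstack_instance_name"],
                        "target_label": "instance",
                    },
                    # Keep the owning project as a label for per-project views.
                    {
                        "source_labels": ["__meta_openstack_project_id"],
                        "target_label": "project",
                    },
                ],
            }
        ],
        "alerting": {
            "alertmanagers": [
                {"static_configs": [{"targets": [alertmanager_target]}]}
            ]
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be pasted into a config for review.
    config = build_prometheus_config(
        "example-region", "https://keystone.example.org:5000/v3", "localhost:9093"
    )
    print(json.dumps(config, indent=2))
```

Attaching the project as a label is one way to approach the label-based multitenancy brooke mentioned earlier, since dashboards and alerts can then be filtered per project.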

brooke: metrics-infra, is shorter!

arturo: shinken project deprecation phab task: https://phabricator.wikimedia.org/T236547
arturo: an epic phab task with many subtasks: https://phabricator.wikimedia.org/T194333