
Portal:Cloud VPS/Admin/notes/Monitoring

From Wikitech

Monitoring Discussion. This meeting was held on 2020-04-11.

Scope

  • Started by discussing which metrics we need, and whether metrics alone are enough
  • jeh: prometheus node-exporter is installed on all VMs today, which also reports the same data we have in shinken today
  • andrew: leveraging prod architecture, pros and cons
  • arturo: we may not want to follow prod too closely
  • brooke: retention. can we decide in prometheus what to retain for more or less time?
  • jeh: can use time or size based storage retention policy
  • arturo: retention is directly related to storage capacity

modules/prometheus/manifests/server.pp: $storage_retention = '730h', <--- default retention in the prometheus puppet module, which is what we are using in tools-prometheus (about 1 month)
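
As a rough cross-check of the retention note above, the sketch below converts the `730h` setting to days and applies the capacity rule of thumb from the Prometheus storage documentation (disk needed is roughly retention seconds × samples ingested per second × bytes per sample). The ingestion rate and the 2 bytes-per-sample figure are illustrative assumptions, not measurements from tools-prometheus, and the helper names are hypothetical.

```python
import re


def retention_hours(value: str) -> int:
    """Parse a retention string such as '730h' or '30d' into hours.

    Only the 'h' and 'd' suffixes are handled in this sketch.
    """
    match = re.fullmatch(r"(\d+)([hd])", value)
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount if unit == "h" else amount * 24


def estimated_disk_bytes(retention: str,
                         samples_per_second: float,
                         bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb disk estimate: seconds * samples/s * bytes/sample.

    bytes_per_sample defaults to 2.0, an assumed upper-end average.
    """
    seconds = retention_hours(retention) * 3600
    return seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    hours = retention_hours("730h")
    print(f"730h = {hours / 24:.1f} days")
    # Hypothetical ingestion rate of 10,000 samples per second.
    gib = estimated_disk_bytes("730h", 10_000) / 2**30
    print(f"estimated disk for 730h at 10k samples/s: {gib:.1f} GiB")
```

This makes arturo's point concrete: for a fixed ingestion rate, the disk estimate scales linearly with the retention window.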

  • bd808: multitenancy, a prometheus instance per project
  • andrew: does prometheus even support multitenancy?
  • brooke: to some extent yes, by using labels
  • andrew: security concerns with multitenancy? or only organizational concerns?
  • brooke: not today, but we need to keep security in mind
  • bd808: log aggregation is scarier
  • andrew: central prometheus server vs per project prometheus server
  • jeh: network scoping, security groups, etc
  • arturo: prometheus proxy
  • jeh: push gateway from prometheus server: not very smart for dynamic environments like VMs being created and destroyed
  • brooke:
    • scope1: immediate need to shut down shinken. We could shut it down today and not lose much
    • scope2: centralized & multi-tenant service
  • andrew: imagine a cloud project admin wanting a simple grafana dashboard with prometheus metrics.
  • jeh: a local prometheus server allows for custom, per-project alerts. And then a central grafana
  • brooke: we apparently are leaning towards prometheus
  • arturo: replacing shinken with prometheus+alertmanager could be a good experiment before introducing any cloud-wide solution
  • brooke: alertmanager outgoing alerts? smtp server?
  • jeh: yes, email + [..] How do we do it with shinken today?
  • brooke: let's make a task to replace shinken with prometheus+alertmanager. Alert: only for us for now.
  • brooke: Jason, would you be willing to handle the initial shinken replacement task?
  • jeh: sure. What about security groups?
  • andrew: probably update every security group out there. Few projects use shinken. Initial change only in the tools project.
  • andrew: share info with krenair
  • jeh: we already have a prometheus server in the tools project. Would it make sense to just extend it with alertmanager?
  • andrew: yeah, why not.
  • arturo: make this an OKR for proper credits for jason
  • brooke: maybe search/create an objective to relate all things together. Also, epic phab task https://phabricator.wikimedia.org/T194333
  • jeh: what do we call it? Use prometheus openstack integration to auto discover VMs (node exporter, puppet alerts on day 1, expand from there).
  • jeh: initial notifications by email + IRC bots?
  • brooke: legit!
  • brooke: next step, replace toolschecker, because shinken couldn't generate pages.
  • andrew: prometheus in english means forethinker
  • arturo: what about monitoring-infra
  • jeh: create new openstack project, add new prometheus server, update existing server groups (and new project template), configure prometheus openstack-sd-config to scrape VMs, configure alertmanager to email wmcs-team and notify cloud-feed IRC
  • brooke: metrics-infra, is shorter!
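
The plan jeh outlines above (OpenStack service discovery for node-exporter, with labels to tell projects apart) might translate into a scrape job along these lines. This is a minimal sketch, not the configuration that was deployed: the job name, region, identity endpoint, credentials, and the `project` target label are hypothetical placeholders. The `openstack_sd_configs` fields and the `__meta_openstack_*` labels are those documented by Prometheus, and 9100 is the default node-exporter port. The script emits JSON, which is also valid YAML.

```python
import json


def build_scrape_job(identity_endpoint: str,
                     region: str,
                     username: str,
                     password: str,
                     node_exporter_port: int = 9100) -> dict:
    """Build one Prometheus scrape job that discovers VMs via OpenStack."""
    return {
        "job_name": "node",  # hypothetical job name
        "openstack_sd_configs": [{
            "role": "instance",            # discover Nova instances
            "region": region,
            "identity_endpoint": identity_endpoint,
            "username": username,
            "password": password,
            "domain_name": "default",      # assumed Keystone domain
            "all_tenants": True,           # needs a sufficiently privileged account
            "port": node_exporter_port,
            "refresh_interval": "5m",
        }],
        "relabel_configs": [
            # Scrape only running VMs, so destroyed or stopped ones drop out.
            {"source_labels": ["__meta_openstack_instance_status"],
             "regex": "ACTIVE", "action": "keep"},
            # Use the VM name as the instance label.
            {"source_labels": ["__meta_openstack_instance_name"],
             "target_label": "instance"},
            # Label each target with its project for per-project queries.
            {"source_labels": ["__meta_openstack_project_id"],
             "target_label": "project"},
        ],
    }


if __name__ == "__main__":
    job = build_scrape_job(
        identity_endpoint="https://keystone.example.org:5000/v3",  # placeholder
        region="example-region",                                   # placeholder
        username="observer",                                       # placeholder
        password="not-a-real-password",                            # placeholder
    )
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```

Because discovery is refreshed periodically, newly created VMs are picked up and deleted ones disappear without manual changes, which suits the dynamic environment raised earlier in the discussion. Alertmanager routing for email and IRC is left out, since the notes do not give the receiver details.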