
Incidents/20161021-Maps

From Wikitech

Summary

Between 18:50 UTC and 19:20 UTC on October 21st, 2016, maps.wikimedia.org stopped rendering tiles because the Cassandra backend was unavailable.

Timeline

  • 18:50 UTC: Cassandra is wrongly reinitialized on maps2004.codfw.wmnet, deleting all Cassandra data on maps2004. Kartotherian starts failing with org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level LOCAL_ONE.
  • 19:20 UTC: traffic redirected to maps eqiad cluster, user traffic is served again without error
  • 19:40 UTC: full deployment of new traffic configuration
  • 21:13 UTC: permissions are reset on maps/cassandra codfw cluster, kartotherian starts working again on the codfw cluster

Conclusions

  • The main trigger for this is human error.
  • maps/cassandra has a replication factor of 1 on the "system_auth" keyspace. This means that losing one node potentially breaks authentication.
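
The effect of the replication factor can be illustrated with a small sketch. It is a simplified model of ring-based replica placement, not Cassandra's actual implementation: the hash function, the key name "cassandra-user" and the hostnames other than maps2004 are assumptions made for illustration only.

```python
from bisect import bisect_right
from hashlib import md5


def token(value: str) -> int:
    """Map a string to a position on the ring (illustrative hash, not Cassandra's partitioner)."""
    return int(md5(value.encode()).hexdigest(), 16)


def replicas(key: str, nodes: list[str], rf: int) -> list[str]:
    """Return the rf nodes that hold a copy of key, walking the ring clockwise."""
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(min(rf, len(ring)))]


def can_read_at_one(key: str, nodes: list[str], rf: int, down: set[str]) -> bool:
    """A read at consistency level ONE needs at least one live replica."""
    return any(n not in down for n in replicas(key, nodes, rf))


if __name__ == "__main__":
    # Hostnames are hypothetical apart from maps2004, which appears in the timeline.
    nodes = ["maps2001", "maps2002", "maps2003", "maps2004"]
    owner = replicas("cassandra-user", nodes, 1)[0]
    # With RF=1 the only copy lives on one node; losing it makes the row unreadable.
    print(can_read_at_one("cassandra-user", nodes, 1, {owner}))  # False
    # With RF=3 two other replicas can still answer.
    print(can_read_at_one("cassandra-user", nodes, 3, {owner}))  # True
```

With a replication factor of 1, each row of authentication data has exactly one owner, so whichever node is lost takes some credentials with it.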

Actionables

  • increase replication factor on system_auth keyspace task T149074
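
A minimal sketch of what the replication change could look like, written as a Python helper that builds the CQL statement. The datacenter name "codfw" as known to Cassandra, the target replication factor of 3 and the use of NetworkTopologyStrategy are assumptions; the actual values are tracked in the task above.

```python
def alter_system_auth_cql(rf_by_dc: dict[str, int]) -> str:
    """Build an ALTER KEYSPACE statement for system_auth.

    rf_by_dc maps a Cassandra datacenter name to its desired replication
    factor. The names and values passed in are assumptions, not taken from
    the incident report.
    """
    if not rf_by_dc:
        raise ValueError("at least one datacenter is required")
    for dc, rf in rf_by_dc.items():
        if not isinstance(rf, int) or rf < 1:
            raise ValueError(f"invalid replication factor for {dc}: {rf!r}")
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {rf}" for dc, rf in sorted(rf_by_dc.items())]
    return "ALTER KEYSPACE system_auth WITH replication = {" + ", ".join(parts) + "};"


if __name__ == "__main__":
    # Hypothetical target: three replicas in the codfw datacenter.
    print(alter_system_auth_cql({"codfw": 3}))
```

After a replication factor is raised, existing data is only copied to the new replicas once a repair of the keyspace has run (for example nodetool repair system_auth on each node), so the change is not complete until that is done.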