Incidents/20140619-parsercache/IRC log

IRC conversation on #wikimedia-operations during the 2014-06-19 pc1002 outage. All times are UTC.

Trimming and annotations made by Antoine "hashar" Musso.

[19:00:19] <logmsgbot>	 !log reedy Finished scap: scap 1.24wmf10 take 2... (duration: 22m 59s)
[19:00:24] <morebots>	 Logged the message, Master
[19:03:05] <grrrit-wm>	 (PS1) Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 
[19:03:51] <grrrit-wm>	 (CR) Reedy: [C: 2] Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 (owner: Reedy)
[19:04:28] <grrrit-wm>	 (Merged) jenkins-bot: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 (owner: Reedy)
[19:06:27] <logmsgbot>	 !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf9
[19:09:15] <grrrit-wm>	 (PS1) Reedy: group0 to 1.24wmf10 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140770 
# And here pc1002 is reported dying:
[19:09:51] <icinga-wm>	 PROBLEM - check configured eth on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:01] <icinga-wm>	 PROBLEM - Disk space on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:01] <icinga-wm>	 PROBLEM - check if dhclient is running on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - MySQL disk space on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - mysqld processes on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - DPKG on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - RAID on pc1002 is CRITICAL: Timeout while attempting connection  

# First user reports (icinga still beat them by a few seconds, yeah!)
[19:10:26] <Excirial>	 Fyi: Just received several 503's on Enwiki.
[19:10:36] <twkozlowski>	 same on Commons, peeps are reporting
[19:11:13] <jorm>	 all wikis.
[19:11:31] <Reedy>	 Sweet

[19:11:41] <icinga-wm>	 PROBLEM - Host pc1002 is DOWN: PING CRITICAL - Packet loss = 100%  
[19:11:48] <Reedy>	 And that'll be why
# Root cause analysis performed by Reedy in less than 30 seconds.

# Then the Apaches start dying...
[19:11:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[19:11:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[19:11:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
....

# Debugging time!
[19:12:27] <RobH>	 checking pc1002
[19:12:56] <RobH>	 seems its borked completely and unresponsive to ssh
[19:12:59] <RobH>	 checking mgmt
[19:13:28] <mutante>	 oom death
[19:13:32] <mutante>	 powercycle
[19:13:35] <RobH>	 im rebooting it
# Indeed, not much we can do since the box is dead.
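
# If the OOM guess needs confirming once the host is back, the kernel usually leaves a trace
# in the persistent logs, assuming it managed to write anything before locking up. A rough
# check, with typical Ubuntu log paths assumed:
#
#   grep -iE 'out of memory|oom-killer' /var/log/kern.log /var/log/syslog
#   ipmitool sel list | tail    # hardware event log survives the reboot (needs ipmitool)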

[19:13:36] <ori>	 Reedy: i'd revert
[19:13:38] <ori>	 before it boots up
# If there is one thing we do properly, it is reverting. The revert was proposed less than two
# minutes after the cause was acknowledged.
[19:14:20] <ori>	 Reedy: it's going to boot with a cold, empty memcached
[19:14:26] <ori>	 if it OOMd there's a good chance it's related to the deploy
[19:14:37] <Reedy>	 isn't it mysql?
[19:14:52] <ori>	 better start populating the cache using known-good mw code

# And reverting...
[19:15:16] <logmsgbot>	 !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias back to 1.24wmf8
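
# For reference, a rough sketch of what that revert amounts to on the deployment host
# (paths and exact arguments are assumptions, not taken from this log):
#
#   cd /srv/mediawiki-staging                          # mediawiki-config checkout used for deploys
#   git revert --no-edit <commit>                      # undo "Wikipedias to 1.24wmf9"
#   sync-wikiversions "wikipedias back to 1.24wmf8"    # rebuilds wikiversions.cdb and syncs it out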

[19:15:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time  
[19:15:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time  
[19:15:48] <RobH>	 !loh powercycled pc1002
[19:16:00] <ori>	 coming back up
# The wikis came back because of the revert. At that point MySQL was not yet running on pc1002.
[19:16:36] <RobH>	 it will be a few minutes

[19:22:10] <Reedy>	 Good to see https://wikitech.wikimedia.org/wiki/Parser_cache is up to date
# Points at bug 1: we have no documentation at all for the parser cache.

[19:22:32] <Reedy>	 146 Fatal error: Call to a member function real_escape_string() on a non-object in /usr/local/apache/common-local/php-1.24wmf9/includes/db/DatabaseMysqli.php on line 289
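
# The fatal means DatabaseMysqli never got a live connection handle (pc1002 being down), so
# calling real_escape_string() on it blows up. A rough way to count how many requests hit this,
# assuming an aggregated fatal log is available (the path is an assumption):
#
#   grep -c 'real_escape_string() on a non-object' /a/mw-log/fatal.log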

[19:22:41] <icinga-wm>	 RECOVERY - Host pc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms  
[19:22:41] <mutante>	 pc1002 login: 
# pc1002 finally back up


[19:29:46] <RobH>	 its back up
[19:29:51] <RobH>	 well, its responsive to ssh now
[19:29:54] <RobH>	 (im on it)
[19:30:07] <_joe_>	 RobH: is mysql up?
[19:31:04] <RobH>	 nope, wont fire
[19:31:12] <RobH>	 _joe_: if you know about the server, hop on ;]
[19:31:48] <_joe_>	 I don't even know wich one it is
[19:31:52] <ori>	 pc1002
[19:31:56] <_joe_>	 I can try to troubleshoot mysql not starting
[19:31:58] <mutante>	 !log started mysql on pc1002
[19:31:59] <springle>	 wtf
[19:32:01] <icinga-wm>	 RECOVERY - mysqld processes on pc1002 is OK: PROCS OK: 1 process with command name mysqld  
[19:32:03] <springle>	 what happened
[19:32:04] <morebots>	 Logged the message, Master

# Summary for springle (DBA)
[19:32:28] <ori>	 springle: wmf10 to all wikis -> pcs1002 unresponsive -> apache busy workers -> 503s

[19:32:29] <mutante>	 RobH: /etc/init.d/mysql start
[19:32:31] <RobH>	 cuz normal upstart didnt do it
# That should be fixed: mysql was not brought back up automatically by its normal upstart job.
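
# A minimal way to see what is (or is not) managing mysqld on that box, assuming a stock
# Ubuntu/upstart setup (which init job the MySQL/MariaDB package on pc1002 actually ships is
# an assumption):
#
#   status mysql                 # upstart's view of the job, if one exists
#   service mysql status         # dispatches to upstart or the SysV script, whichever owns it
#   /etc/init.d/mysql start      # the SysV script that was used here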

[19:32:43] <ori>	 springle: pcs1002 hard reboot, mysql on pcs1002 did not come up
[19:32:48] <mutante>	 it first said something about the pid file existing
[19:32:49] <ori>	 springle: that's where we're at
[19:32:50] <mutante>	 but it's not set
[19:33:15] <chasemp>	 ori springle then mutante did something not mysql is up there :)
[19:33:19] <mutante>	 springle: 
[19:33:21] <mutante>	 root@pc1002:~# /etc/init.d/mysql status * MySQL is not running, but PID file exists
[19:33:27] <mutante>	 root@pc1002:~# /etc/init.d/mysql status * MySQL is not running
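
# A PID file that does not match a running mysqld is typically either left over from before the
# crash or written by a start attempt still in progress (InnoDB crash recovery can take a while).
# A rough way to check what the file points at (the path is the usual default and an assumption
# for pc1002):
#
#   PIDFILE=/var/run/mysqld/mysqld.pid
#   cat "$PIDFILE"               # PID recorded by the last start attempt
#   ps -p "$(cat "$PIDFILE")"    # no such process means the file is stale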

[19:33:33] <bd808>	 ori, springle: looks like it may have started with many "DBError: DB connection error: Too many connections (10.64.16.157)" events
[19:33:35] <springle>	 mysql is recovering
[19:33:35] <mutante>	 root@pc1002:~# /etc/init.d/mysql status * MySQL running (6031)

[19:34:36] <Reedy>	 springle: I told you we needed to replace it with MongoDB
[19:35:23] <springle>	 Reedy: :)


[19:35:44] <ori>	 so,  DB connection error: Too many connections
[19:36:18] <ori>	 on rollout of new release
[19:36:19] <ori>	 is consistent with a run on the parser cache DBs due to massive cache invalidation 
[19:36:31] <Reedy>	 First time we've seen this...
[19:36:35] <Reedy>	 Or at least, in a long time anyway
[19:36:37] <_joe_>	 looking at the logs... mysql server became unresponsive/slow, connection piled up, this blocked workers in apache 
[19:36:56] <_joe_>	 probably this has happened before any failover mechanism kicked in
[19:37:06] <robla>	 Reedy: yeah....we've had at least one parser cache outage that almost went unnoticed, so this is a weird new development
[19:37:08] <bd808>	 sql-bagostuff events in logstash -- https://logstash.wikimedia.org/#dashboard/temp/S9cjGaUCRriwaPpk9xNMCw
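
# To watch that kind of connection pileup on the parser cache host directly, the standard MySQL
# status counters are enough (statements are stock MySQL; the "parsercache" schema name is an
# assumption):
#
#   mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
#   mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'"
#   mysql -e "SHOW FULL PROCESSLIST" | grep -c parsercache    # rough count of pc connections
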
....