Obsolete:Eqiad Migration Planning

From Wikitech

Coordination

Outstanding Server/System Readiness

  • App, Imagescalers, Bits, Jobrunners and API Apaches
    • All Ready - awaiting code deploy
  • Parsoid servers@Eqiad
    • Target - 1/11/13 (RobH)
  • Setup Ceph in eqiad for image storage (Swift in Tampa & Ceph in EQIAD) (Faidon/Mark)
    • 2 more servers set up (up to 4 now), intra-cluster replication ETA is Saturday early morning PST
    • holding off on adding more servers so as not to disrupt swift->ceph replication speed
    • swift->ceph copy at 17.5 TB out of 43 TB; should complete in about 12 days (very rough estimate)
    • some stability issues - working closely with the Ceph developers, fixes are landing in real time
    • PERC H310 issue - worked around with RAID 0
    • 0.56 has been released and deployed to the eqiad cluster
    • various other hiccups, both hardware & software related
    • still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki
  • Database Master switchover (PY / Asher)
    • MHA
    • https://bugzilla.wikimedia.org/show_bug.cgi?id=43453 - Checklist/script to switch datacenters - Tim
      • Automated DB/Apache switchover script
        • Tampa - Read-only
        • Eqiad - Grants needed
        • See "Actually Failing Over" below.
      • varnish configuration switchover script - Mark
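
As a rough sanity check on the swift->ceph copy estimate above, the sketch below works out how much data is left and what average rate a 12-day finish would need. It uses only the 17.5 TB, 43 TB and 12-day figures from the status notes; the constant-rate assumption, the decimal units and the function names are ours.

```python
# Figures taken from the status notes; a constant copy rate is an assumption.
TOTAL_TB = 43.0
COPIED_TB = 17.5
ETA_DAYS = 12.0


def remaining_tb(total_tb, copied_tb):
    """Data still to be copied, in TB."""
    return total_tb - copied_tb


def required_rate_mb_per_s(total_tb, copied_tb, days):
    """Average rate (decimal MB/s) needed to copy the remainder in `days`."""
    remaining_mb = remaining_tb(total_tb, copied_tb) * 1_000_000
    return remaining_mb / (days * 86_400)


if __name__ == "__main__":
    left = remaining_tb(TOTAL_TB, COPIED_TB)
    rate = required_rate_mb_per_s(TOTAL_TB, COPIED_TB, ETA_DAYS)
    print(f"{left:.1f} TB left, {left / ETA_DAYS:.3f} TB/day, {rate:.1f} MB/s")
```

If the measured replication rate drops well below this average, the 12-day estimate would slip accordingly.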

Software / Config Requirements


  • replicating the git checkouts, etc. to new /home
    • not an issue

Actually Failing Over

  • Sequence (-AI Asher)
    • deploy db.php with all shards set to read-only in both pmtpa and eqiad
    • redis failover - setting mc1001-1016 as masters, mc1-16 slaving from eqiad
    • deploy squid and mobile + bits varnish configs pointing to eqiad apaches
      • start with read-only mode
      • try to bypass puppet; the change must land within a minute or two
    • database warmup - scripting select query collection for every project, and warmup of all eqiad dbs
    • master swap every core db and writable es shard to eqiad
    • deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
      • the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
    • No DNS or Ceph/Swift changes required
    • Rollback plan - details still need to be added
    • turn off multi-write to NAS & turn on multi-write to Ceph
    • TEST! TEST! TEST!
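
The ordering of the sequence above, including the shard-by-shard variant of the master swap, can be sketched as a dry-run plan. This is a minimal illustration, not the actual switchover script: the `FailoverPlan` class, the step names and the shard labels are hypothetical, and each step only records what would be done instead of calling any deployment tooling.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverPlan:
    """Dry-run model of the failover ordering; nothing is actually executed."""
    shards: list
    log: list = field(default_factory=list)

    def step(self, action, target):
        # A real run would invoke deployment tooling here; we only record it.
        self.log.append((action, target))

    def run(self):
        # 1. db.php read-only in both datacenters.
        self.step("set_read_only", "pmtpa+eqiad")
        # 2. Redis masters move to eqiad.
        self.step("redis_failover", "eqiad")
        # 3. Squid and varnish point at the eqiad apaches.
        self.step("point_caches_at", "eqiad")
        # 4. Warm up the eqiad databases.
        self.step("warm_up_dbs", "eqiad")
        # 5. Per shard: swap the master, then lift read-only in eqiad only.
        for shard in self.shards:
            self.step("master_swap", shard)
            self.step("clear_read_only_eqiad", shard)
        return self.log


if __name__ == "__main__":
    # "shard-a" and "shard-b" are placeholder shard labels.
    for action, target in FailoverPlan(["shard-a", "shard-b"]).run():
        print(action, target)
```

Going shard by shard keeps each project read-only only for the duration of its own master swap, which is the point made in the notes above.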

Deployment - D-day

  • Day minus 1 (1/21/13) preparation Work
    • Automated test run
    • determine if deploying bits early is a possibility
  • D-Day 1/22/13
    • see the "Actually Failing Over" section above
  • D-day + 1 1/23/13

Risk & Mitigation

Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

  • What could make falling back to Tampa a big problem should the migration fail?
    • should Ceph fail?
    • should Swift@Tampa fail?
    • Database integrity
    • Performance
  • Need to determine Switchback Threshold - ??

Improving Switchover

  • pre-generate squid + varnish configs for different primary datacenter roles
  • implement MHA to better automate the mysql master failovers
  • migrate session storage to redis, with redundant replicas across colos
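
The first item above, pre-generating configs per primary datacenter role, could look roughly like the sketch below. It is only an illustration: the template text, the `backend_host` key and the `.example` hostnames are invented placeholders, since the real squid and varnish configs are not part of these notes.

```python
from string import Template

# Hypothetical backend names; the real cache configs are not in these notes.
BACKENDS = {
    "pmtpa": "appservers.pmtpa.example",
    "eqiad": "appservers.eqiad.example",
}

# Placeholder config format, not actual squid or varnish syntax.
TEMPLATE = Template("# primary datacenter: $primary\nbackend_host = $host\n")


def render_all(backends):
    """Return one rendered config per candidate primary datacenter."""
    return {
        dc: TEMPLATE.substitute(primary=dc, host=host)
        for dc, host in backends.items()
    }


if __name__ == "__main__":
    for dc, config in render_all(BACKENDS).items():
        print(f"--- {dc} ---")
        print(config)
```

With both variants rendered ahead of time, a switchover only has to select and push the right file, which helps with the one-to-two-minute window mentioned for the cache config deploy.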

See more

Parking Lot Issues

  • Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
    • Migration needs to happen before Fundraising season starts in Nov.
    • Vacation 'freeze'; all hands on deck week before and after deployment
    • migrate ns1 from tampa to ashburn, but not a critical item.
  • An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).


  • create/doc CheckList - PY/ChrisM

AI - automated test scripts - ChrisM

Use Cases - Tests

  • Developer
    • Check code in/out
    • code review
    • Code push/deploy
    • revert deployment
  • User
    • register
    • search article
    • read article
    • comment on article
    • edit article
    • create article
    • localization
  • Community member
    • tag article
    • (exercise special pages features)
  • Ops
    • monitoring works - ganglia, nagios, torrus, .....
    • check amanda backups
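
The use cases above could be driven by a small check runner along these lines. This is a sketch under stated assumptions: the `run_checks` and `all_passed` helpers are hypothetical, and the lambdas are stand-ins for real probes (HTTP requests, edits through the API, monitoring queries) that would have to be written against the actual sites.

```python
def run_checks(checks):
    """Run each (role, name, fn) check; return a list of (role, name, passed)."""
    results = []
    for role, name, fn in checks:
        try:
            passed = bool(fn())
        except Exception:
            # A crashing probe counts as a failed check, not a crashed run.
            passed = False
        results.append((role, name, passed))
    return results


def all_passed(results):
    """True only if every check in the results passed."""
    return all(passed for _, _, passed in results)


# Placeholder probes that always pass; real ones would exercise the site.
CHECKS = [
    ("User", "read article", lambda: True),
    ("User", "edit article", lambda: True),
    ("Ops", "monitoring works", lambda: True),
]

if __name__ == "__main__":
    for role, name, passed in run_checks(CHECKS):
        print(f"{role:>8} | {name:<20} | {'ok' if passed else 'FAIL'}")
```

Running the same list on Day minus 1 and again right after the failover gives a like-for-like comparison of what changed.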