
User:Razzi/Plan to drain hadoop cluster

From Wikitech

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

For production, draining the cluster means shutting down input by disabling the camus timers on an-launcher.

With the timers disabled, no data flows in.

Some jobs, like refine, are scheduled.

The cluster should drain in less than an hour.

Kafka has 7-day retention, so it acts as a buffer while input is stopped.

Now that we have the capacity scheduler, you can disable queues.
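
As a rough sketch of what disabling a queue could look like with the capacity scheduler: a queue's `state` property in capacity-scheduler.xml is set to STOPPED, and the ResourceManager is then told to reload its queues. The queue path `root.default` below is only an illustrative assumption, not taken from this cluster's configuration, and the helper function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: stop a Capacity Scheduler queue so it accepts no new applications.
# Assumption: the queue path (e.g. root.default) is illustrative only.

# Build the capacity-scheduler.xml property name that controls a queue's state.
queue_state_property() {
  printf 'yarn.scheduler.capacity.%s.state\n' "$1"
}

# Print the XML property that marks the given queue as STOPPED.
queue_stop_xml() {
  printf '<property>\n  <name>%s</name>\n  <value>STOPPED</value>\n</property>\n' \
    "$(queue_state_property "$1")"
}

# Usage (not executed here):
#   queue_stop_xml root.default     # add the output to capacity-scheduler.xml
#   yarn rmadmin -refreshQueues     # run as a YARN admin user to apply the change
```

A stopped queue rejects new submissions while applications already running in it are allowed to finish, which fits the idea of letting the cluster drain.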

Plan:

  • disable puppet on an-master1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Disable jobs on an-launcher1002
    • sudo systemctl stop 'camus-*'
    • sudo systemctl stop 'drop-*'
    • sudo systemctl stop 'hdfs-*'
    • sudo systemctl stop 'mediawiki-*'
    • sudo systemctl stop 'refine_*'
    • sudo systemctl stop 'refinery-*'
    • sudo systemctl stop 'reportupdater-*'
  • disable queue
    • sudo systemctl stop hadoop-yarn-resourcemanager[1]
  • kill yarn applications
    • for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill "$jobId"; done
  • enable safe mode
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • checkpoint
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • create snapshot tar
    • sudo su
    • cd /srv/hadoop/namenode
    • tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • copy snapshot to elsewhere
    • (from my personal computer)
    • scp -3 an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz thorium.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz
      • Based on scp-ing a test file, this will take about 30 minutes; that's acceptable, but if there's a faster way (distcp?) it'd be good to know
  • change uids
  • reimage
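
The first steps of the plan above could be sketched as a dry-run script, which makes it easier to review the commands before running anything. This is a minimal sketch, not a tested runbook: the `DRY_RUN` variable and the function names are hypothetical helpers introduced for this example, and the unit patterns are copied from the list above.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the drain steps. DRY_RUN defaults to 1, so commands are
# printed rather than executed. Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Unit patterns from the plan (to be stopped on an-launcher1002).
UNIT_PATTERNS=('camus-*' 'drop-*' 'hdfs-*' 'mediawiki-*' 'refine_*' 'refinery-*' 'reportupdater-*')

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Stop every unit pattern; the pattern is passed quoted so systemctl does the matching.
stop_launcher_units() {
  local pattern
  for pattern in "${UNIT_PATTERNS[@]}"; do
    run sudo systemctl stop "$pattern"
  done
}

# Read `yarn application -list` output on stdin and print only the application IDs.
# Matching on the application_ prefix avoids depending on the number of header lines.
extract_app_ids() {
  awk '$1 ~ /^application_/ { print $1 }'
}

# Read `hdfs dfsadmin -safemode get` output on stdin; succeed if safe mode is on.
safemode_is_on() {
  grep -q 'Safe mode is ON'
}
```

For example, `yarn application -list | extract_app_ids` would feed the kill loop, and `sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get | safemode_is_on` could gate the checkpoint step.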


stop the cluster

make a backup

change uids

reimage
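
Before reimaging, it may be worth confirming that the copied backup matches the original. The sketch below assumes `sha256sum` from GNU coreutils is available on the hosts involved; the function names are hypothetical and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Sketch: check that a copied snapshot tarball is identical to the original.
# Assumption: sha256sum (GNU coreutils) is installed.

# Print only the SHA-256 hex digest of a file.
file_sha256() {
  sha256sum "$1" | awk '{ print $1 }'
}

# Succeed if the two files have identical digests.
same_checksum() {
  [ "$(file_sha256 "$1")" = "$(file_sha256 "$2")" ]
}

# Usage (not executed here): run file_sha256 on the tarball on each host and
# compare the two digests by eye, or call same_checksum on two local copies.
```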

  1. https://docs.cloudera.com/runtime/7.2.1/yarn-allocate-resources/topics/yarn-start-and-stop-queues.html (this is for the Cloudera UI :P); see also https://stackoverflow.com/questions/42589764/how-to-delete-a-queue-in-yarn