User:JMeybohm (WMF)/Docker-Registry-Stresstest

From Wikitech

Potential bottlenecks

  • Swift is active/active in both DCs
  • Registry is only active in codfw
  • We do not have different endpoints for docker-rw and docker-ro registry
  • We could potentially wait for the images to be replicated (Swift-wise) after pushing from CI
  • Swift <-> docker-registry: Probably fine, but better if we could read from the DC-local Swift
  • docker-registry <-> docker clients: Potentially bad, 1GbE shared link on Ganeti, which is also used to pull images from Swift.
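
The idea of waiting for replication after a CI push could be sketched as a polling loop against a read endpoint in the other DC. This is only an illustration under assumptions: the function name wait_for_manifest and the per-DC endpoint in the usage note are hypothetical (no separate read endpoints exist yet, as noted above), authentication is left out, and a manifest being served is only a proxy for replication, since layer blobs may still be in flight.

```shell
#!/bin/bash
# Hypothetical helper: poll a registry endpoint until it serves the manifest.
# Usage: wait_for_manifest <registry-base-url> <image> <tag> [attempts] [delay-seconds]
wait_for_manifest() {
    local base=$1 image=$2 tag=$3 attempts=${4:-30} delay=${5:-2}
    local i code
    for ((i = 1; i <= attempts; i++)); do
        # HEAD the manifest; the Registry HTTP API v2 answers 200 when it exists.
        code=$(curl -s -o /dev/null -I -w '%{http_code}' \
            -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
            "${base}/v2/${image}/manifests/${tag}")
        if [ "$code" = "200" ]; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}
```

A CI job might call it as `wait_for_manifest https://registry-ro.example.wmnet restricted/mediawiki-multiversion "$TAG" 60 5` (hostname made up) and only trigger the deployment once it returns 0.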

Actions?

  • We could potentially have one docker-registry per rack row, so docker-registry traffic would not leave rows (as scap proxy does)
  • Create a read-only docker registry discovery record that points to both DCs
  • Can we cache (more/at all) with nginx on the docker-registry nodes? DONE
  • What about client_body_buffer_size while pushing?


Tests

I'm running 3 sequential image pulls on each k8s node in codfw (see pulltiming.sh below), going from 1 host to 19 hosts in parallel.

Graphs recorded for these runs (images not included):

  • Network per registry node (local nginx cache & dragonfly), 73 nodes
  • Network per registry node (local nginx cache & dragonfly)
  • Network per registry node (without local nginx cache)
  • Network per registry node (with local nginx cache)

Test steps

# On cumin1001: copy the script to all test hosts
HOSTS="kubernetes[2001-2017].codfw.wmnet,kubestage[2001-2002].codfw.wmnet"
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock clush -v -w "$HOSTS" --copy /home/jayme/pulltiming.sh --dest /home/jayme/

# Ramp up: pick n random hosts per step (n = 1..19) and run the pulls on them in parallel
for n in $(seq 1 19); do hosts=$(nodeset --pick=$n -f "$HOSTS"); sudo cumin --force "$hosts" "/home/jayme/pulltiming.sh p${n}"; done

# Grab results from nodes
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock clush -v -w "$HOSTS" --rcopy /home/jayme/pulltiming --dest /home/jayme/pulltiming/
sudo chown jayme:wikidev -R pulltiming


# local
ssh cumin1001.eqiad.wmnet "tar cfz - pulltiming" | tar xfz -

pulltiming.sh

#!/bin/bash

REPO=docker-registry.discovery.wmnet
IMAGE=restricted/mediawiki-multiversion
TAG=2021-05-14-185433-publish

test_name=$1
iterations=${2:-3}
repo_uri="https://${REPO}"
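# Locate the Docker client config holding the registry credentials
# (root's config is preferred over the kubelet's when it is readable).
config_json=/var/lib/kubelet/config.json
if sudo test -r "/root/.docker/config.json"; then
    config_json=/root/.docker/config.json
fi
# ".auth" is base64("user:password"); "// empty" turns a missing entry into ""
# instead of the literal string "null", so the check below actually fires.
basicauth=$(sudo cat "${config_json}" | jq -r ".auths.\"${repo_uri}\".auth // empty" | base64 -d)
if [ -z "$basicauth" ]; then
    echo "Credentials for docker registry not found, aborting" >&2
    exit 1
fi
# Split on the first colon only, so passwords containing ':' or spaces survive.
reg_user=${basicauth%%:*}
reg_pass=${basicauth#*:}
# Craft the base64-encoded AuthConfig object needed to authenticate to
# docker-registry.discovery.wmnet; jq takes care of JSON escaping.
auth=$(jq -rn --arg u "$reg_user" --arg p "$reg_pass" --arg s "$repo_uri" \
    '{username: $u, password: $p, serveraddress: $s} | tojson | @base64')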

cd /home/jayme || exit 1
mkdir -p ./pulltiming/

if [ -n "$test_name" ]; then
    outfile_base="${HOSTNAME}_${test_name}_$(date +%s)"
else
    outfile_base="${HOSTNAME}_$(date +%s)"
fi
for idx in $(seq "$iterations"); do
    outfile="${outfile_base}_${idx}"
    sudo docker rmi "${REPO}/${IMAGE}:${TAG}" > /dev/null 2>&1
    sudo curl -s --unix-socket /var/run/docker.sock -XPOST \
        -d "fromImage=${REPO}/${IMAGE}&tag=${TAG}" \
        -H "X-Registry-Auth: ${auth}" \
        'http://docker/v1.18/images/create' | \
            jq -c --unbuffered '. + {time: now}' > "./pulltiming/${outfile}.json"
done

chown jayme:wikidev -R ./pulltiming/

Script to parse/process the data (pulltiming.py)

https://phabricator.wikimedia.org/P15954
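
For a quick look at the results without the Python script, the duration of a single pull can be approximated as the span between the first and last progress message in one result file, since pulltiming.sh stamps every message with a time field. The following is a minimal sketch of that idea, not a reimplementation of pulltiming.py (whose contents are only available via the paste above); the function names pull_duration and pull_durations are made up for illustration.

```shell
#!/bin/bash
# Sketch only: not the author's pulltiming.py.

# Print the seconds between the first and last progress message of one pull.
# Prints nothing for an empty file.
pull_duration() {
    jq -s 'map(.time) | if length > 0 then max - min else empty end' "$1"
}

# Print "<file name><TAB><duration>" for every result file in a directory.
pull_durations() {
    local dir=${1:-./pulltiming} f
    for f in "$dir"/*.json; do
        # Skip the unexpanded glob when the directory has no result files.
        [ -e "$f" ] || continue
        printf '%s\t%s\n' "$(basename "$f" .json)" "$(pull_duration "$f")"
    done
}
```

Running `pull_durations ./pulltiming` in the directory fetched from cumin1001 would list one line per pull. Because the file names embed the test name (p1 to p19), the output can be grouped by parallelism level with standard tools such as sort.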