Dumps/OtherMisc

From Wikitech

This page documents various dump sets that are produced daily or weekly and are not part of the generation of the xml/sql dumps.

All of these dumps run from a snapshot host dedicated to 'misc' dump generation (everything other than the xml/sql dumps), against database servers designated 'vslow, dumps'.

The dump scripts are in our git puppet repo.

If a cron job encounters errors when it runs, its output is sent to ops-dumps@wikimedia.org.

  • Global block table:
    • dumped weekly
    • contains an sql-format dump of information in the global block table
    • managed by mw:Extension:GlobalBlocking (code)
    • Issues: Unless the database server goes away during the run, or database credentials change, this job should just run
  • Cirrus search dumps:
    • dumped weekly
    • contains text indices, the file index (for Commons) and the metadata index (for the entire cirrus cluster) in JSON format
    • run by a maintenance script in mw:Extension:CirrusSearch (code)
    • Issues: it's been quite reliable so far
  • Content Translation dumps:
    • dumped weekly
    • contains parallel corpora that can be used by developers working on machine translation.
    • run by a maintenance script in mw:Extension:ContentTranslation (code)
    • Issues: it has run out of memory when the language files being dumped have too much data; these can be split apart in order to resolve the problem. Example: see this phab task.
  • Media info:
    • dumped weekly
    • two files for each wiki: one listing the titles of media files stored locally, and one listing the titles of media files used on the project but stored remotely (on Commons).
    • run by a shell wrapper around the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
  • Page titles:
    • dumped daily
    • contains a list of all page titles in the main namespace (NS 0) per project
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
  • Media titles:
    • dumped daily
    • contains a list of all titles in the Media namespace (NS 6) per project
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
  • Short url mappings:
    • dumped weekly
    • each line contains an entry of the form short-url|long-url
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
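
The retry behaviour described for the onallwikis.py jobs above can be sketched roughly as follows. This is an illustration only, not the code from the operations/dumps repo: the names `DatabaseUnavailable`, `run_with_retries` and `delay_seconds` are hypothetical, and it assumes that "up to three retries" means one initial attempt followed by at most three more.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error standing in for a failed database connection."""


def run_with_retries(job, max_retries=3, delay_seconds=0.0):
    """Call job(); if the database is unavailable, retry up to max_retries
    times, then give up by re-raising the last error.

    Assumption: the first call is not counted as a retry, so job() is
    called at most max_retries + 1 times.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job()
        except DatabaseUnavailable:
            if attempts > max_retries:
                # All retries used up: let the caller (e.g. cron) see the error.
                raise
            time.sleep(delay_seconds)
```

A job that raises `DatabaseUnavailable` on every call would be invoked four times under this assumption before the error propagates, at which point the cron output would be expected to reach the mailing address mentioned earlier.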