Jump to content

Data Platform/Data Lake/Edits/MediaWiki history/Revision identity reverts

From Wikitech
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

This page summarizes an interaction on Slack about the definition of the notion of revision identity-revert in the mediawiki_history dataset. This notion is at heart of 4 fields of the dataset: revision_is_identity_revert, revision_is_identity_reverted, revision_seconds_to_identity_revert and revision_first_identity_reverting_revision_id.

Definition

A revision RR is an identity-revert of a previous revision if they have the same content-hash as computed by mediawiki. This content-hash is available in the mediawiki revision table, field rev_sha1 (see https://www.mediawiki.org/wiki/Manual:Revision_table). In this case in the mediawiki_history dataset the revision RR will have the revision_is_identity_revert field set to true. It is interesting to notice that multiple revisions can be reverts to a single original revision, the first one in time for the revision page to have the content-hash reused by following revisions.

The second aspect covered by the identity-revert is the one of reverted revisions. A revision RD is flagged as reverted (revision_is_identity_reverted set to true) if it happens between two revisions (RO, RR) where RR is an identity-revert to RO (they share the same content-hash). In term of content, a reverted revision is a revision whose content has been removed in favor of a previously existing content. It is classical to see multiple revisions being reverted by a a single revert. To help with tracking analyzing revert patterns, we have added two fields referencing the first reverting revision of a reverted revision: revision_first_identity_reverting_revision_id and revision_seconds_to_identity_revert.

Relation to Mediawiki Revert Types

Mediawiki defines 3 revert types: https://www.mediawiki.org/wiki/Manual:Reverts#Revert_types. Those revert types can be identified in the mediawiki_history dataset using the revision_tags field.

On July 11th 2022 Isaac Johnson (Thank you so much Isaac :) ran an analysis on May 2022 data of English Wikipedia (including namespace 0 only revisions, and excluding redirects):

+----------+----+--------+------+--------+
|mw_history|undo|rollback|manual|# edits |
+----------+----+--------+------+--------+
|false     |0   |0       |0     |3100382 |  # edits that weren't reverts by any definition

# reverts captured in mediawiki history
# vast majority also show up as edit tags
# but some small proportion were manual and
# missed by mediawiki for some reason
|true      |1   |0       |0     |109548  |  # straightforward undos
|true      |0   |1       |0     |54032   |  # straightforward rollbacks
|true      |0   |0       |1     |49750   |  # straightforward manual reverts
|true      |0   |0       |0     |20827   |  # manual reverts found by mediawiki_history table but not identified by mediawiki

# reverts not captured in mediawiki history
# vast majority are undos where presumably the editor
# changed something too and so generated a unique hash
|false     |1   |0       |0     |7711    |  # undo
|false     |0   |0       |1     |8       |  # manual
|false     |0   |1       |0     |2       |  # rollback
+----------+----+--------+------+--------+

For reference, here is the code of the analysis:

WITH reverts AS (
    SELECT
      revision_is_identity_revert as mw_history,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-undo'), 1, 0) AS undo,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-rollback'), 1, 0) AS rollback,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-manual-revert'), 1, 0) as manual
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '2022-05'
      AND wiki_db = 'enwiki'
      AND page_namespace = 0
      AND NOT page_is_redirect
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '2022-05-01'
      AND event_timestamp < '2022-06-01'  
)
SELECT
  mw_history,
  undo,
  rollback,
  manual,
  COUNT(1)
FROM reverts
GROUP BY
  mw_history,
  undo,
  rollback,
  manual