
Data Platform/Data Lake/Traffic/Pageview hourly/Sanitization algorithm proposal


As part of the plan to protect our users' data privacy, the pageview_hourly dataset needs to be sanitized so that it does not allow tracking of user paths.

See this page for a broader view on the pageview_hourly sanitization project.

Algorithm using COUNT(DISTINCT ip) OR COUNT(DISTINCT page_title) as trigger

Plain English

Three input parameters:

  1. Ks, a pair of thresholds (Kip, Kpv): groups with fewer distinct IPs than Kip, or fewer distinct page_titles than Kpv, get anonymized
  2. dimensions, the list of fields on which to group and possibly anonymize
  3. dataset, one hour of rows from the pageview_hourly table, still containing IPs

The first computation step is to generate a statistics table that will be reused to decide which value to anonymize when needed. This statistics table is a Map whose keys are (field name, field value) pairs and whose values are (count of distinct IPs, count of distinct page_titles) pairs. For example: ("city", "New York") -> (203954, 3258796), ("os_family", "Android") -> (874645, 1257645), ....
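
As a minimal sketch in plain Python (the rows, values, and field names below are made up), the statistics table can be built in a single pass by accumulating the distinct IPs and page_titles seen for each (field, value) pair:

from collections import defaultdict

rows = [
    {"ip": "10.0.0.1", "page_title": "Cat", "city": "New York", "os_family": "Android"},
    {"ip": "10.0.0.2", "page_title": "Dog", "city": "New York", "os_family": "Android"},
    {"ip": "10.0.0.1", "page_title": "Cat", "city": "Boston",   "os_family": "iOS"},
]
ips, pvs = defaultdict(set), defaultdict(set)
for row in rows:
    for field in ("city", "os_family"):
        key = (field, row[field])
        ips[key].add(row["ip"])
        pvs[key].add(row["page_title"])
statistics_table = {k: (len(ips[k]), len(pvs[k])) for k in ips}
# statistics_table[("city", "New York")] == (2, 2)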

Then the anonymization loop starts:

  1. Group the dataset by the dimensions, and count distinct IPs and distinct page_titles for each group.
  2. If no group has a count of distinct IPs lower than Kip or a count of distinct page_titles lower than Kpv, anonymization is finished: return the dataset.
  3. Else, for every group whose count of distinct IPs or distinct page_titles is below its threshold, choose the (field, value) pair having the smallest statistics value (for ip or page_title, respectively) and anonymize that field for all of the group's rows, as illustrated just below.
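
For instance (with made-up numbers), suppose a group fails the distinct-IP check. Step 3 compares the statistics entries of the group's own field values and blanks the rarest one, a minimal sketch in Python:

# Hypothetical statistics for the field values of one offending group
statistics_table = {("city", "Tiny Town"):    (12, 40),
                    ("os_family", "Android"): (874645, 1257645)}
d_idx = 0  # 0 = distinct-IP counts, since the group failed the Kip check
field, value = min(statistics_table, key=lambda k: statistics_table[k][d_idx])
# field == "city": "Tiny Town" is far rarer than "Android", so the city
# field is the one anonymized for all of the group's rows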

Pseudo-code

function build_statistics(dimensions: Set[String], dataset: List[Rows]) returns Map((String, String), (Long, Long)):
  var statistics_table = new Map
  for (field in dimensions):
    for (value, d_ips, d_pvs) in dataset.groupBy(field).agg(count(distinct ip) as d_ips, count(distinct page_title) as d_pvs):
      statistics_table[(field, value)] = (d_ips, d_pvs)
  return statistics_table

function getFieldToAnonymize(row: Row, d_idx: Int, dimensions: Set[String], statistics_table: Map((String, String), (Long, Long))) returns String:
  // d_idx selects the statistic to compare: 0 = count of distinct IPs, 1 = count of distinct page_titles
  var result_field = ""
  var min_d = -1
  for (field in dimensions):
    var field_d = statistics_table[(field, row[field])].get(d_idx)
    if ((min_d == -1) || (field_d < min_d)):
      result_field = field
      min_d = field_d
  return result_field

function anonymize(Ks: (Long, Long), dimensions: Set[String], dataset: List[Rows]) returns List[Rows]:
  statistics_table = build_statistics(dimensions, dataset)
  do:
    var grouped_dataset = dataset.groupBy(dimensions).agg(count(distinct ip) as group_d_ips, count(distinct page_title) as group_d_pvs)
    var anonymized_rows = 0
    for (group in grouped_dataset):
      if (group.group_d_ips < Ks._1):
        // too few distinct IPs in the group: blank each row's rarest field (by distinct-IP statistics)
        for (row in group.rows):
          row[getFieldToAnonymize(row, 0, dimensions, statistics_table)] = "dummy_value"
          anonymized_rows++
      else if (group.group_d_pvs < Ks._2):
        // too few distinct page_titles: same, comparing page_title statistics instead
        for (row in group.rows):
          row[getFieldToAnonymize(row, 1, dimensions, statistics_table)] = "dummy_value"
          anonymized_rows++
  while (anonymized_rows > 0)
  return dataset
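
To make the pseudo-code concrete, here is a minimal runnable sketch in plain Python, operating on rows represented as dicts rather than a Hive/Spark dataset. The SANITIZED placeholder, the dict-based grouping, and the infinite-default lookup (which prevents re-picking already-blanked fields) are illustrative choices, not the production implementation:

from collections import defaultdict

SANITIZED = "dummy_value"  # illustrative placeholder for anonymized fields

def build_statistics(dimensions, dataset):
    """Map (field, value) -> (count of distinct IPs, count of distinct page_titles)."""
    ips, pvs = defaultdict(set), defaultdict(set)
    for row in dataset:
        for field in dimensions:
            key = (field, row[field])
            ips[key].add(row["ip"])
            pvs[key].add(row["page_title"])
    return {key: (len(ips[key]), len(pvs[key])) for key in ips}

def get_field_to_anonymize(row, d_idx, dimensions, statistics_table):
    """Pick the row's field whose (field, value) pair is rarest for the
    triggering statistic (d_idx 0 = distinct IPs, 1 = distinct page_titles).
    Already-sanitized values are absent from the table; the infinite default
    keeps them from being picked again."""
    return min(dimensions,
               key=lambda f: statistics_table.get((f, row[f]),
                                                  (float("inf"), float("inf")))[d_idx])

def anonymize(ks, dimensions, dataset):
    k_ip, k_pv = ks
    statistics_table = build_statistics(dimensions, dataset)
    while True:
        # Re-group on every pass: blanking a field can merge several
        # small groups into one larger (hopefully safe) group.
        groups = defaultdict(list)
        for row in dataset:
            groups[tuple(row[f] for f in dimensions)].append(row)
        anonymized_rows = 0
        for rows in groups.values():
            d_ips = len({r["ip"] for r in rows})
            d_pvs = len({r["page_title"] for r in rows})
            d_idx = 0 if d_ips < k_ip else (1 if d_pvs < k_pv else None)
            if d_idx is None:
                continue  # group is safe, leave it untouched
            for row in rows:
                field = get_field_to_anonymize(row, d_idx, dimensions, statistics_table)
                if row[field] != SANITIZED:  # avoid looping on fully-blanked rows
                    row[field] = SANITIZED
                    anonymized_rows += 1
        if anonymized_rows == 0:
            return dataset

With this sketch, anonymize((10, 100), ["city", "os_family"], rows) keeps looping until every remaining group clears both thresholds, or nothing is left to blank.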

Previous Idea - Using SUM(views_count) as trigger

The original proposal was to use SUM(views_count) as a trigger, with a K value that makes sense at that scale. This approach was far less precise in selecting the rows to anonymize: the rows with the highest hacking value are those belonging to fingerprinting groups with a very small number of distinct IPs, in particular groups that also have a high number of views. Using SUM(views_count) as a trigger would have prevented us from distinguishing between groups with a small number of distinct IPs but a lot of requests (high hacking value) and groups with a high number of distinct IPs but a small number of requests per IP (low hacking value). Changing the anonymization trigger to the number of distinct IPs per fingerprinting group ensures that we anonymize the dangerous groups instead, whatever their views_count.
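
As a toy comparison (all numbers made up), consider two groups with the same total view count, a minimal sketch of why the two triggers differ:

# (distinct_ips, total_views) for two hypothetical fingerprinting groups
groups = {"2 IPs, 1000 views":   (2,   1000),   # high hacking value
          "500 IPs, 1000 views": (500, 1000)}   # low hacking value
K_views, K_ip = 2000, 10
for name, (d_ips, views) in groups.items():
    print(name,
          "sum trigger fires:", views < K_views,       # True for both groups
          "distinct-IP trigger fires:", d_ips < K_ip)  # True only for the first

SUM(views_count) sees both groups identically, while COUNT(DISTINCT ip) flags only the dangerous one.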

In addition to ensuring the anonymization of groups with a large number of views, this approach also avoids anonymizing rows with a reasonable number of distinct IPs but only a small number of views, preserving more data.