Data Platform/Data Lake/Traffic/Pageview hourly/Fingerprinting Over Time
As part of a plan to secure user data, the pageview_hourly dataset needs to be sanitized so that it does not allow reconstructing user browsing paths. See this page for a broad view of the pageview_hourly sanitization project.
This page goes through a fingerprinting analysis over time, using the webrequest and pageview_hourly datasets.
Data Preparation
First, build a table of fingerprint hashes from a single reference day of webrequest data (2016-01-11), counting distinct IPs, distinct page titles, and requests per hash:
CREATE TABLE hashes1 STORED AS PARQUET AS
SELECT
hash(
pageview_info['project'],
pageview_info['language_variant'],
access_method,
x_analytics_map['zero'],
agent_type,
geocoded_data['continent'],
geocoded_data['country_code'],
geocoded_data['country'],
geocoded_data['subdivision'],
geocoded_data['city'],
user_agent_map) as h,
COUNT(DISTINCT ip) as dip,
COUNT(DISTINCT pageview_info['page_title']) AS dpv,
COUNT(1) as c
FROM wmf.webrequest
WHERE webrequest_source = 'text'
AND year = 2016
AND month = 1
AND day = 11
AND agent_type = 'user'
GROUP BY
hash(
pageview_info['project'],
pageview_info['language_variant'],
access_method,
x_analytics_map['zero'],
agent_type,
geocoded_data['continent'],
geocoded_data['country_code'],
geocoded_data['country'],
geocoded_data['subdivision'],
geocoded_data['city'],
user_agent_map);
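Conceptually, the query above groups requests by a fingerprint tuple and counts distinct IPs, distinct page titles, and requests per fingerprint. A minimal Python sketch of that logic, using made-up toy data (the dimensions only loosely mirror the Hive columns):

```python
from collections import defaultdict

# Toy webrequest rows: fingerprint dimensions, then ip and page_title.
# All values are made up for illustration.
requests = [
    ("en.wikipedia", "desktop", "US", "San Francisco", "10.0.0.1", "Cat"),
    ("en.wikipedia", "desktop", "US", "San Francisco", "10.0.0.1", "Dog"),
    ("en.wikipedia", "desktop", "US", "San Francisco", "10.0.0.2", "Cat"),
    ("fr.wikipedia", "mobile web", "FR", "Paris", "10.0.0.3", "Chat"),
]

ips = defaultdict(set)      # h -> distinct IPs (dip)
titles = defaultdict(set)   # h -> distinct page titles (dpv)
counts = defaultdict(int)   # h -> request count (c)

for *fingerprint, ip, title in requests:
    h = hash(tuple(fingerprint))  # stands in for Hive's hash() over the dimensions
    ips[h].add(ip)
    titles[h].add(title)
    counts[h] += 1
```

A fingerprint shared by few IPs (dip close to 1) is the worrying case: it identifies a narrow set of users.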
WITH pv_hashes1 AS
(
SELECT
year, month, day,
hash(project, language_variant, access_method, zero_carrier, agent_type,
continent, country_code, country, subdivision, city, user_agent_map) as pvh,
SUM(view_count) as vc
FROM wmf.pageview_hourly
WHERE ((year = 2015 AND month > 5) OR (year = 2016 AND month = 1 AND day < 12))
AND agent_type = 'user'
GROUP BY
year,
month,
day,
hash(project, language_variant, access_method, zero_carrier, agent_type,
continent, country_code, country, subdivision, city, user_agent_map)
)
SELECT
year, month, day,
SUM(CASE WHEN dip = 1 THEN 1 ELSE 0 END) as dip1_n,
SUM(CASE WHEN dip = 2 THEN 1 ELSE 0 END) as dip2_n,
SUM(CASE WHEN dip = 3 THEN 1 ELSE 0 END) as dip3_n,
SUM(CASE WHEN dip = 4 THEN 1 ELSE 0 END) as dip4_n,
SUM(CASE WHEN dip = 5 THEN 1 ELSE 0 END) as dip5_n,
SUM(CASE WHEN dip > 5 THEN 1 ELSE 0 END) as dipPlus_n,
SUM(CASE WHEN dip = 1 THEN pvh1.vc ELSE 0 END) as dip1_vc,
SUM(CASE WHEN dip = 2 THEN pvh1.vc ELSE 0 END) as dip2_vc,
SUM(CASE WHEN dip = 3 THEN pvh1.vc ELSE 0 END) as dip3_vc,
SUM(CASE WHEN dip = 4 THEN pvh1.vc ELSE 0 END) as dip4_vc,
SUM(CASE WHEN dip = 5 THEN pvh1.vc ELSE 0 END) as dip5_vc,
SUM(CASE WHEN dip > 5 THEN pvh1.vc ELSE 0 END) as dipPlus_vc
FROM pv_hashes1 pvh1
INNER JOIN pv_san.hashes1 h1
ON (pvh1.pvh = h1.h)
GROUP BY
year, month, day
ORDER BY year, month, day
LIMIT 1000;
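The join above matches each day's pageview_hourly hashes against the reference-day hashes and accumulates matching-hash counts and view counts per distinct-IP (dip) bucket. A hypothetical Python sketch of that bucketing, with toy values:

```python
from collections import defaultdict

# Reference-day hashes: hash -> distinct IP count (toy values).
reference = {"h1": 1, "h2": 3, "h3": 7}

# Daily pageview hashes: (day, hash, view_count) rows (toy values).
daily = [
    ("2015-06-01", "h1", 10),
    ("2015-06-01", "h9", 4),   # no match on the reference day; dropped by the join
    ("2015-06-02", "h2", 5),
    ("2015-06-02", "h3", 2),
]

# day -> dip bucket -> [number of matching hashes, matched view count]
buckets = defaultdict(lambda: defaultdict(lambda: [0, 0]))
for day, h, vc in daily:
    if h in reference:                         # INNER JOIN on the hash
        dip = reference[h]
        bucket = str(dip) if dip <= 5 else "5+"
        buckets[day][bucket][0] += 1           # like dipN_n
        buckets[day][bucket][1] += vc          # like dipN_vc
```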
Results

As the results show, the decrease in the number of matching hashes is almost linear. Nonetheless, the number of hashes from December still existing in June is very substantial (over a million distinct hashes), particularly among hashes involving only one IP, the most dangerous bucket in terms of browsing-pattern reconstruction.