User:Joal/JanusGraph

From Wikitech

This page is my work log of experiments with JanusGraph.

WDQS

Wikidata

Janus/Gremlin/Tinkerpop

2019-09-06 - Install and tests on Cloud VPS

I already installed JanusGraph on Cloud VPS once, but that was almost a year ago at All-Hands. Starting fresh :)

I'm using OpenJDK 8 (JanusGraph needs Java 1.8) and JanusGraph 0.4.0 (latest as of 2019-09-06)

Install and test

  • I created the janus1-1 large instance using Debian 9.9 Stretch (Java 8 needed) in the Cloud VPS analytics project with Horizon
  • I followed the introduction section of https://docs.janusgraph.org/, changing the index backend from Elasticsearch to Lucene (single-node test).

Install

ssh janus1-1.analytics.eqiad.wmflabs

sudo apt-get install unzip openjdk-8-jre

wget https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip

unzip janusgraph-0.4.0-hadoop2.zip

cd janusgraph-0.4.0-hadoop2

./bin/gremlin.sh

Test

/**********************************************
  Configure and load graph
**********************************************/

// Create graph with updated configuration (Lucene instead of ES)
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')

// Load graph example
GraphOfTheGodsFactory.load(graph)

// Create graph traversal object
g = graph.traversal()

/**********************************************
  Test graph traversal
**********************************************/

// Create a pointer to the Saturn node using index on name
saturn = g.V().has('name', 'saturn').next()

// Show the Saturn node pointer values ([name:[saturn], age:[10000]])
g.V(saturn).valueMap()

// Use the Saturn node pointer to find the name of Saturn's grandchild (hercules)
g.V(saturn).in('father').in('father').values('name')
==>hercules

// Use geo index to find edges having a place property within 50km of Athens (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50)))

// Find nodes connected to the edges found by geo-index query and show their names (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50))).
  as('source').inV().as('god2').
  select('source').outV().as('god1').
  select('god1', 'god2').by('name')
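
To make explicit what the in('father').in('father') steps compute, here is a minimal plain-Scala sketch, independent of JanusGraph and Gremlin. The FatherTraversal object and its fatherOf, childrenOf and grandChildrenOf members are hypothetical names used for illustration only; the two edges are the 'father' edges of the Graph of the Gods sample that the query above walks through (hercules -> jupiter -> saturn).

```scala
object FatherTraversal {
  // 'father' edges of the sample graph, stored as child -> father
  // (illustrative subset, not loaded from JanusGraph)
  val fatherOf: Map[String, String] = Map(
    "hercules" -> "jupiter",
    "jupiter"  -> "saturn"
  )

  // Counterpart of a single in('father') step: follow 'father' edges backwards
  def childrenOf(name: String): Set[String] =
    fatherOf.collect { case (child, father) if father == name => child }.toSet

  // Counterpart of in('father').in('father'): two backward hops
  def grandChildrenOf(name: String): Set[String] =
    childrenOf(name).flatMap(childrenOf)
}
```

With these definitions, FatherTraversal.grandChildrenOf("saturn") yields Set("hercules"), matching the result of the Gremlin query.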

2019-09-16 - Analyze and prepare Wikidata-truthy for loading

(started in 2019-09-06 session)

Load dump

import org.apache.spark.sql.functions._

val dump_path = "/user/joal/wmf/data/raw/mediawiki/wikidata/truthy_ntdumps/20190904"
val df = spark.read.format("csv").
  option("mode", "FAILFAST").
  option("delimiter", " ").
  load(dump_path).
  withColumnRenamed("_c0", "origin").
  withColumnRenamed("_c1", "link").
  withColumnRenamed("_c2", "dest").
  drop("_c3").
  cache()
  
df.count()
// 4139056936 - Wow!!!

df.where("origin is null or link is null or dest is null").count()
// 0 - \o/ well-formed data

df.select("origin").distinct().count()
// 124151595

df.select("dest").distinct().count()
// 685067856

df.select("link").distinct().count()
// 6516
// Check http://www.wikidata.org/prop links
df.where("link like '<http://www.wikidata.org/prop%'").select("link").distinct.count
// 6486                                                              
df.where("link like '<http://www.wikidata.org/prop/direct/%'").select("link").distinct.count
// 6351                                                              
df.where("link like '<http://www.wikidata.org/prop/direct-normalized/%'").select("link").distinct.count
// 135 - 6351 direct + 135 direct-normalized = 6486, so prop links are direct or direct-normalized only - GOOD :)

// Check other link types and evaluate whether to keep them or not
df.where("link not like '<http://www.wikidata.org/prop%'").groupBy("link").count.sort(desc("count")).show(100, false)
/*
+----------------------------------------------------+----------+               
|link                                                |count     |
+----------------------------------------------------+----------+

** To Keep (in addition to direct and direct-normalized links):
|<http://www.w3.org/2002/07/owl#sameAs>              |2464024   |


** To remove:
  ** We drop all language related classes
|<http://schema.org/name>                            |322876582 |
|<http://schema.org/description>                     |2014877520|
|<http://www.w3.org/2004/02/skos/core#prefLabel>     |322876582 |
|<http://www.w3.org/2000/01/rdf-schema#label>        |322876582 |
|<http://www.w3.org/2004/02/skos/core#altLabel>      |67929447  |

  ** We drop metadata
|<http://schema.org/dateModified>                    |62033634  |
|<http://schema.org/version>                         |62033306  |
|<http://schema.org/about>                           |62033306  |

  ** We drop redundant info (this is described as link-property)
|<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>   |121788917 |

// Origin is PXXX and dest is a derivative of PXXX without other usage (origin or dest)
|<http://wikiba.se/ontology#qualifier>               |6595      |
|<http://www.w3.org/2002/07/owl#someValuesFrom>      |6595      |
|<http://wikiba.se/ontology#claim>                   |6595      |
|<http://wikiba.se/ontology#statementProperty>       |6595      |
|<http://www.w3.org/2002/07/owl#onProperty>          |6595      |
|<http://wikiba.se/ontology#referenceValue>          |6595      |
|<http://wikiba.se/ontology#reference>               |6595      |
|<http://wikiba.se/ontology#directClaim>             |6595      |
|<http://wikiba.se/ontology#statementValue>          |6595      |
|<http://wikiba.se/ontology#qualifierValue>          |6595      |
|<http://wikiba.se/ontology#directClaimNormalized>   |4758      |
|<http://wikiba.se/ontology#referenceValueNormalized>|4758      |
|<http://wikiba.se/ontology#statementValueNormalized>|4758      |
|<http://wikiba.se/ontology#qualifierValueNormalized>|4758      |

  ** Used for dumps info only (a lot of same rows ... weird)
|<http://www.w3.org/2002/07/owl#imports>             |328       |
|<http://schema.org/softwareVersion>                 |328       |
|<http://creativecommons.org/ns#license>             |328       |

// Interesting for value interpretation (kept in own dataset)
|<http://wikiba.se/ontology#propertyType>            |6595      |

// Value from link is also used as origin -- Seems not used in truthy
|<http://wikiba.se/ontology#novalue>                 |6595      |
// Used with previous  -^
|<http://www.w3.org/2002/07/owl#complementOf>        |6595      |
+----------------------------------------------------+----------+

Checking code samples (to be updated for each link type and format):
df.where("link = '<http://www.w3.org/2002/07/owl#someValuesFrom>'").show(20, false)
df.where("link = '<http://www.w3.org/2002/07/owl#someValuesFrom>'").selectExpr("split(origin, '/')[4] as o", "split(dest, '/')[4] as d").where("o <> d").count
df.where("""
  origin like '<http://www.wikidata.org/prop/P%'
  AND link != '<http://www.w3.org/2002/07/owl#someValuesFrom>'""").show(20, false)
*/

val fdf = df.where("""
      -- Dropping descriptions, labels, versions...
      link NOT IN (
        '<http://schema.org/name>',
        '<http://schema.org/description>',
        '<http://www.w3.org/2004/02/skos/core#prefLabel>',
        '<http://www.w3.org/2000/01/rdf-schema#label>',
        '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
        '<http://www.w3.org/2004/02/skos/core#altLabel>',
        '<http://schema.org/dateModified>',
        '<http://schema.org/about>',
        '<http://schema.org/version>',
        '<http://wikiba.se/ontology#claim>',
        '<http://wikiba.se/ontology#statementProperty>',
        '<http://wikiba.se/ontology#qualifier>',
        '<http://wikiba.se/ontology#directClaim>',
        '<http://wikiba.se/ontology#statementValue>',
        '<http://wikiba.se/ontology#qualifierValue>',
        '<http://wikiba.se/ontology#reference>',
        '<http://www.w3.org/2002/07/owl#onProperty>',
        '<http://wikiba.se/ontology#referenceValue>',
        '<http://wikiba.se/ontology#statementValueNormalized>',
        '<http://wikiba.se/ontology#referenceValueNormalized>',
        '<http://wikiba.se/ontology#directClaimNormalized>',
        '<http://wikiba.se/ontology#qualifierValueNormalized>',
        '<http://www.w3.org/2002/07/owl#someValuesFrom>',
        '<http://wikiba.se/ontology#novalue>',
        '<http://www.w3.org/2002/07/owl#complementOf>',
        '<http://wikiba.se/ontology#propertyType>'
      ) AND  origin != '<http://wikiba.se/ontology#Dump>'
  """).cache()
  
fdf.count()
// 825663549 -- Better - Need some naming effort

// checking and defining property types
df.where("link = '<http://wikiba.se/ontology#propertyType>' and origin not like '<http://www.wikidata.org/entity/P%'").count
// 0

val propertyTypes = df.
  where("link = '<http://wikiba.se/ontology#propertyType>'").
  selectExpr("replace(split(origin, '/')[4], '>', '') AS property", "replace(split(dest, '#')[1], '>', '') as propertyType").
  cache()
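
The split/replace expressions used above to shorten IRIs can be tried outside Spark. The following is a minimal plain-Scala sketch under the assumption that every input has the `<http://host/segment/ID>` or `<http://host/path#Fragment>` shape seen in the dump; the IriShortening object and its entityId and fragment helpers are hypothetical names, and the sample IRIs are illustrative inputs.

```scala
object IriShortening {
  // Keep the 5th '/'-separated token and drop the closing '>',
  // mirroring replace(split(origin, '/')[4], '>', '')
  def entityId(iri: String): String =
    iri.split("/")(4).replace(">", "")

  // Keep what follows '#' and drop the closing '>',
  // mirroring replace(split(dest, '#')[1], '>', '')
  def fragment(iri: String): String =
    iri.split("#")(1).replace(">", "")
}
```

For example, IriShortening.entityId("<http://www.wikidata.org/entity/P31>") gives "P31" and IriShortening.fragment("<http://wikiba.se/ontology#WikibaseItem>") gives "WikibaseItem". An input that does not follow the assumed shape would throw an ArrayIndexOutOfBoundsException, which is why the shape is checked with a count beforehand.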

Rename values for simplicity and size, and pivot some data (names and property-types)

// Check origin values
fdf.where("origin not like '<http://www.wikidata.org/entity/%'").count
// 0 - We have the scheme :)

val fdfr1 = fdf.selectExpr(
  "replace(split(origin, '/')[4], '>', '') AS origin",
  """CASE
        -- Dropping difference between direct and direct-normalized (only used for ExternalId)
        WHEN link like '<http://www.wikidata.org/prop/%' THEN replace(split(link, '/')[5], '>', '')
        WHEN link = '<http://www.w3.org/2002/07/owl#sameAs>' THEN 'SameAs'
        ELSE link
  END as link""",
  """CASE
        WHEN link = '<http://schema.org/name>' THEN replace(dest, '@en', '')
        ELSE dest
  END as dest"""
).cache()


val fdfj1 = fdfr1.join(propertyTypes, col("link") === col("property"), "left").drop("property").cache
fdfj1.groupBy("propertyType").count().sort(desc("count")).show(100, false)
/*
+----------------+---------+                                                    
|propertyType    |count    |
+----------------+---------+
|WikibaseItem    |387850621|
|String          |169523450|
|ExternalId      |135193147|
|Time            |32810131 |
|Monolingualtext |27807321 |
|Quantity        |9643796  |
|GlobeCoordinate |7603827  |
|CommonsMedia    |3651002  |
|Url             |3045010  |
|null            |2464024  |
|WikibaseProperty|24410    |
|Math            |4105     |
|GeoShape        |2844     |
|WikibaseLexeme  |1299     |
|MusicalNotation |291      |
|TabularData     |16       |
|WikibaseSense   |13       |
|WikibaseForm    |2        |
+----------------+---------+

*/

// Looking for dest renaming scheme

fdfj1.where("propertyType is null").select("link").distinct.show(20, false)
/*
+------+
|link  |
+------+
|SameAs|
+------+
*/

fdfj1.where("propertyType = 'ExternalId'").show(20, false)

fdfj1.where("link = 'SameAs' and dest not like '<http://www.wikidata.org/entity/Q%'").count
// 0
fdfj1.where("""propertyType = 'WikibaseProperty'
                 AND dest not like '<http://www.wikidata.org/entity/P%'
                 AND dest not like '_:genid%'""").count
// 0
fdfj1.where("""propertyType = 'WikibaseLexeme'
                 AND dest not like '<http://www.wikidata.org/entity/L%'
                 """).count
// 0
fdfj1.where("""propertyType = 'WikibaseSense'
                 AND dest not like '<http://www.wikidata.org/entity/L%'
                 """).count
// 0
fdfj1.where("""propertyType = 'WikibaseForm'
                 AND dest not like '<http://www.wikidata.org/entity/L%'
                 """).count

fdfj1.where("dest like '%^^<%'").groupBy("propertyType").count().sort(desc("count")).show(100, false)
/*
+---------------+--------+
|propertyType   |count   |
+---------------+--------+
|Time           |32779728|
|Quantity       |9643081 |
|GlobeCoordinate|7602994 |
|Math           |4105    |
+---------------+--------+
*/
fdfj1.where("dest like '%^^<%'").selectExpr("split(replace(dest, '^^', ';;'), ';;')[1] as typ", "propertyType").groupBy("typ", "propertyType").count().sort(desc("count")).show(100, false)
/*
+-------------------------------------------------+---------------+--------+    
|typ                                              |propertyType   |count   |
+-------------------------------------------------+---------------+--------+
|<http://www.w3.org/2001/XMLSchema#dateTime>      |Time           |32779728|
|<http://www.w3.org/2001/XMLSchema#decimal>       |Quantity       |9643081 |
|<http://www.opengis.net/ont/geosparql#wktLiteral>|GlobeCoordinate|7602994 |
|<http://www.w3.org/1998/Math/MathML>             |Math           |4105    |
+-------------------------------------------------+---------------+--------+
We can get rid of the inner-value type :)
*/

fdfj1.where("""propertyType = 'WikibaseItem'
                 AND dest not like '<http://www.wikidata.org/entity/Q%'
                 AND dest not like '_:genid%'""").count
// 0 - \o/ only origin values :)




// Renaming values in 2 stages to remove double-quotes
val fdfr2 = fdfj1.selectExpr(
  "origin",
  "link",
  """CASE
        WHEN link = 'SameAs' THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType = 'WikibaseItem' AND dest like '<http://www.wikidata.org/entity/Q%'
          THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType = 'WikibaseProperty' AND dest like '<http://www.wikidata.org/entity/P%'
          THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType IN ('WikibaseLexeme', 'WikibaseSense', 'WikibaseForm')
          THEN replace(split(dest, '/')[4], '>', '')
        WHEN propertyType IN ('Time', 'Quantity', 'GlobeCoordinate', 'Math') and dest like '%^^<%' THEN split(replace(dest, '^^', ';;'), ';;')[0]
        ELSE dest
  END as dest""").cache()

val fdfr3 = fdfr2.selectExpr(
  "origin",
  "link",
  """CASE
        WHEN dest rlike '^"[^"]*"$' THEN replace(dest, '"', '')
        ELSE dest
  END as dest""").cache()

fdfr3.repartition(8).write.mode("overwrite").option("compression", "gzip").json("/user/joal/test_wdqs/truthy_20190916")
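
The two renaming stages applied to dest can be illustrated without Spark. This is a minimal plain-Scala sketch, not the job itself: the DestCleaning object and its dropDatatype, dropQuotes and clean helpers are hypothetical names, and the sample literals in the usage note are made-up inputs in the same shape as the dump values.

```scala
object DestCleaning {
  // Stage 1: drop the ^^<datatype> suffix of a typed literal.
  // '^^' is swapped for ';;' first because split takes a regex
  // and '^' is a regex metacharacter.
  def dropDatatype(dest: String): String =
    if (dest.contains("^^<")) dest.replace("^^", ";;").split(";;")(0) else dest

  // Stage 2: remove the surrounding double quotes, only when the value
  // contains no inner double quote (same pattern as the rlike test)
  def dropQuotes(dest: String): String =
    if (dest.matches("^\"[^\"]*\"$")) dest.replace("\"", "") else dest

  // Both stages in order: the quote test has to run on the value
  // left over once the datatype suffix is gone
  def clean(dest: String): String = dropQuotes(dropDatatype(dest))
}
```

For example, DestCleaning.clean applied to "1879-03-14T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> returns 1879-03-14T00:00:00Z, while a language-tagged value such as "Berlin"@de is left untouched because it does not end with a double quote.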