Michael Blum

Developer from Chicago

XKCD on Neo4j


XKCD On Neo4j

This is the first post in a series on importing different datasets into Neo4j - a graph database.

Importing xkcd into a database is neither new nor novel but can offer an interesting introduction into working with graph databases that while could be accomplished with other technologies such as sqlite, redis, solr, etc - there are some very cool things that a graph can tell us about Randall Munroe’s amazing comic.

XKCD API

XKCD now has a JSON API that has all of the comics (title, meta, alt-text) in an easy to consume and parse. You can find the starting point here:

XKCD JSON API

and a reference to the latest issue here:

Latest Issue

While there are innumerable ways to consume a JSON API I thought it would be interesting to trying and accomplish this entirely in Neo4j without any external dependencies or code.

Getting Started

To start things off - go and download the latest version of Neo4j (4.1.3 as of this writing) Download Neo4j. You can use desktop or Docker as well but I prefer to use the native tarball for simplicity’s sake. Explode the tarball and make sure >= java 11 is on your system.

For accessibility this will all work with the community edition but you can try out enterprise as well.

Configuring Neo4j

Make a backup of your neo4j.conf as we’ll be making some minor tweaks:

cp conf/neo4j.conf conf/neo4j.conf.default

First we want to make the following tweaks:

# allow apoc procedures and functions to execute
dbms.security.procedures.unrestricted=apoc.*

For tl;dr you can grab the configured conf here:

neo4j.conf

Then copy the apoc plug-in from the bundled labs directory:

cp labs/apoc-4.1.0.3-core.jar plugins/

Before starting neo4j we want to set a password:

./bin/neo4j-admin set-initial-password "*********"

If this doesn’t work then you can reset the password following this guide:

Reset Neo4j Password

Once your neo4j.conf and plugins, and password are sorted we can go ahead and startup neo4j.

Fetch XKCD Data With Cypher

The full query can be found here

but what’s the fun in that? Let’s unpack how we can iteratively pull the latest XKCD comics into our database.

Using the APOC procedure apoc.periodic.commit we can load the data into Neo4j across multiple batches - avoiding overloading the database. To that end we want to have our query only grab the latest issues from what we have cached and avoid re-downloading 1000s of comics every time ew run our import.

CALL apoc.periodic.commit("
  OPTIONAL MATCH (current:Xkcd) WITH current ORDER BY current.id DESC LIMIT 1
  OPTIONAL MATCH (oldest:Xkcd) WITH current, oldest ORDER BY oldest.id ASC LIMIT 1
  WITH current.id AS current_issue, 
  CASE
    WHEN oldest IS NULL THEN 1 
    WHEN oldest.id = 1 THEN current.id END
  AS oldest_issue, 'https://xkcd.com/info.0.json' AS uri
  CALL apoc.load.jsonParams(uri, null, null)
  YIELD value WITH value.num AS latest_issue, current_issue, oldest_issue
  WITH latest_issue, current_issue, oldest_issue, 
  CASE WHEN oldest_issue = 1 AND latest_issue = current_issue
		THEN []
		ELSE RANGE(latest_issue, oldest_issue, -1) END 
  AS issue_ids

Create An Issue

  UNWIND issue_ids AS issue_id
  WITH latest_issue, oldest_issue, 'https://xkcd.com/' + issue_id + '/info.0.json' AS uri
  CALL apoc.load.jsonParams(uri, null, null, null, {failOnError: false})
  YIELD value
  MERGE (issue:Xkcd {id: value.num})
         SET issue.link = CASE WHEN TRIM(value.link) <> '' THEN TRIM(value.link) ELSE NULL END,
             issue.news = CASE WHEN TRIM(value.news) <> '' THEN TRIM(value.news) ELSE NULL END,
             issue.safe_title = value.safe_title,
             issue.transcript = CASE WHEN TRIM(value.transcript) <> '' THEN TRIM(value.transcript) ELSE NULL END,
             issue.alt = CASE WHEN TRIM(value.alt) <> '' THEN TRIM(value.alt) ELSE NULL END,
             issue.img = value.img,
             issue.title = value.title,
             issue.date = date({year: toInteger(value.year), month: toInteger(value.month), day: toInteger(value.day)})
  WITH issue, latest_issue
  OPTIONAL MATCH (xkcd:Xkcd)
  RETURN latest_issue - COUNT(xkcd);") YIELD failedBatches RETURN failedBatches;

We want to keep going even if we fail to download an issue. failOnError covers the fun edge case that there is no XKCD #404 as that’s reserved for the HTTP error code of page not found.

Your next question might be what are all of these CASE statements? These tries let us massage and clean the data to be more efficient in Neo4j. In general its cheaper to check for the existence of a property vs is it null, empty, or contains text. Since the JSON payloads default to attr: "" we want to make those null in Neo4j so we’re not wasting memory and compute looking up attributes that don’t exist.

Indexes

When creating relationships between nodes - it is vital to create indexes BEFORE we try and create our relationships as they will execute much faster than without.

Neo4j supports a range of different indexes and we’ll be creating one of each here.

An index on a comic’s id (same as its issue #) will help with relationship creation:

// indexes

WITH 1 AS schema
WHERE NOT apoc.schema.node.indexExists('Xkcd', ['id'])
CALL db.createIndex('xkcd_id', ['Xkcd'], ['id'], 'native-btree-1.0') YIELD name, labels, properties
RETURN name, labels, properties;

Neo4j also supports indexes on temporal lookups (days, months, years) as well as relative (7 days ago). To facilitate that we want to create an index on the comic’s creation date:

WITH 1 AS schema
WHERE NOT apoc.schema.node.indexExists('Xkcd', ['date'])
CALL db.createIndex('xkcd_date', ['Xkcd'], ['date'], 'native-btree-1.0') YIELD name, labels, properties
RETURN name, labels, properties;

Lastly Neo4j support full-text search using a Lucene index. This makes groking across a lot of text fast and resource-efficient:

WITH 1 AS schema
WHERE NOT apoc.schema.node.indexExists('Xkcd', ['title', 'safe_title', 'transcript', 'alt'])
CALL db.index.fulltext.createNodeIndex("titlesAndDescriptions",["Xkcd"],["title", "safe_title", "transcript", "alt"]) 
RETURN schema;

Relationships

Our relationships will create a linked lists of each comic related to its previous and next issues. Some interesting edge cases to consider are:

  1. The first issue #1 will have no previous relationship
  2. the latest issue #N will have no next relationship

We can use these two cases to make very efficient queries as you’ll see.

// Relationships

MATCH (issue:Xkcd) 
  WITH issue ORDER BY issue.id ASC 
  WITH COLLECT(issue.id) AS issues 
CALL apoc.coll.zipToRows(issues, TAIL(issues)) YIELD value AS pair
  MATCH (a:Xkcd {id: pair[0]}) WITH pair, a
  MATCH (b:Xkcd {id: pair[1]}) WHERE a.id <> pair[1]
  MERGE (a)-[:NEXT]->(b)
  WITH a, b
  MERGE (a)<-[:PREV]-(b);

Execute

Now that we’ve written our query let’s execute it using cypher-shell:

cat xkcd.cypher | ./bin/cypher-shell -u neo4j -d neo4j -a bolt://localhost:7687

Querying XKCD

The initial load should take quite a while as we’re importing 2,000+ issues but the incremental executions should be much faster as we’re only importing the diff between our current issue and the latest one.

Querying our database we get a doubly linked list:

Double Linked-list XKCD Graph

As I alluded to previously the two edge cases of the oldest and latest issues make them very efficient to look up as we’re only checking for the lack of a relationship which is very cheap to query for vs looking up a property or doing something like:

MATCH (xkcd:Xkcd) RETURN xkcd ORDER BY xkcd.id DESC LIMIT 1;

profiles out to:

PROFILE MATCH (xkcd:Xkcd) RETURN xkcd ORDER BY xkcd.id DESC LIMIT 1;

versus a relationship lookup to find the latest issue:

MATCH (xkcd:Xkcd) WHERE NOT (xkcd)-[:NEXT]->() RETURN xkcd;

PROFILE MATCH (xkcd:Xkcd) WHERE NOT (xkcd)-[:NEXT]->() RETURN xkcd;

Similarly finding the oldest issue is just as efficient:

MATCH (xkcd:Xkcd) WHERE NOT (xkcd)-[:PREV]->() RETURN xkcd;

Another possible solution is to use the index but we end up needing to search for our node twice:

PROFILE MATCH (xkcd:Xkcd) WITH MAX(xkcd.id) AS latest MATCH (xkcd:Xkcd {id: latest}) RETURN xkcd;

PROFILE MATCH (xkcd:Xkcd) WITH MAX(xkcd.id) AS latest MATCH (xkcd:Xkcd {id: latest}) RETURN xkcd;

Finally if we want to verify our import script only gets the diff’d issues let’s delete the latest issue and re-import:

// delete the latest issue
MATCH (latest:Xkcd) WHERE NOT (latest)-[:NEXT]->() WITH latest LIMIT 1 DETACH DELETE latest;
// run the import

cat xkcd.cypher | ./bin/cypher-shell -u neo4j -d neo4j -a bolt://localhost:7687

Do giraffes ever get referenced in XKCD?

// full-text search across titles, alt-text, etc
CALL db.index.fulltext.queryNodes("titlesAndDescriptions", "giraffes") YIELD node, score
RETURN node.title, node.alt, score;

The power of Lucene indexes can’t be over-stated. Using this index we can grok across titles, alt-text, news, and transcripts.

Sources

xkcd.cypher

CALL apoc.periodic.commit("
  OPTIONAL MATCH (current:Xkcd) WITH current ORDER BY current.id DESC LIMIT 1
  OPTIONAL MATCH (oldest:Xkcd) WITH current, oldest ORDER BY oldest.id ASC LIMIT 1
  WITH current.id AS current_issue, 
  CASE
    WHEN oldest IS NULL THEN 1 
    WHEN oldest.id = 1 THEN current.id END
  AS oldest_issue, 'https://xkcd.com/info.0.json' AS uri
  CALL apoc.load.jsonParams(uri, null, null)
  YIELD value WITH value.num AS latest_issue, current_issue, oldest_issue
  WITH latest_issue, current_issue, oldest_issue, 
  CASE WHEN oldest_issue = 1 AND latest_issue = current_issue
		THEN []
		ELSE RANGE(latest_issue, oldest_issue, -1) END 
  AS issue_ids
  UNWIND issue_ids AS issue_id
  WITH latest_issue, oldest_issue, 'https://xkcd.com/' + issue_id + '/info.0.json' AS uri
  CALL apoc.load.jsonParams(uri, null, null, null, {failOnError: false})
  YIELD value
  MERGE (issue:Xkcd {id: value.num})
         SET issue.link = CASE WHEN TRIM(value.link) <> '' THEN TRIM(value.link) ELSE NULL END,
             issue.news = CASE WHEN TRIM(value.news) <> '' THEN TRIM(value.news) ELSE NULL END,
             issue.safe_title = value.safe_title,
             issue.transcript = CASE WHEN TRIM(value.transcript) <> '' THEN TRIM(value.transcript) ELSE NULL END,
             issue.alt = CASE WHEN TRIM(value.alt) <> '' THEN TRIM(value.alt) ELSE NULL END,
             issue.img = value.img,
             issue.title = value.title,
             issue.date = date({year: toInteger(value.year), month: toInteger(value.month), day: toInteger(value.day)})
  WITH issue, latest_issue
  OPTIONAL MATCH (xkcd:Xkcd)
  RETURN latest_issue - COUNT(xkcd);") YIELD failedBatches RETURN failedBatches;

// indexes

WITH 1 AS schema
WHERE NOT apoc.schema.node.indexExists('Xkcd', ['id'])
CALL db.createIndex('xkcd_id', ['Xkcd'], ['id'], 'native-btree-1.0') YIELD name, labels, properties
RETURN name, labels, properties;

WITH 1 AS schema
WHERE NOT apoc.schema.node.indexExists('Xkcd', ['date'])
CALL db.createIndex('xkcd_date', ['Xkcd'], ['date'], 'native-btree-1.0') YIELD name, labels, properties
RETURN name, labels, properties;

WITH 1 AS schema
WHERE NOT apoc.schema.node.indexExists('Xkcd', ['title', 'safe_title', 'transcript', 'alt'])
CALL db.index.fulltext.createNodeIndex("titlesAndDescriptions",["Xkcd"],["title", "safe_title", "transcript", "alt"]) 
RETURN schema;

// Relationships

MATCH (issue:Xkcd) 
  WITH issue ORDER BY issue.id ASC 
  WITH COLLECT(issue.id) AS issues 
CALL apoc.coll.zipToRows(issues, TAIL(issues)) YIELD value AS pair
  MATCH (a:Xkcd {id: pair[0]}) WITH pair, a
  MATCH (b:Xkcd {id: pair[1]}) WHERE a.id <> pair[1]
  MERGE (a)-[:NEXT]->(b)
  WITH a, b
  MERGE (a)<-[:PREV]-(b);

// find the latest issue
MATCH (xkcd:Xkcd) WHERE NOT (xkcd)-[:NEXT]->() RETURN xkcd;

// find the oldest issue
MATCH (xkcd:Xkcd) WHERE NOT (xkcd)-[:PREV]->() RETURN xkcd;

// delete the latest issue
MATCH (latest:Xkcd) WHERE NOT (latest)-[:NEXT]->() WITH latest LIMIT 1 DETACH DELETE latest;

// full-text search across titles, alt-text, etc
CALL db.index.fulltext.queryNodes("titlesAndDescriptions", "giraffes") YIELD node, score
RETURN node.title, node.alt, score;