Analyzing the Yelp dataset in Neo4j using Docker
Last updated: May 15, 2022
Yelp is a website which publishes crowd-sourced reviews about businesses. The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can submit a review of their products or services using a one to five star rating scale. Businesses can also update contact information, hours, and other basic listing information or add special deals. In addition to writing reviews, users can react to reviews, plan events, or discuss their personal lives.
It has been operating since 2004, expanded into other countries, and became a public company in 2012. As of December 31, 2021, approximately 244.4 million reviews were available on its business listing pages. In 2021, the company had 46 million unique visitors to its desktop webpages and 56.7 million unique visitors to its mobile sites.
Every year since 2014, Yelp has published a sample of its dataset and opened a data science challenge.
In this article we will extract, transform and load this dataset into Neo4j running in a Docker container. The advantage of using Docker is that you get an isolated and portable environment for running the latest version of Neo4j with all the dependencies it needs, such as the Java JDK and the Neo4j plugins (APOC and Graph Data Science).
1. Getting the dataset from Yelp
The dataset can be downloaded from here. The files are large and therefore need to be loaded into Neo4j using neo4j-admin import.
In fact there are two ways to import data from CSV files into Neo4j: via neo4j-admin import or LOAD CSV.
With the neo4j-admin import command, you can do batch imports of large amounts of data into a previously unused database from CSV files. The command can be performed only once per database. By default, this database is set to neo4j, but you can use the --database=<database> option to import your data into a different database.
With LOAD CSV, you can import small to medium-sized CSV files into an existing database. LOAD CSV can be run as many times as needed and does not require an empty database.
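For comparison, a LOAD CSV version of the business import could look like the minimal sketch below. It assumes a hypothetical businesses.csv that carries its own header row (unlike the files we generate in the next section, which keep their headers in separate files, as neo4j-admin import expects):
// sketch only: load businesses from a CSV with a header row
LOAD CSV WITH HEADERS FROM 'file:///businesses.csv' AS row
MERGE (b:Business {id: row.business_id})
SET b.name = row.name, b.address = row.address, b.city = row.city, b.state = row.state;
We will not take this route here, since the Yelp files are far too large for it to be practical.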
2. Preparing the dataset for bulk upload
Since the Yelp dataset is in JSON format, we need to transform the files into CSV, and for this task we are going to use the following Python script:
import csv
import json
import os
import hashlib


def write_header(file_name, columns):
    with open(file_name, 'w') as file_csv:
        writer = csv.writer(file_csv)
        writer.writerow(columns)


if __name__ == "__main__":
    print("JSON to CSV Data converter starting ...")

    if not os.path.isfile("data/business_header.csv"):
        with open("dataset/business.json") as business_json, \
                open("data/business.csv", 'w+') as business_csv:
            # the :LABEL column gives every row the Business label on import
            write_header("data/business_header.csv",
                         ['id:ID(Business)', 'name', 'address', 'city', 'state', ':LABEL'])
            business_writer = csv.writer(business_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            for line in business_json:
                item = json.loads(line)
                try:
                    business_writer.writerow(
                        [item['business_id'], item['name'], item['address'], item['city'], item['state'],
                         'Business'])
                except Exception as e:
                    print(item)
                    raise e
        print("business dataset converted ...")

    if not os.path.isfile("data/city_header.csv"):
        with open("dataset/business.json") as business_json, \
                open("data/city.csv", 'w+') as city_csv, \
                open("data/business_IN_CITY_city.csv", 'w+') as business_city_csv:
            write_header("data/city_header.csv", ['name:ID(City)', ':LABEL'])
            # relationship files need a :TYPE column (or a type given on the command line)
            write_header("data/business_IN_CITY_city_header.csv",
                         [':START_ID(Business)', ':END_ID(City)', ':TYPE'])
            business_city_writer = csv.writer(business_city_csv, escapechar='\\', quotechar='"',
                                              quoting=csv.QUOTE_ALL)
            city_writer = csv.writer(city_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            unique_cities = set()
            for line in business_json:
                item = json.loads(line)
                if item["city"].strip():
                    unique_cities.add(item["city"])
                    business_city_writer.writerow([item["business_id"], item["city"], "IN_CITY"])
            for city in unique_cities:
                city_writer.writerow([city, "City"])
        print("business -> city mapped ...")

    if not os.path.isfile("data/user_header.csv"):
        with open("dataset/user.json") as user_json, \
                open("data/user.csv", 'w+') as user_csv, \
                open("data/user_FRIENDS_user.csv", 'w+') as user_user_csv:
            write_header("data/user_header.csv", ['id:ID(User)', 'name', ':LABEL'])
            write_header("data/user_FRIENDS_user_header.csv", [':START_ID(User)', ':END_ID(User)', ':TYPE'])
            user_writer = csv.writer(user_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            user_user_writer = csv.writer(user_user_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            for line in user_json:
                item = json.loads(line)
                user_writer.writerow([item["user_id"], item["name"], "User"])
                # "friends" is a comma-separated string ("None" when empty), not a list
                if item["friends"] and item["friends"] != "None":
                    for friend_id in item["friends"].split(", "):
                        user_user_writer.writerow([item["user_id"], friend_id, "FRIENDS"])
        print("User dataset converted ...")

    if not os.path.isfile("data/review_header.csv"):
        with open("dataset/review.json") as review_json, \
                open("data/review.csv", 'w+') as review_csv, \
                open("data/user_WROTE_review.csv", 'w+') as user_review_csv, \
                open("data/review_REVIEWS_business.csv", 'w+') as review_business_csv:
            write_header("data/review_header.csv", ['id:ID(Review)', 'text', 'stars:int', 'date', ':LABEL'])
            write_header("data/user_WROTE_review_header.csv", [':START_ID(User)', ':END_ID(Review)', ':TYPE'])
            write_header("data/review_REVIEWS_business_header.csv",
                         [':START_ID(Review)', ':END_ID(Business)', ':TYPE'])
            review_writer = csv.writer(review_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            user_review_writer = csv.writer(user_review_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            review_business_writer = csv.writer(review_business_csv, escapechar='\\', quotechar='"',
                                                quoting=csv.QUOTE_ALL)
            for line in review_json:
                item = json.loads(line)
                review_writer.writerow([item["review_id"], item["text"], item["stars"], item["date"], "Review"])
                user_review_writer.writerow([item["user_id"], item["review_id"], "WROTE"])
                review_business_writer.writerow([item["review_id"], item["business_id"], "REVIEWS"])
        print("Review dataset converted ...")

    if not os.path.isfile("data/tip_header.csv"):
        with open("dataset/tip.json") as tip_json, \
                open("data/tip.csv", 'w+') as tip_csv, \
                open("data/user_WROTE_tip.csv", 'w+') as user_tip_csv, \
                open("data/tip_ABOUT_business.csv", 'w+') as tip_business_csv:
            # tips have no id of their own, so we derive one from a hash of the text;
            # they also have no star rating, so no stars column here
            write_header("data/tip_header.csv", ['id:ID(tip)', 'text', 'date', ':LABEL'])
            write_header("data/user_WROTE_tip_header.csv", [':START_ID(User)', ':END_ID(tip)', ':TYPE'])
            write_header("data/tip_ABOUT_business_header.csv", [':START_ID(tip)', ':END_ID(Business)', ':TYPE'])
            tip_writer = csv.writer(tip_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            user_tip_writer = csv.writer(user_tip_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            tip_business_writer = csv.writer(tip_business_csv, escapechar='\\', quotechar='"',
                                             quoting=csv.QUOTE_ALL)
            for line in tip_json:
                item = json.loads(line)
                tip_id = hashlib.shake_256(item["text"].encode()).hexdigest(5)
                tip_writer.writerow([tip_id, item["text"], item["date"], "Tip"])
                user_tip_writer.writerow([item["user_id"], tip_id, "WROTE_TIP"])
                tip_business_writer.writerow([tip_id, item["business_id"], "TIP_ABOUT"])
        print("Tip dataset converted ...")

    if not os.path.isfile("data/area_header.csv"):
        with open("dataset/businessLocations.json") as business_locations_json, \
                open("data/area.csv", 'w+') as area_csv, \
                open("data/country.csv", 'w+') as country_csv, \
                open("data/city_IN_AREA_area.csv", 'w+') as city_area_csv, \
                open("data/area_IN_COUNTRY_country.csv", 'w+') as area_country_csv:
            # renamed from "input" to avoid shadowing the builtin
            locations = json.load(business_locations_json)
            write_header("data/area_header.csv", ['name:ID(Area)', ':LABEL'])
            write_header("data/country_header.csv", ['name:ID(Country)', ':LABEL'])
            write_header("data/city_IN_AREA_area_header.csv", [':START_ID(City)', ':END_ID(Area)', ':TYPE'])
            write_header("data/area_IN_COUNTRY_country_header.csv",
                         [':START_ID(Area)', ':END_ID(Country)', ':TYPE'])
            area_writer = csv.writer(area_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            country_writer = csv.writer(country_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            city_area_writer = csv.writer(city_area_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            area_country_writer = csv.writer(area_country_csv, escapechar='\\', quotechar='"',
                                             quoting=csv.QUOTE_ALL)
            unique_areas = set()
            unique_countries = set()
            unique_city_areas = set()
            unique_area_countries = set()
            for business_id in locations:
                if locations[business_id]["admin1"]:
                    unique_areas.add(locations[business_id]["admin1"])
                    unique_countries.add(locations[business_id]["country"])
                    unique_city_areas.add((locations[business_id]["city"], locations[business_id]["admin1"]))
                    unique_area_countries.add((locations[business_id]["admin1"], locations[business_id]["country"]))
            for area in unique_areas:
                area_writer.writerow([area, "Area"])
            for country in unique_countries:
                country_writer.writerow([country, "Country"])
            for city, area in unique_city_areas:
                city_area_writer.writerow([city, area, "IN_AREA"])
            for area, country in unique_area_countries:
                area_country_writer.writerow([area, country, "IN_COUNTRY"])
        print("Area Country dataset converted ...")

    if not os.path.isfile("data/category_header.csv"):
        with open("dataset/business.json") as business_json, \
                open("data/category.csv", 'w+') as categories_csv, \
                open("data/business_IN_CATEGORY_category.csv", 'w+') as business_category_csv:
            write_header("data/category_header.csv", ['name:ID(Category)', ':LABEL'])
            write_header("data/business_IN_CATEGORY_category_header.csv",
                         [':START_ID(Business)', ':END_ID(Category)', ':TYPE'])
            business_category_writer = csv.writer(business_category_csv, escapechar='\\', quotechar='"',
                                                  quoting=csv.QUOTE_ALL)
            category_writer = csv.writer(categories_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            unique_categories = set()
            for line in business_json:
                item = json.loads(line)
                if item["categories"]:
                    # "categories" is a single comma-separated string
                    for category in item["categories"].split(', '):
                        unique_categories.add(category)
                        business_category_writer.writerow([item["business_id"], category, "IN_CATEGORY"])
            for category in unique_categories:
                category_writer.writerow([category, "Category"])
        print("business Category dataset converted ...")
It is important that files that define nodes have the following format.
business_header.csv — notice we define the field id as the ID of a node of type Business in Neo4j, and the :LABEL column carries the node label:
id:ID(Business),name,address,city,state,:LABEL
business.csv — notice that the data follows the column fields above, with the label "Business" at the end of each line:
"Pns2l4eNsfO8kk83dixA6A","Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2","Santa Barbara","CA","Business"
"mpf3x-BjTdTEA3yCZrAYPw","The UPS Store","87 Grasso Plaza Shopping Center","Affton","MO","Business"
"tUFrWirKiKi_TAnsVWINQQ","Target","5255 E Broadway Blvd","Tucson","AZ","Business"
"MTSW4McQd7CbVtyjqoe9mw","St Honore Pastries","935 Race St","Philadelphia","PA","Business"
"mWMc6_wTdE0EUBKIGXDVfA","Perkiomen Valley Brewery","101 Walnut St","Green Lane","PA","Business"
The same pattern applies to city_header.csv and city.csv.
city_header.csv
name:ID(City),:LABEL
city.csv
"Woolwich Twp.","City"
"McCarran","City"
"Berwyn","City"
"Mount Laurel Township","City"
"Huntingdon Valley PA","City"
"Newtown Square","City"
As for defining relationships, the files should conform to the same convention.
business_IN_CITY_city_header.csv — specifying the node types at each end, plus the relationship type:
:START_ID(Business),:END_ID(City),:TYPE
business_IN_CITY_city.csv — having the ID of each node type according to the column, plus the name of the relationship, in the example below "IN_CITY":
"Pns2l4eNsfO8kk83dixA6A","Santa Barbara","IN_CITY"
"mpf3x-BjTdTEA3yCZrAYPw","Affton","IN_CITY"
"tUFrWirKiKi_TAnsVWINQQ","Tucson","IN_CITY"
"MTSW4McQd7CbVtyjqoe9mw","Philadelphia","IN_CITY"
"mWMc6_wTdE0EUBKIGXDVfA","Green Lane","IN_CITY"
"CF33F8-E6oudUQ46HnavjQ","Ashland City","IN_CITY"
3. Neo4j latest version using docker-compose
Now that you have your files ready, you need to spin up Neo4j. In this example we will spin up the latest version using Docker. The docker-compose script below specifies Neo4j and a Jupyter notebook:
version: '3.7'
services:
  neo4j:
    image: neo4j:latest
    container_name: 'neo-gds1.5'
    # uncomment build: . to build from a Dockerfile instead
    #build: .
    labels:
      description: Neo4j container with apoc and gds
      name: refreshed_neo4j
    ports:
      - '7474:7474'
      - '7473:7473'
      - '7687:7687'
    volumes:
      - $PWD/neo4j/data:/data
      # take care with permissions:
      # sudo chmod -R a+rwX ./data
      - $PWD/neo4j/import:/var/lib/neo4j/import
    environment:
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_AUTH=none
      # Neo4j 4.x works with multiple databases. It will mount a system database, but you can switch.
      # Note: dbms.active_database is a 3.x setting; in 4.x the equivalent is dbms.default_database.
      - NEO4J_dbms_default__database=neo4.db
      - NEO4JLABS_PLUGINS=["apoc", "graph-data-science"]
      # Neo4j config passed via environment variables needs the NEO4J_ prefix
      - NEO4J_apoc_export_file_enabled=true
      - NEO4J_apoc_import_file_enabled=true
      #- NEO4J_dbms_memory_pagecache_size=1G
      #- NEO4J_dbms_memory_heap_initial__size=4G
      #- NEO4J_dbms_memory_heap_max__size=16G
    networks:
      - pyneo
  notebook:
    container_name: SNAP
    #image: snap-img
    build: .
    labels:
      description: Python3 with SNAP packages and Jupyter Notebooks
      name: jupyter notebook
    ports:
      - '8888:8888'
    volumes:
      - .:/usr/snap/notebooks
    # Run shell command for notebook on start
    entrypoint: jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root
    #runtime: "nvidia"
    networks:
      - pyneo
networks:
  pyneo:
    driver: 'bridge'
Dockerfile for the notebook:
# Docker image with SNAP network analysis tools
# http://snap.stanford.edu/snappy/
# Use an official Python runtime as a parent image
FROM python:3.9
LABEL maintainer="RS"
# Set the working directory to /usr/snap
WORKDIR /usr/snap
# Copy the requirements file into the container at /usr/snap
COPY requirements.txt /usr/snap
RUN pip install --upgrade pip && \
    pip install --trusted-host pypi.python.org -r requirements.txt
and the requirements.txt for the notebook libraries:
jupyter
requests
snap-stanford
numpy
pandas
scipy
networkx
neonx
neo4j
tabulate
With that, run:
docker-compose up
and you should have Neo4j at port 7474 and the Jupyter notebook at the URL printed in the container logs. Since we set NEO4J_AUTH=none in the compose file, no username or password is required to log in.
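Before importing anything, you can optionally confirm that the APOC and GDS plugins were loaded by running, for example, their version functions in the Neo4j browser:
RETURN apoc.version();
RETURN gds.version();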
4. Bulk upload using neo4j-admin import
In a terminal, get into the Neo4j container using
docker exec -it <neo4jcontainername> bash
and go to the import folder, where the CSV files are. This folder should be mounted according to the volumes specification in the docker-compose.yml file. Make sure the files are there from the side of your host.
Then run neo4j-admin import as below:
neo4j-admin import --database=neo4.db --force \
    --nodes=user_header.csv,user.csv \
    --relationships=user_FRIENDS_user_header.csv,user_FRIENDS_user.csv \
    --nodes=business_header.csv,business.csv \
    --nodes=category_header.csv,category.csv \
    --relationships=business_IN_CATEGORY_category_header.csv,business_IN_CATEGORY_category.csv \
    --nodes=review_header.csv,review.csv \
    --relationships=review_REVIEWS_business_header.csv,review_REVIEWS_business.csv \
    --relationships=user_WROTE_review_header.csv,user_WROTE_review.csv \
    --nodes=city_header.csv,city.csv \
    --relationships=business_IN_CITY_city_header.csv,business_IN_CITY_city.csv \
    --nodes=tip_header.csv,tip.csv \
    --relationships=tip_ABOUT_business_header.csv,tip_ABOUT_business.csv \
    --relationships=user_WROTE_tip_header.csv,user_WROTE_tip.csv \
    --skip-duplicate-nodes=true --skip-bad-relationships --ignore-empty-strings=true --multiline-fields=true
You should see output like this:
Selecting JVM - Version:11.0.15+10, Name:OpenJDK 64-Bit Server VM, Vendor:Oracle Corporation
Neo4j version: 4.4.6
Importing the contents of these files into /data/databases/neo4.db:
Nodes:
/var/lib/neo4j/import/user_header.csv
/var/lib/neo4j/import/user.csv
/var/lib/neo4j/import/business_header.csv
/var/lib/neo4j/import/business.csv
/var/lib/neo4j/import/category_header.csv
/var/lib/neo4j/import/category.csv
/var/lib/neo4j/import/review_header.csv
/var/lib/neo4j/import/review.csv
/var/lib/neo4j/import/city_header.csv
/var/lib/neo4j/import/city.csv
/var/lib/neo4j/import/tip_header.csv
/var/lib/neo4j/import/tip.csv
Relationships:
/var/lib/neo4j/import/user_FRIENDS_user_header.csv
/var/lib/neo4j/import/user_FRIENDS_user.csv
/var/lib/neo4j/import/business_IN_CATEGORY_category_header.csv
/var/lib/neo4j/import/business_IN_CATEGORY_category.csv
/var/lib/neo4j/import/review_REVIEWS_business_header.csv
/var/lib/neo4j/import/review_REVIEWS_business.csv
/var/lib/neo4j/import/user_WROTE_review_header.csv
/var/lib/neo4j/import/user_WROTE_review.csv
/var/lib/neo4j/import/business_IN_CITY_city_header.csv
/var/lib/neo4j/import/business_IN_CITY_city.csv
/var/lib/neo4j/import/tip_ABOUT_business_header.csv
/var/lib/neo4j/import/tip_ABOUT_business.csv
/var/lib/neo4j/import/user_WROTE_tip_header.csv
/var/lib/neo4j/import/user_WROTE_tip.csv
Available resources:
Total machine memory: 7.774GiB
Free machine memory: 4.707GiB
Max heap memory : 1.945GiB
Processors: 3
Configured max memory: 5.246GiB
High-IO: true
Type normalization:
Property type of 'review_count' normalized from 'int' --> 'long' in /var/lib/neo4j/import/user.csv
Property type of 'fans' normalized from 'int' --> 'long' in /var/lib/neo4j/import/user.csv
Import starting 2022-05-09 13:16:24.119+0000
Estimated number of nodes: 10.03 M
Estimated number of node properties: 68.36 M
Estimated number of relationships: 131.24 M
Estimated number of relationship properties: 229.24 M
Estimated disk space usage: 14.27GiB
Estimated required memory usage: 1.119GiB
(1/4) Node import 2022-05-09 13:16:24.129+0000
Estimated number of nodes: 10.03 M
Estimated disk space usage: 5.732GiB
Estimated required memory usage: 1.119GiB
.......... .......... .......... .......... .......... 5% ∆9s 659ms
.......... .......... .......... .......... .......... 10% ∆19s 70ms
.......... .......... .......... .......... .......... 15% ∆25s 288ms
.......... .......... .......... .......... .......... 20% ∆23s 873ms
.......... .......... .......... .......... .......... 25% ∆24s 873ms
.......... .......... .......... .......... .......... 30% ∆26s 70ms
.......... .......... .......... ...-...... .......... 35% ∆266ms
.......... .......... .......... .......... .......... 40% ∆0ms
.......... .......... .......... .......... .......... 45% ∆1s 203ms
.......... .......... .......... .......... .......... 50% ∆1s 2ms
.......... .......... .......... .......... .......... 55% ∆1s
.......... .......... .......... .......... .......... 60% ∆802ms
.......... .......... .......... .......... .......... 65% ∆802ms
.......... .......... .......... .......... .......... 70% ∆201ms
.......... .......... .......... .......... .......... 75% ∆201ms
.......... .......... .......... .......... .......... 80% ∆202ms
.......... .......... .......... .......... .......... 85% ∆200ms
.......... .......... .......... .......... .......... 90% ∆1ms
.......... .......... .......... .......... .......... 95% ∆0ms
.......... .......... ......-... .......... .......... 100% ∆1s 529ms
Node import COMPLETED in 2m 21s 997ms
...
Post processing COMPLETED in 41s 611ms
IMPORT DONE in 38m 56s 412ms.
Imported:
9982283 nodes
31229043 relationships
100244985 properties
Peak memory usage: 1.126GiB
There were bad entries which were skipped and logged into /var/lib/neo4j/import/import.report
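Since some bad entries were skipped, it is worth glancing at that report from inside the container, for example:
docker exec -it <neo4jcontainername> head -n 20 /var/lib/neo4j/import/import.report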
5. Check it out in Neo4j
To check it out, log into Neo4j at localhost:7474, select the neo4.db database and run a schema visualization.
// What is related, and how
CALL db.schema.visualization()
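A couple of sanity-check queries can also help, for example counting nodes per label and walking a sample User-Review-Business path (a sketch; adjust the LIMIT to taste):
// node counts per label
MATCH (n) RETURN labels(n) AS label, count(*) AS total ORDER BY total DESC;
// a few users, one of their reviews and the reviewed business
MATCH (u:User)-[:WROTE]->(r:Review)-[:REVIEWS]->(b:Business)
RETURN u.name, r.stars, b.name LIMIT 5;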
In the next article, we will start doing some analysis of the dataset using Neo4j Cypher and the Jupyter notebook.