yelp

Analyzing the Yelp dataset in neo4j using docker

Last updated: May 15, 2022

Yelp is a website that publishes crowd-sourced reviews about businesses. The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can submit a review of their products or services using a one-to-five-star rating scale. Businesses can also update contact information, hours, and other basic listing information or add special deals. In addition to writing reviews, users can react to reviews, plan events, or discuss their personal lives.

It has been operating since 2004, expanded into other countries, and became a public company in 2012. As of December 31, 2021, approximately 244.4 million reviews were available on its business listing pages. In 2021, the company had 46 million unique visitors to its desktop webpages and 56.7 million unique visitors to its mobile sites.

Since 2014, Yelp has published a sample of its dataset every year and opened a data science challenge.

In this article we will extract, transform and load this dataset into neo4j, running it in a docker container. The advantage of using docker is that you get an isolated and portable environment for running the latest version of neo4j with all the dependencies needed to run it, such as the java-jdk and the neo4j plugins (APOC and Graph Data Science).

1. Getting the dataset from Yelp

The dataset can be downloaded from here. The files are large, and therefore they need to be loaded into neo4j using neo4j-admin import.

In fact, there are two ways to import data from CSV files into neo4j: via neo4j-admin import or via LOAD CSV.

With the neo4j-admin import command, you can do batch imports of large amounts of data into a previously unused database from CSV files. The command can be performed only once per database. By default, this database is set to neo4j, but you can use the --database=<database> option to import your data into a different database.

With LOAD CSV, you can import small to medium-sized CSV files into an existing database. LOAD CSV can be run as many times as needed and does not require an empty database.
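
As an illustration, here is a minimal sketch of the LOAD CSV route using the neo4j python driver (from the requirements.txt further below). It assumes neo4j is already up as in section 3, auth is disabled, and a hypothetical cities.csv has been placed in neo4j's import folder:

from neo4j import GraphDatabase

# Minimal LOAD CSV sketch; cities.csv is a hypothetical example file
# that would need to sit in neo4j's import folder, and auth is assumed
# to be disabled (NEO4J_AUTH=none, as configured in section 3).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)

load_cities = """
LOAD CSV WITH HEADERS FROM 'file:///cities.csv' AS row
MERGE (c:City {name: row.name})
"""

with driver.session() as session:
    session.run(load_cities)
driver.close()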

2. Preparing the dataset for bulk upload

Since the yelp dataset files are in JSON format, we need to transform them into CSV files, and for this task we are going to use the following python script:

import csv
import json
import os
import hashlib

def write_header(file_name, columns):
    with open(file_name, 'w') as file_csv:
        writer = csv.writer(file_csv)
        writer.writerow(columns)

if __name__ == "__main__":
    print("JSON to CSV Data converter starting ...")
    os.makedirs("data", exist_ok=True)  # make sure the output folder exists
    if not os.path.isfile("data/business_header.csv"):
        with open("dataset/business.json") as business_json, \
                open("data/business.csv", 'w+') as business_csv:

            write_header("data/business_header.csv", ['id:ID(Business)', 'name', 'address', 'city', 'state'])

            business_writer = csv.writer(business_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            for line in business_json.readlines():
                item = json.loads(line)
                try:
                    business_writer.writerow(
                        [item['business_id'], item['name'], item['address'], item['city'], item['state'], 'Business'])
                except Exception as e:
                    print(item)
                    raise e
    print("business dataset converted ...")

    if not os.path.isfile("data/city_header.csv"):
        with open("dataset/business.json") as business_json, \
                open("data/city.csv", "w+") as city_csv, \
                open("data/business_IN_CITY_city.csv", "w") as business_city_csv:

            write_header("data/city_header.csv", ['name:ID(City)'])
            write_header("data/business_IN_CITY_city_header.csv", [':START_ID(Business)', ':END_ID(City)'])

            business_city_writer = csv.writer(business_city_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            city_writer = csv.writer(city_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            unique_cities = set()
            for line in business_json.readlines():
                item = json.loads(line)

                if item["city"].strip():
                    unique_cities.add(item["city"])
                    business_city_writer.writerow([item["business_id"], item["city"], "IN_CITY"])

            for city in unique_cities:
                city_writer.writerow([city, "City"])
    print("business -> city mapped ...")

    if not os.path.isfile("data/user_header.csv"):
        with open("dataset/user.json") as user_json, \
                open("data/user.csv", 'w+') as user_csv, \
                open("data/user_FRIENDS_user.csv", 'w+') as user_user_csv:

            write_header("data/user_header.csv", ['id:ID(User)', 'name'])
            write_header("data/user_FRIENDS_user_header.csv", [':START_ID(User)', ':END_ID(User)'])

            user_writer = csv.writer(user_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            user_user_writer = csv.writer(user_user_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            for line in user_json.readlines():
                item = json.loads(line)
                user_writer.writerow([item["user_id"], item["name"], "User"])
                # in the dump, "friends" is a comma-separated string of user ids
                friends = item["friends"]
                if isinstance(friends, str):
                    friends = friends.split(", ")
                for friend_id in friends:
                    user_user_writer.writerow([item["user_id"], friend_id, "FRIENDS"])
    print("User dataset converted ...")

    if not os.path.isfile("data/review_header.csv"):
        with open("dataset/review.json") as review_json, \
                open("data/review.csv", 'w+') as review_csv, \
                open("data/user_WROTE_review.csv", 'w+') as user_review_csv, \
                open("data/review_REVIEWS_business.csv", 'w+') as review_business_csv:

            write_header("data/review_header.csv", ['id:ID(Review)', 'text', 'stars:int', 'date'])
            write_header("data/user_WROTE_review_header.csv", [':START_ID(User)', ':END_ID(Review)'])
            write_header("data/review_REVIEWS_business_header.csv", [':START_ID(Review)', ':END_ID(Business)'])

            review_writer = csv.writer(review_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            user_review_writer = csv.writer(user_review_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            review_business_writer = csv.writer(review_business_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            for line in review_json.readlines():
                item = json.loads(line)
                review_writer.writerow([item["review_id"], item["text"], item["stars"], item["date"], "Review"])
                user_review_writer.writerow([item["user_id"], item["review_id"], "WROTE"])
                review_business_writer.writerow([item["review_id"], item["business_id"], "REVIEWS"])
    print("Review dataset converted ...  Game Over")

    if not os.path.isfile("data/tip_header.csv"):
        with open("dataset/tip.json") as tip_json, \
                open("data/tip.csv", 'w+') as tip_csv, \
                open("data/user_WROTE_tip.csv", 'w+') as user_tip_csv, \
                open("data/tip_ABOUT_business.csv", 'w+') as tip_business_csv:

            write_header("data/tip_header.csv", ['id:ID(tip)', 'text', 'stars:int', 'date',':LABEL'])
            write_header("data/user_WROTE_tip_header.csv", [':START_ID(User)', ':END_ID(tip)',':TYPE'])
            write_header("data/tip_ABOUT_business_header.csv", [':START_ID(tip)', ':END_ID(Business)',':TYPE'])

            tip_writer = csv.writer(tip_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            user_tip_writer = csv.writer(user_tip_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            tip_business_writer = csv.writer(tip_business_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            for line in tip_json.readlines():
                item = json.loads(line)
                # tips have no natural id, so derive a short synthetic one by hashing the text
                item_text = hashlib.shake_256(item["text"].encode()).hexdigest(5)
                tip_writer.writerow([item_text, item["text"], item["date"], "Tip"])
                user_tip_writer.writerow([item["user_id"], item_text, "WROTE_TIP"])
                tip_business_writer.writerow([item_text, item["business_id"], "TIP_ABOUT"])
    print("Tip dataset converted ...  Game Over")

    if not os.path.isfile("data/area_header.csv"):
        with open("dataset/businessLocations.json") as business_locations_json, \
                open("data/area.csv", "w+") as area_csv, \
                open("data/country.csv", "w+") as country_csv, \
                open("data/city_IN_AREA_area.csv", "w+") as city_area_csv, \
                open("data/area_IN_COUNTRY_country.csv", "w+") as area_country_csv:
            locations = json.load(business_locations_json)

            write_header("data/area_header.csv", ['name:ID(Area)'])
            write_header("data/country_header.csv", ['name:ID(Country)'])

            write_header("data/city_IN_AREA_area_header.csv", [':START_ID(City)', ':END_ID(Area)'])
            write_header("data/area_IN_COUNTRY_country_header.csv", [':START_ID(Area)', ':END_ID(Country)'])

            area_writer = csv.writer(area_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            country_writer = csv.writer(country_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            city_area_writer = csv.writer(city_area_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            area_country_writer = csv.writer(area_country_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            unique_areas = set()
            unique_countries = set()
            unique_city_areas = set()
            unique_area_countries = set()

            for business_id in locations:
                location = locations[business_id]
                if location["admin1"]:
                    unique_areas.add(location["admin1"])
                    unique_countries.add(location["country"])

                    unique_city_areas.add((location["city"], location["admin1"]))
                    unique_area_countries.add((location["admin1"], location["country"]))

            for area in unique_areas:
                area_writer.writerow([area])

            for country in unique_countries:
                country_writer.writerow([country])

            for city, area in unique_city_areas:
                city_area_writer.writerow([city, area])

            for area, country in unique_area_countries:
                area_country_writer.writerow([area, country])
    print("Area Country dataset converted ...")

    if not os.path.isfile("data/category_header.csv"):
        with open("dataset/business.json") as business_json, \
                open("data/category.csv", 'w+') as categories_csv, \
                open("data/business_IN_CATEGORY_category.csv", 'w+') as business_category_csv:

            write_header("data/category_header.csv", ['name:ID(Category)'])
            write_header("data/business_IN_CATEGORY_category_header.csv", [':START_ID(Business)', ':END_ID(Category)'])

            business_category_writer = csv.writer(business_category_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)
            category_writer = csv.writer(categories_csv, escapechar='\\', quotechar='"', quoting=csv.QUOTE_ALL)

            unique_categories = set()
            for line in business_json.readlines():
                item = json.loads(line)
                try:
                    if item["categories"]:
                        # categories come as a single comma-separated string
                        for category in item["categories"].split(', '):
                            unique_categories.add(category)
                            business_category_writer.writerow([item["business_id"], category, "IN_CATEGORY"])
                except ValueError as err:
                    print(err)

            for category in unique_categories:
                category_writer.writerow([category, "Category"])
    print("business Category dataset converted ...")

It is important that the files that define nodes have this format:

business_header.csv; notice that we define the field id as the ID of a node of type Business in neo4j, and add a :LABEL column for the node label.

id:ID(Business),name,address,city,state,:LABEL

business.csv; notice that the data follows the column fields above, with the label "Business" at the end of each line.

"Pns2l4eNsfO8kk83dixA6A","Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2","Santa Barbara","CA","Business"
"mpf3x-BjTdTEA3yCZrAYPw","The UPS Store","87 Grasso Plaza Shopping Center","Affton","MO","Business"
"tUFrWirKiKi_TAnsVWINQQ","Target","5255 E Broadway Blvd","Tucson","AZ","Business"
"MTSW4McQd7CbVtyjqoe9mw","St Honore Pastries","935 Race St","Philadelphia","PA","Business"
"mWMc6_wTdE0EUBKIGXDVfA","Perkiomen Valley Brewery","101 Walnut St","Green Lane","PA","Business"

The same pattern applies to city_header.csv and city.csv.

city_header.csv

name:ID(City),:LABEL

city.csv

"Woolwich Twp.","City"
"McCarran","City"
"Berwyn","City"
"Mount Laurel Township","City"
"Huntingdon Valley PA","City"
"Newtown Square","City"

As for defining relationships, the files should conform to the following format:

business_IN_CITY_city_header.csv, specifying the node types and the relationship type column:

:START_ID(Business),:END_ID(City),:TYPE

business_IN_CITY_city.csv, holding the ID of each node according to the columns, plus the name of the relationship; in the example below, "IN_CITY".

"Pns2l4eNsfO8kk83dixA6A","Santa Barbara","IN_CITY"
"mpf3x-BjTdTEA3yCZrAYPw","Affton","IN_CITY"
"tUFrWirKiKi_TAnsVWINQQ","Tucson","IN_CITY"
"MTSW4McQd7CbVtyjqoe9mw","Philadelphia","IN_CITY"
"mWMc6_wTdE0EUBKIGXDVfA","Green Lane","IN_CITY"
"CF33F8-E6oudUQ46HnavjQ","Ashland City","IN_CITY"

3. Neo4j latest version using docker-compose

Now that you have your files ready, you need to spin up neo4j. In this example we will spin up the latest version using docker. The docker-compose script below specifies neo4j and a Jupyter notebook:

version: '3.7'
services:
  neo4j:
    image: neo4j:latest
    container_name: 'neo-gds1.5'
    # build . only for use Dockerfile
    #build: .
    labels:
      description: Neo4j container with apoc and gds
      name: refreshed_neo4j
    ports:
      - '7474:7474'
      - '7473:7473'
      - '7687:7687'
    volumes:
      - $PWD/neo4j/data:/data
      # take care with permission
      # sudo chmod -R a+rwX ./data
      - $PWD/neo4j/import:/var/lib/neo4j/import
    environment:
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_AUTH=none
      - NEO4J_dbms_default__database=neo4.db
      # New version neo4j works with multiple databases. It will mount with a system database, but you can switch
      - NEO4JLABS_PLUGINS=["apoc", "graph-data-science"]
      - NEO4J_apoc_export_file_enabled=true
      - NEO4J_apoc_import_file_enabled=true
      #- NEO4J_dbms_memory_pagecache_size=1G
      #- NEO4J_dbms_memory_heap_initial__size=4G
      #- NEO4J_dbms_memory_heap_max__size=16G
    networks:
      - pyneo

  notebook:
    container_name: SNAP
    #image: snap-img
    build: .
    labels:
      description: Python3 with SNAP packages and Jupyter Notebooks
      name: jupyter notebook
    ports:
      - '8888:8888'
    volumes:
      - .:/usr/snap/notebooks
    # Run shell command for notebook on start
    entrypoint: jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root
    #runtime: "nvidia"
    networks:
      - pyneo

networks:
  pyneo:
    driver: 'bridge'

Dockerfile for the notebook:

# Docker image with SNAP network analysis tools
# http://snap.stanford.edu/snappy/

# Use an official Python runtime as a parent image
FROM python:3.9


LABEL maintainer="RS"

# Set the working directory to /usr/snap
WORKDIR /usr/snap

# Copy the requirements file into the container at /usr/snap
COPY requirements.txt /usr/snap

RUN pip install --upgrade pip && pip install --trusted-host pypi.python.org -r requirements.txt

and the requirements.txt for the notebook libraries:

jupyter
requests
snap-stanford
numpy
pandas
scipy
networkx
neonx
neo4j
tabulate

With that, run:

docker-compose up

and you should have neo4j at port 7474 and the Jupyter notebook at the URL printed by the notebook container on startup. Since NEO4J_AUTH=none is set in the compose file, neo4j will not ask for a username or password.
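
From the notebook container, you can also verify the connection with the neo4j python driver. A minimal sketch; inside the compose network the database is reachable by its service name, neo4j:

from neo4j import GraphDatabase

# Quick connectivity check from the notebook container; auth=None
# matches NEO4J_AUTH=none in the compose file.
driver = GraphDatabase.driver("bolt://neo4j:7687", auth=None)
with driver.session() as session:
    result = session.run("RETURN 1 AS ok")
    print(result.single()["ok"])  # prints 1 if neo4j is reachable
driver.close()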

4. Bulk upload using neo4j-admin import

In a terminal, get into the neo4j container using:

docker exec -it <neo4jcontainername> bash

Go to the import folder, where the CSV files are. This folder should be mounted according to the volumes specification in the docker-compose.yml file. Make sure the files are visible there from your host's side.

Then run the neo4j-admin import as below:

neo4j-admin import --database=neo4.db --force --nodes=user_header.csv,user.csv --relationships=user_FRIENDS_user_header.csv,user_FRIENDS_user.csv --nodes=business_header.csv,business.csv --nodes=category_header.csv,category.csv --relationships=business_IN_CATEGORY_category_header.csv,business_IN_CATEGORY_category.csv --nodes=review_header.csv,review.csv --relationships=review_REVIEWS_business_header.csv,review_REVIEWS_business.csv --relationships=user_WROTE_review_header.csv,user_WROTE_review.csv --nodes=city_header.csv,city.csv --relationships=business_IN_CITY_city_header.csv,business_IN_CITY_city.csv --nodes=tip_header.csv,tip.csv --relationships=tip_ABOUT_business_header.csv,tip_ABOUT_business.csv --relationships=user_WROTE_tip_header.csv,user_WROTE_tip.csv --skip-duplicate-nodes=true --skip-bad-relationships --ignore-empty-strings=true --multiline-fields=true
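
The command is long and easy to mistype. As a convenience, a sketch of my own (not part of neo4j's tooling) can assemble the flags from the naming convention used here, where relationship files carry an uppercase relationship name between the node names:

import glob

# Assemble the neo4j-admin import arguments from the header/data file
# naming convention: a file like business_IN_CITY_city.csv has an
# uppercase segment, so it is treated as a relationship file.
args = ["neo4j-admin", "import", "--database=neo4.db", "--force"]
for header in sorted(glob.glob("*_header.csv")):
    data = header.replace("_header", "")
    kind = "--relationships" if any(part.isupper() for part in data.split("_")) else "--nodes"
    args.append(f"{kind}={header},{data}")
args += ["--skip-duplicate-nodes=true", "--skip-bad-relationships",
         "--ignore-empty-strings=true", "--multiline-fields=true"]
print(" ".join(args))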

Running the command, you should see output like this:

Selecting JVM - Version:11.0.15+10, Name:OpenJDK 64-Bit Server VM, Vendor:Oracle Corporation
Neo4j version: 4.4.6
Importing the contents of these files into /data/databases/neo4.db:
Nodes:
  /var/lib/neo4j/import/user_header.csv
  /var/lib/neo4j/import/user.csv
  /var/lib/neo4j/import/business_header.csv
  /var/lib/neo4j/import/business.csv
  /var/lib/neo4j/import/category_header.csv
  /var/lib/neo4j/import/category.csv
  /var/lib/neo4j/import/review_header.csv
  /var/lib/neo4j/import/review.csv
  /var/lib/neo4j/import/city_header.csv
  /var/lib/neo4j/import/city.csv
  /var/lib/neo4j/import/tip_header.csv
  /var/lib/neo4j/import/tip.csv

Relationships:
  /var/lib/neo4j/import/user_FRIENDS_user_header.csv
  /var/lib/neo4j/import/user_FRIENDS_user.csv
  /var/lib/neo4j/import/business_IN_CATEGORY_category_header.csv
  /var/lib/neo4j/import/business_IN_CATEGORY_category.csv
  /var/lib/neo4j/import/review_REVIEWS_business_header.csv
  /var/lib/neo4j/import/review_REVIEWS_business.csv
  /var/lib/neo4j/import/user_WROTE_review_header.csv
  /var/lib/neo4j/import/user_WROTE_review.csv
  /var/lib/neo4j/import/business_IN_CITY_city_header.csv
  /var/lib/neo4j/import/business_IN_CITY_city.csv
  /var/lib/neo4j/import/tip_ABOUT_business_header.csv
  /var/lib/neo4j/import/tip_ABOUT_business.csv
  /var/lib/neo4j/import/user_WROTE_tip_header.csv
  /var/lib/neo4j/import/user_WROTE_tip.csv


Available resources:
  Total machine memory: 7.774GiB
  Free machine memory: 4.707GiB
  Max heap memory : 1.945GiB
  Processors: 3
  Configured max memory: 5.246GiB
  High-IO: true

Type normalization:
  Property type of 'review_count' normalized from 'int' --> 'long' in /var/lib/neo4j/import/user.csv
  Property type of 'fans' normalized from 'int' --> 'long' in /var/lib/neo4j/import/user.csv

Import starting 2022-05-09 13:16:24.119+0000
  Estimated number of nodes: 10.03 M
  Estimated number of node properties: 68.36 M
  Estimated number of relationships: 131.24 M
  Estimated number of relationship properties: 229.24 M
  Estimated disk space usage: 14.27GiB
  Estimated required memory usage: 1.119GiB

(1/4) Node import 2022-05-09 13:16:24.129+0000
  Estimated number of nodes: 10.03 M
  Estimated disk space usage: 5.732GiB
  Estimated required memory usage: 1.119GiB
.......... .......... .......... .......... ..........   5% ∆9s 659ms
.......... .......... .......... .......... ..........  10% ∆19s 70ms
.......... .......... .......... .......... ..........  15% ∆25s 288ms
.......... .......... .......... .......... ..........  20% ∆23s 873ms
.......... .......... .......... .......... ..........  25% ∆24s 873ms
.......... .......... .......... .......... ..........  30% ∆26s 70ms
.......... .......... .......... ...-...... ..........  35% ∆266ms
.......... .......... .......... .......... ..........  40% ∆0ms
.......... .......... .......... .......... ..........  45% ∆1s 203ms
.......... .......... .......... .......... ..........  50% ∆1s 2ms
.......... .......... .......... .......... ..........  55% ∆1s
.......... .......... .......... .......... ..........  60% ∆802ms
.......... .......... .......... .......... ..........  65% ∆802ms
.......... .......... .......... .......... ..........  70% ∆201ms
.......... .......... .......... .......... ..........  75% ∆201ms
.......... .......... .......... .......... ..........  80% ∆202ms
.......... .......... .......... .......... ..........  85% ∆200ms
.......... .......... .......... .......... ..........  90% ∆1ms
.......... .......... .......... .......... ..........  95% ∆0ms
.......... .......... ......-... .......... .......... 100% ∆1s 529ms

Node import COMPLETED in 2m 21s 997ms

...

Post processing COMPLETED in 41s 611ms


IMPORT DONE in 38m 56s 412ms.
Imported:
  9982283 nodes
  31229043 relationships
  100244985 properties
Peak memory usage: 1.126GiB
There were bad entries which were skipped and logged into /var/lib/neo4j/import/import.report

5. Check it out in neo4j

To check it out, log into neo4j at localhost:7474, select the neo4.db database, and run a schema visualization:

// What is related, and how
CALL db.schema.visualization()
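
Beyond the schema call, a quick count of nodes per label confirms the bulk load. A sketch using the neo4j python driver, assuming the same local setup with auth disabled:

from neo4j import GraphDatabase

# Count the imported nodes per label, most frequent first.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=None)
with driver.session(database="neo4.db") as session:
    result = session.run(
        "MATCH (n) RETURN labels(n) AS labels, count(*) AS count "
        "ORDER BY count DESC")
    for record in result:
        print(record["labels"], record["count"])
driver.close()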

In the next article, we will start analyzing the dataset using neo4j Cypher and the Jupyter notebook.

