Representing Hierarchies in Dgraph: Data Conversion using Apache Spark

Overview

Dgraph can hold many kinds of structured information. Hierarchies are a class of structured data that Dgraph handles naturally, and they are essential to several business functions, such as publishing, intuitive exploration, and reporting.

In an online marketplace, products are mapped to their respective locations in a hierarchy or taxonomy, which helps users find the products they are looking for. For example, you can traverse the “Shop By Category” feature in Amazon to identify a set of products of interest. The categories are organized in an easy-to-traverse hierarchy.

In this post, we will review how a comma-separated file containing a sample hierarchy can be transformed and loaded into Dgraph. The sample hierarchy being explored is the 2-6 digit code file from the “North American Industry Classification System” or NAICS.

NAICS Code

NAICS is organized into a hierarchy that can reach up to five levels. These five levels are summarized in the table below.

Level Description Specification
1 Sector 2-digit code
2 Subsector 3-digit code
3 Industry Group 4-digit code
4 NAICS Industry 5-digit code
5 National Industry 6-digit code

Each code is self-describing and can be intuitively understood in terms of a hierarchy. In the example below, the code for “Motorcycle, ATV, and Personal Watercraft Dealers”, 441221, is 6 digits long. It is quite easy to traverse to the immediate parent by simply dropping the last digit, which results in the code “44122”, “Motorcycle, Boat, and Other Motor Vehicle Dealers”. Further traversal can be continued in the same manner.
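This parent-derivation rule is simple enough to sketch in a few lines of plain Scala (the “ancestors” helper here is illustrative, not part of the NAICS data set):

```scala
// Walk up the hierarchy by repeatedly dropping the last digit,
// stopping once the 2-digit sector level is reached.
def ancestors(code: String): List[String] =
  if (code.length <= 2) Nil
  else {
    val parent = code.dropRight(1)
    parent :: ancestors(parent)
  }

// ancestors("441221") -> List("44122", "4412", "441", "44")
```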

The 2-6 digit code file can be freely downloaded from this link: https://www.census.gov/eos/www/naics/2017NAICS/2-6%20digit_2017_Codes.xlsx

The data, a sample of which is shown further below, is provided as a set of code-title pairs. We will need to transform it into a form suitable for loading into and traversing within Dgraph.

You can explore the hierarchy online at this link: https://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart_code=11&search=2017%20NAICS%20Search

Dgraph Schema

The Dgraph schema can simply consist of a code, a title, and a reversible link to the parent. The code is indexed with hash, while the title is given full-text and trigram indexes for easier searches. The schema is as below.

<naics.code>: string @index(hash) .
<naics.parent>: uid @reverse .
<naics.title>: string @index(fulltext, trigram) .
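The schema can be applied through Ratel’s Schema tab, or posted to Dgraph Alpha’s HTTP /alter endpoint. A sketch, assuming Alpha is running locally on the default HTTP port 8080:

```shell
# Apply the schema via Dgraph Alpha's /alter endpoint
# (assumes Alpha is listening on localhost:8080).
curl -s localhost:8080/alter -XPOST -d '
<naics.code>: string @index(hash) .
<naics.parent>: uid @reverse .
<naics.title>: string @index(fulltext, trigram) .
'
```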

Data Transformation

Now that we understand the structure of the source data and the target, we can begin loading, transforming, and ultimately importing the data into Dgraph. We have chosen Apache Spark, a popular tool for data transformation. In particular, we will be using Spark’s DataFrame and SQL features. Let’s begin.

Preparation

The data can be loaded using the commands below. The CSV file is loaded into a DataFrame named “df”. For easier readability, the three columns in the source file were renamed to “Sequence”, “NAICS_US_CODE”, and “NAICS_US_TITLE”, respectively.
Note: Please make sure that any unnecessary blank rows or columns are removed first.

val fileName="/home/ReferenceData/naics.csv"
val df= spark.read.option("header",true).csv(fileName)

The “printSchema” command on “df” yields the following result, confirming that the data has been loaded. The “show” command displays the first 20 records.

df.printSchema

root
 |-- Sequence: string (nullable = true)
 |-- NAICS_US_CODE: string (nullable = true)
 |-- NAICS_US_TITLE: string (nullable = true)

Here is a look at the data loaded from the CSV file.

df.show(false)
+--------+-------------+------------------------------------------+
|Sequence|NAICS_US_CODE|NAICS_US_TITLE                            |
+--------+-------------+------------------------------------------+
|1       |11           |Agriculture, Forestry, Fishing and Hunting|
|2       |111          |Crop Production                           |
|3       |1111         |Oilseed and Grain Farming                 |
|4       |11111        |Soybean Farming                           |
|5       |111110       |Soybean Farming                           |
|6       |11112        |Oilseed (except Soybean) Farming          |
|7       |111120       |Oilseed (except Soybean) Farming          |
|8       |11113        |Dry Pea and Bean Farming                  |
|9       |111130       |Dry Pea and Bean Farming                  |
|10      |11114        |Wheat Farming                             |
|11      |111140       |Wheat Farming                             |
|12      |11115        |Corn Farming                              |
|13      |111150       |Corn Farming                              |
|14      |11116        |Rice Farming                              |
|15      |111160       |Rice Farming                              |
|16      |11119        |Other Grain Farming                       |
|17      |111191       |Oilseed and Grain Combination Farming     |
|18      |111199       |All Other Grain Farming                   |
|19      |1112         |Vegetable and Melon Farming               |
|20      |11121        |Vegetable and Melon Farming               |
+--------+-------------+------------------------------------------+
only showing top 20 rows

Join on NAICS_US_CODE

Extracting Parent Codes (“partOfCode”)

We have seen that dropping the last digit of a code yields its parent code. We will create this truncated field, named “partOfCode”, as shown below. The “substring” function is used to drop the last digit.

import org.apache.spark.sql.functions.expr

val dfWithPOC = df.withColumn("partOfCode", expr("substring(NAICS_US_CODE, 1, length(NAICS_US_CODE)-1)"))

Joining “partOfCode” with “NAICS_US_CODE”

We can now join the dataframe “dfWithPOC” with itself to yield each node with its parent code. This format is suitable for importing into Dgraph. We have renamed some of the columns as part of the join instruction to eliminate duplicate column names resulting from the join.
Each line in the joined view contains a child-parent relationship. For example, the parent of “All Other Grain F…” with code “111199” is “Other Grain Farming” with code “11119”.
Note: The 2-digit codes will have “null” as their parent, as the dataset has no explicit root. The resulting placeholder node will still end up in Dgraph and will serve as the root.

import org.apache.spark.sql.functions._
import spark.implicits._  // enables the ' and $ column syntax outside spark-shell

val dfJoined = dfWithPOC.as("v1").select('Sequence as "SeqLHS", 'NAICS_US_CODE as "ChildCode", 'partOfCode as "code", 'NAICS_US_TITLE as "childTitle").join(dfWithPOC.as("v2"), $"code" === $"v2.NAICS_US_CODE", "left_outer")
dfJoined.show(false)
+------+---------+-----+------------------------------------------+--------+-------------+------------------------------------------+----------+
|SeqLHS|ChildCode|code |childTitle                                |Sequence|NAICS_US_CODE|NAICS_US_TITLE                            |partOfCode|
+------+---------+-----+------------------------------------------+--------+-------------+------------------------------------------+----------+
|1     |11       |1    |Agriculture, Forestry, Fishing and Hunting|null    |null         |null                                      |null      |
|2     |111      |11   |Crop Production                           |1       |11           |Agriculture, Forestry, Fishing and Hunting|1         |
|3     |1111     |111  |Oilseed and Grain Farming                 |2       |111          |Crop Production                           |11        |
|4     |11111    |1111 |Soybean Farming                           |3       |1111         |Oilseed and Grain Farming                 |111       |
|5     |111110   |11111|Soybean Farming                           |4       |11111        |Soybean Farming                           |1111      |
|6     |11112    |1111 |Oilseed (except Soybean) Farming          |3       |1111         |Oilseed and Grain Farming                 |111       |
|7     |111120   |11112|Oilseed (except Soybean) Farming          |6       |11112        |Oilseed (except Soybean) Farming          |1111      |
|8     |11113    |1111 |Dry Pea and Bean Farming                  |3       |1111         |Oilseed and Grain Farming                 |111       |
|9     |111130   |11113|Dry Pea and Bean Farming                  |8       |11113        |Dry Pea and Bean Farming                  |1111      |
|10    |11114    |1111 |Wheat Farming                             |3       |1111         |Oilseed and Grain Farming                 |111       |
|11    |111140   |11114|Wheat Farming                             |10      |11114        |Wheat Farming                             |1111      |
|12    |11115    |1111 |Corn Farming                              |3       |1111         |Oilseed and Grain Farming                 |111       |
|13    |111150   |11115|Corn Farming                              |12      |11115        |Corn Farming                              |1111      |
|14    |11116    |1111 |Rice Farming                              |3       |1111         |Oilseed and Grain Farming                 |111       |
|15    |111160   |11116|Rice Farming                              |14      |11116        |Rice Farming                              |1111      |
|16    |11119    |1111 |Other Grain Farming                       |3       |1111         |Oilseed and Grain Farming                 |111       |
|17    |111191   |11119|Oilseed and Grain Combination Farming     |16      |11119        |Other Grain Farming                       |1111      |
|18    |111199   |11119|All Other Grain Farming                   |16      |11119        |Other Grain Farming                       |1111      |
|19    |1112     |111  |Vegetable and Melon Farming               |2       |111          |Crop Production                           |11        |
|20    |11121    |1112 |Vegetable and Melon Farming               |19      |1112         |Vegetable and Melon Farming               |111       |
+------+---------+-----+------------------------------------------+--------+-------------+------------------------------------------+----------+
only showing top 20 rows

Creating Mutations from Apache Spark

We can use the Dataset API to form RDF triples that can be loaded into Dgraph.

import org.apache.spark.sql.functions.{array, concat, explode, lit}

dfJoined.select(explode(array(
  concat(lit("_:C"), $"ChildCode", lit(" <naics.code> \""), $"ChildCode", lit("\" .")),
  concat(lit("_:C"), $"ChildCode", lit(" <naics.title> \""), $"childTitle", lit("\" .")),
  concat(lit("_:C"), $"ChildCode", lit(" <naics.parent> _:C"), $"NAICS_US_CODE", lit(" ."))
)).as("triple"))
  .where($"triple".isNotNull)  // concat yields null when a column is null, filtering incomplete triples
  .write
  .text("naicsConvertedTriples.txt")  // Spark writes a directory of part files under this path

The first few lines of the output are as follows. These mutations can then be applied directly in Ratel.

_:C111 <naics.code> "111" .
_:C111 <naics.title> "Crop Production" .
_:C111 <naics.parent> _:C11 .

_:C1111 <naics.code> "1111" .
_:C1111 <naics.title> "Oilseed and Grain Farming" .
_:C1111 <naics.parent> _:C111 .
...(truncated)

Import into Dgraph using Ratel

As mentioned above, we can apply the mutations directly to Dgraph via Ratel. A part of the mutation is shown below.

{
  set{
    _:C11 <naics.code> "11" .
    _:C11 <naics.title> "Agriculture, Forestry, Fishing and Hunting" .
    _:C11 <naics.parent> _:Cnull .
    _:C111 <naics.code> "111" .
    _:C111 <naics.title> "Crop Production" .
    _:C111 <naics.parent> _:C11 .
    _:C1111 <naics.code> "1111" .
    _:C1111 <naics.title> "Oilseed and Grain Farming" .
    _:C1111 <naics.parent> _:C111 .
    _:C11111 <naics.code> "11111" .
    _:C11111 <naics.title> "Soybean Farming" .
    …(truncated)…
  }
}
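For larger files, pasting into Ratel becomes unwieldy; the same RDF can alternatively be posted to Dgraph Alpha’s HTTP /mutate endpoint. A sketch, assuming Alpha on localhost:8080 and the Spark part files already concatenated into a single file named naics_triples.rdf (both names are illustrative):

```shell
# Wrap the generated triples in a set block and post them to /mutate.
# naics_triples.rdf is assumed to hold the concatenated Spark output.
{ echo '{ set {'; cat naics_triples.rdf; echo '} }'; } \
  | curl -s 'localhost:8080/mutate?commitNow=true' \
      -H 'Content-Type: application/rdf' --data-binary @-
```

Dgraph’s live loader is another option for bulk-loading RDF files of this kind.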

We have now transformed the source file and these triples can be uploaded into Dgraph! We can now move on and explore the created hierarchy in Dgraph.

Exploration in Dgraph

Viewing the Hierarchy

Let’s explore the categories under “Fruit and Tree Nut Farming” via the following query. The @recurse directive, combined with the reverse edge ~naics.parent, walks down the hierarchy from the matched node.

{
  naics(func: eq(naics.title,"Fruit and Tree Nut Farming")) @recurse{
    naics.code
    name: naics.title
    child: ~naics.parent
  }
}
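Traversal in the opposite direction, from a node up to its ancestors, follows the naics.parent edge directly. For example, walking up from “All Other Grain Farming” (code 111199):

{
  up(func: eq(naics.code, "111199")) @recurse {
    naics.code
    naics.title
    naics.parent
  }
}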

Summary

In this showcase, we saw how to transform the NAICS hierarchy and load it into Dgraph. Annotating your product database with the appropriate NAICS hierarchy node improves visibility and aids several business processes, such as search, analytics, and so on.

Special thanks to @EnricoMi for his feedback on this showcase. Do check out @EnricoMi’s Spark data connector for Dgraph. This connector allows you to extract data from Dgraph directly into Apache Spark, helping you execute deep analytics there.
