Recent advances in language models from OpenAI provide interesting tools to make your data smarter and to automate tedious tasks such as data classification.
In this post we will show you how to use word embeddings in Dgraph to make your data smarter and classify it automatically. You can try it yourself in a few minutes, as Dgraph is also a fantastic data platform for rapid application development. For an overview, refer to the following video.
Semantic and word embeddings
Word embedding is an AI technique that represents words and sentences as vectors in a high-dimensional vector space. Large Language Models (LLMs) encode words and other terms into vectors based on their context in sentences, learned from a massive training corpus. Following the distributional hypothesis, which states that "words which frequently appear in similar contexts have similar meaning", words with similar meaning tend to have a similar "position" in the vector space.
With this technique, we can turn the question "do these two words or two sentences have the same meaning?" into computing a vector similarity!
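To make this concrete, here is a minimal sketch of cosine similarity in plain JavaScript. The three-dimensional vectors are made-up toy values, not real embeddings (real ones have hundreds or thousands of dimensions), and the function name is ours; only the formula is the point.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (hypothetical values for illustration only).
const cat = [0.9, 0.1, 0.0];
const kitten = [0.8, 0.2, 0.1];
const car = [0.0, 0.2, 0.9];

// "cat" is closer to "kitten" than to "car"
console.log(cosineSimilarity(cat, kitten) > cosineSimilarity(cat, car)); // true
```

A value close to 1 means the two vectors point in nearly the same direction, which we read as "similar meaning".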
GPT models have democratized the usage of Large Language Models (LLMs): they are pre-trained (the P in Generative Pre-trained Transformer), so you can use the models without going through the tedious process of creating them. Moreover, "using a model" can be as simple as invoking a REST service.
Note: when using a model you have not trained yourself, always look at the model quality, what it is supposed to be good at, and its potential training bias. For example, you should be aware of the social bias of OpenAI embedding models.
Classification use case
Let's consider the data model we used in the Rapid Application Development blog post.
In this use case about donations to projects issued by public schools in the US, projects such as "Photography and Memories….Yearbook in the Works" or "Fund a Much Needed Acid Cabinet & Save Us from Corrosion!" have categories ("Music & The Arts", "Math & Science", …).
The project category is usually selected by the teacher when creating the project. We want to remove this step and have the category inferred automatically when a project is created.
Use Dgraph, GraphQL API, and Lambda webhook to implement auto classification
Hands on
If you want to try this out yourself, the easiest way to get Dgraph up and running is to sign up for Dgraph Cloud and launch a free backend.
Data model
We will focus on Project and Category so we can work on a simplified model to experiment with auto-classification.
We will tell Dgraph that we want to run some specific logic when a Project or a Category is added. We are using the custom directive @lambdaOnMutate:
type Project @lambdaOnMutate(add: true, update: false, delete: false) {
  id: ID!
  title: String! @search(by: [term])
  grade: String @search(by: [hash])
  category: Category
}
type Category @lambdaOnMutate(add: true, update: false, delete: false) {
  id: ID!
  name: String!
}
Copy this schema into the Dgraph Cloud dashboard and deploy it.
Dgraph automatically creates a GraphQL API to update and query Projects and Categories, and the @lambdaOnMutate directive means you can add code that runs whenever there is an update (mutation). You will soon add the LLM integration in the "Dgraph lambda" step below.
The API is up and running! We can test it without the auto-classification.
Let's create a test project by running a GraphQL mutation. Copy the request into the GraphQL explorer and run it.
mutation MyMutation {
  addProject(input: {title: "Dgraph & OpenAI integration"}) {
    project {
      id
    }
  }
}
Access the Data Studio view to verify that we have one project created. It has a title but no category.
Project created without category
OpenAI integration
OpenAI API
- Go to OpenAI's Platform website and sign in.
- Click your profile icon at the top-right corner of the page and select "View API Keys."
- Click "Create New Secret Key" to generate a new API key.
Keep a copy of the key, as we will use it to invoke the OpenAI REST API from Dgraph.
Dgraph lambda
Dgraph lambda provides a way to write custom logic in JavaScript, integrate it with your GraphQL schema, and execute it using the GraphQL API.
We are using webhooks, which are a specific type of lambda executed asynchronously after an add, delete, or update operation. Using the @lambdaOnMutate directive, we have already declared a lambda webhook on add operations for Category and Project in the GraphQL schema.
Now we need to write the JS code and add it to the GraphQL resolvers.
Auto classification logic
The auto classification logic is simple:
- Every time a Category is created, compute an embedding for the category name and associate it with the Category node using the embedding predicate.
- Every time a Project is created,
  - compute an embedding for the project title
  - retrieve all categories' embeddings and compute the cosine similarity between the title's embedding and each category's embedding.
  - Store each cosine similarity as a relationship called similarity between the project and the category, with a cosine property added as a facet on the relationship indicating how similar it is.
  - Use the most similar category to create a relationship Project.category as expected in the GraphQL Schema.
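The core of the Project step can be sketched as a small pure function, independent of Dgraph. This is only an illustration under assumptions: the `{ uid, name, vector }` shape and the function names are hypothetical, and the two-dimensional unit vectors stand in for real embeddings.

```javascript
// Dot product; equals cosine similarity when both vectors are unit length.
function dot(v, w) {
  return v.reduce((sum, x, i) => sum + x * w[i], 0);
}

// Rank categories by similarity to the title vector, most similar first.
// `categories` is an array of { uid, name, vector } (hypothetical shape).
function rankCategories(titleVector, categories) {
  return categories
    .map((c) => ({ uid: c.uid, name: c.name, cosine: dot(titleVector, c.vector) }))
    .sort((a, b) => b.cosine - a.cosine);
}

// Toy unit vectors standing in for real embeddings.
const categories = [
  { uid: "0x1", name: "Math & Science", vector: [1, 0] },
  { uid: "0x2", name: "Music & The Arts", vector: [0, 1] },
];
const ranked = rankCategories([0.8, 0.6], categories);
console.log(ranked[0].name); // "Math & Science"
```

The first element of the ranking is the category to link; the full ranking is what gets stored as similarity relationships.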
We are using interesting Dgraph features here:
- The logic in a webhook can update the graph.
- The embedding and similarity relationships are not declared in the GraphQL schema. We are creating those relationships with the Dgraph Query Language (DQL). That means that, in Dgraph, you can easily add "metadata" or any type of information "on top" of a graph generated by the GraphQL API.
- We are saving the similarity with all Categories for a project. This is done to show that you can also save information to help in your logic: if a Category is removed, we can find the "next" closest Category without redoing queries to OpenAI, saving time and money. The case of deletion is not covered in this blog post.
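As a rough illustration of the last point, here is how the "next" closest Category could be picked from similarities that are already stored. This is a sketch, not part of the lambda in this post: it assumes the similarity edges of one project have already been read into a plain array of `{ uid, cosine }` objects, and the function name is hypothetical.

```javascript
// `similarities` is a hypothetical in-memory copy of the <similarity> edges
// of one project: [{ uid, cosine }, ...].
function nextClosestCategory(similarities, removedUid) {
  let best = null;
  for (const s of similarities) {
    if (s.uid === removedUid) continue; // skip the deleted category
    if (best === null || s.cosine > best.cosine) best = s;
  }
  return best; // null when no other category is left
}

// Made-up scores for illustration.
const stored = [
  { uid: "0x1", cosine: 0.83 },
  { uid: "0x2", cosine: 0.79 },
  { uid: "0x3", cosine: 0.74 },
];
console.log(nextClosestCategory(stored, "0x1").uid); // "0x2"
```

No call to OpenAI is needed here, because the scores were computed once when the project was created.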
DQL predicates
In our logic we are adding some information to the graph in the form of the predicates embedding and similarity.
Refer to the DQL Schema section of the documentation.
We need to declare those two predicates: access the DQL Schema tab in your Dgraph Cloud dashboard, and click Add predicate.
Add the predicate embedding of type String:
Do the same for the predicate similarity of type uid and declare it as a list:
adding the embedding predicate
Lambda code
Here is the complete code of our lambda.
Copy this code and set your OpenAI API key in the "Authorization": "Bearer " line.
Paste the code with your OpenAI API key into the Script section of the Dgraph lambda configuration and save.
function dotProduct(v, w) {
  // OpenAI embedding vectors are normalized to unit length,
  // so the dot product equals the cosine similarity
  return v.reduce((l, r, i) => l + r * w[i], 0)
}
async function mutateRDF(dql, rdfs) {
  // skip the mutation when there is nothing to write
  if (rdfs !== "") {
    return dql.mutate(`{
      set {
        ${rdfs}
      }
    }`)
  }
}
async function embedding(text) {
  let url = `https://api.openai.com/v1/embeddings`;
  let response = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer <--- replace by your OpenAI API key ----->"
    },
    // JSON.stringify escapes quotes and line breaks in the text
    body: JSON.stringify({ input: text, model: "text-embedding-ada-002" })
  })
  let data = await response.json();
  console.log(`embedding = ${data.data[0].embedding}`)
  return data.data[0].embedding;
}
async function addProjectWebhook({event, dql, graphql, authHeader}) {
  // load every category with its stored embedding
  const categoriesData = await dql.query(`{
    categories(func:type(Category)) {
      uid
      name:Category.name
      embedding
    }
  }`)
  for (let c of categoriesData.data.categories) {
    // embeddings are stored as JSON strings
    c.vector = JSON.parse(c.embedding);
  }
  var rdfs = "";
  for (let i = 0; i < event.add.rootUIDs.length; ++i) {
    console.log(`adding embedding for Project ${event.add.rootUIDs[i]} ${event.add.input[i]['title']}`)
    var uid = event.add.rootUIDs[i];
    const v1 = await embedding(event.add.input[i].title);
    const serialized = JSON.stringify(v1);
    if (event.add.input[i]['category'] == undefined) {
      // no category provided: find the most similar one
      let category = "";
      let max = 0.0;
      let similarityMutation = "";
      for (let c of categoriesData.data.categories) {
        const similarity = dotProduct(v1, c.vector);
        // one <similarity> edge per category, with the score as a facet
        similarityMutation += `<${uid}> <similarity> <${c.uid}> (cosine=${similarity}) .\n`;
        if (similarity > max) {
          category = c.uid;
          max = similarity;
        }
      }
      console.log(`set closest category`)
      rdfs += `${similarityMutation}
<${uid}> <embedding> "${serialized}" .
<${uid}> <Project.category> <${category}> .
`;
    } else {
      console.log(`Project ${event.add.rootUIDs[i]} added with category ${event.add.input[i]['category'].name}`)
      rdfs += `<${uid}> <embedding> "${serialized}" .
`;
    }
  }
  await mutateRDF(dql, rdfs);
}
async function addCategoryWebhook({event, dql, graphql, authHeader}) {
  var rdfs = "";
  // webhook may receive an array of UIDs
  // apply the same logic for each node
  for (let i = 0; i < event.add.rootUIDs.length; ++i) {
    console.log(`adding embedding for ${event.add.rootUIDs[i]} ${event.add.input[i]['name']}`)
    const uid = event.add.rootUIDs[i];
    // retrieve the embedding for the category name
    const data = await embedding(event.add.input[i]['name']);
    const serialized = JSON.stringify(data);
    // create a triple to associate the embedding to the category using the predicate <embedding>
    rdfs += `<${uid}> <embedding> "${serialized}" .
`;
  }
  // use a single mutation to save all the embeddings
  await mutateRDF(dql, rdfs);
}
self.addWebHookResolvers({
  "Project.add": addProjectWebhook,
  "Category.add": addCategoryWebhook
})
Let's examine what this code is doing:
self.addWebHookResolvers({
  "Project.add": addProjectWebhook,
  "Category.add": addCategoryWebhook
})
Registers the JS function for each operation declared in the @lambdaOnMutate directives.
function dotProduct(v, w) {
  // OpenAI embedding vectors are normalized to unit length,
  // so the dot product equals the cosine similarity
  return v.reduce((l, r, i) => l + r * w[i], 0)
}
An elegant way to compute a dot product!
async function embedding(text) {
  let url = `https://api.openai.com/v1/embeddings`;
  let response = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer <--- replace by your OpenAI API key ----->"
    },
    // JSON.stringify escapes quotes and line breaks in the text
    body: JSON.stringify({ input: text, model: "text-embedding-ada-002" })
  })
  let data = await response.json();
  console.log(`embedding = ${data.data[0].embedding}`)
  return data.data[0].embedding;
}
Retrieves the embedding for a given text using the OpenAI /v1/embeddings API and the text-embedding-ada-002 model.
That's where you have to set your OpenAI API key.
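Project titles can contain double quotes or line breaks, so the request body is best produced with JSON.stringify rather than by interpolating the text into a JSON string. The following sketch separates the request construction and the response check from the network call so that both can be tried without an API key. The helper names `buildEmbeddingRequest` and `extractEmbedding` and the key `sk-hypothetical` are ours, not part of any library.

```javascript
// Build the fetch() options for the embeddings call.
// JSON.stringify escapes quotes, backslashes and newlines in `text`.
function buildEmbeddingRequest(text, apiKey) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ input: text, model: "text-embedding-ada-002" }),
  };
}

// Extract the vector, failing loudly when the payload has no data array
// (for example when the API returns an error object instead).
function extractEmbedding(payload) {
  if (!payload || !Array.isArray(payload.data) || payload.data.length === 0) {
    throw new Error(`unexpected embeddings response: ${JSON.stringify(payload)}`);
  }
  return payload.data[0].embedding;
}

const request = buildEmbeddingRequest('The "Yearbook" project', "sk-hypothetical");
console.log(JSON.parse(request.body).input); // The "Yearbook" project
```

Inside the lambda, these would be used as `extractEmbedding(await (await fetch(url, buildEmbeddingRequest(text, key))).json())`.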
async function mutateRDF(dql, rdfs) {
  if (rdfs !== "") {
    return dql.mutate(`{
      set {
        ${rdfs}
      }
    }`)
  }
}
This is a helper function that executes a mutation to save the provided RDF triples. We are using dql, which is a helper object provided to the webhook.
async function addCategoryWebhook({event, dql, graphql, authHeader}) {
  var rdfs = "";
  // webhook may receive an array of UIDs
  // apply the same logic for each node
  for (let i = 0; i < event.add.rootUIDs.length; ++i) {
    console.log(`adding embedding for ${event.add.rootUIDs[i]} ${event.add.input[i]['name']}`)
    const uid = event.add.rootUIDs[i];
    // retrieve the embedding for the category name
    const data = await embedding(event.add.input[i]['name']);
    const serialized = JSON.stringify(data);
    // create a triple to associate the embedding to the category using the predicate <embedding>
    rdfs += `<${uid}> <embedding> "${serialized}" .
`;
  }
  // use a single mutation to save all the embeddings
  await mutateRDF(dql, rdfs);
}
addCategoryWebhook applies our logic when a new Category is added. The webhook may be invoked with an array of add events. We simply compute the embedding for each Category name added and create an RDF triple to save this information.
addProjectWebhook computes the embedding of the project title and its similarity to all the categories, then sets the project's category to the most similar one.
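To see what addProjectWebhook writes, the triple generation can be isolated in a pure function. The sketch below is a variation, not the code of this post: the name `buildProjectTriples` and the `{ uid, vector }` shape are hypothetical, and unlike the lambda above it only emits the Project.category triple when at least one category exists, which avoids an empty `<>` node reference.

```javascript
// Build the RDF triples for one project. `categories` is [{ uid, vector }].
function buildProjectTriples(projectUid, titleVector, categories) {
  const dot = (v, w) => v.reduce((sum, x, i) => sum + x * w[i], 0);
  // the embedding is stored as a JSON string
  const lines = [`<${projectUid}> <embedding> "${JSON.stringify(titleVector)}" .`];
  let best = null;
  for (const c of categories) {
    const cosine = dot(titleVector, c.vector);
    // one <similarity> edge per category, with the score as a facet
    lines.push(`<${projectUid}> <similarity> <${c.uid}> (cosine=${cosine}) .`);
    if (best === null || cosine > best.cosine) best = { uid: c.uid, cosine };
  }
  // only link a category when at least one exists
  if (best !== null) {
    lines.push(`<${projectUid}> <Project.category> <${best.uid}> .`);
  }
  return lines.join("\n");
}

// Toy unit vectors standing in for real embeddings.
console.log(buildProjectTriples("0x10", [1, 0], [
  { uid: "0x1", vector: [1, 0] },
  { uid: "0x2", vector: [0, 1] },
]));
```

The printed text is what would be placed inside the `set { ... }` block of the mutation.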
Testing
We can now add some categories using the GraphQL API generated by Dgraph from the GraphQL schema.
Doing so, Dgraph will automatically associate a semantic representation (the embedding) with the new categories.
You can use any GraphQL client with the GraphQL endpoint found on the Cloud dashboard.
Here we are just using the GraphQL explorer: paste the following mutation and run it:
mutation addCategory($name: String!) {
  addCategory(input: {name: $name}) {
    category {
      id
      name
    }
  }
}
Paste the following JSON in the variables section
{"name":"Math & Science"}
Re-run the mutation for different category names
- Music & The Arts
- Health & Sports
- History & Civics
- Literacy & Language
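If you prefer a script over re-running the mutation by hand, the requests can be prepared as follows. This is a sketch under assumptions: the helper name `buildAddCategoryRequest` is hypothetical, the endpoint in the comment is a placeholder for the GraphQL endpoint shown on your Cloud dashboard, and your backend may require an additional authentication header.

```javascript
const ADD_CATEGORY = `
mutation addCategory($name: String!) {
  addCategory(input: {name: $name}) {
    category { id name }
  }
}`;

// Build fetch() options for one addCategory call
// (standard GraphQL-over-HTTP body: { query, variables }).
function buildAddCategoryRequest(name) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: ADD_CATEGORY, variables: { name } }),
  };
}

const names = ["Music & The Arts", "Health & Sports", "History & Civics", "Literacy & Language"];
const requests = names.map(buildAddCategoryRequest);

// To send them, replace the placeholder with your own GraphQL endpoint:
// for (const r of requests) await fetch("https://<your-backend>/graphql", r);
console.log(requests.length); // 4
```

Each request triggers the Category webhook, so every new category gets its embedding.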
We can now verify that Dgraph has added embedding information to every category. embedding is not exposed through our GraphQL schema, so we must use Dgraph Query Language (DQL) to directly read the database.
Copy and paste the following DQL query into the DQL section of the dashboard, and execute it.
{
  categories(func:type(Category)) {
    uid
    name:Category.name
    embedding
  }
}
You can see that each category has an embedding.
Returning to the GraphQL explorer, paste the following mutation and run it to create a Project
mutation AddProject($title: String!) {
  addProject(input: {title: $title}) {
    project {
      id
    }
  }
}
with the variables
{"title":"Fund a Much Needed Acid Cabinet & Save Us from Corrosion!"}
In the Data Studio you can see that your project has been created and that it has a Category.
The Category automatically selected for this project is "Math & Science" in our case.
You can also run a GraphQL query
query MyQuery {
  queryProject(first: 10) {
    title
    category {
      name
    }
  }
}
Conclusion
With the ease of use of GraphQL API generation and the power of JavaScript custom resolvers (Dgraph lambda), boosting your graph data with AI is an easy task in Dgraph.
In this blog post, we showed how to use the OpenAI API to compute word embeddings when data is added to Dgraph, how to use the embeddings to evaluate the semantic similarity between a project's title and a category's name, and finally how to automatically create the correct relationships between projects and categories.
Photo by Pixabay
This is a companion discussion topic for the original entry at https://dgraph.io/blog/post/20230602-ai-classification/