Boost your data with LLMs and OpenAI embeddings - Dgraph Blog

Dgraph stores data as you think about it: a network of interconnected pieces of information, a Graph!

Recent advances in language models from OpenAI give us interesting tools to make your data smarter and to automate tedious tasks such as data classification.

In this post we will show you how to use word embeddings in Dgraph to make your data smarter and classify it automatically. You can try it yourself in a few minutes, as Dgraph is also a fantastic data platform for Rapid Application development. For an overview, refer to the following video.

Semantic and word embeddings

Word embedding is an AI technique that represents words and sentences as vectors in a high-dimensional space. Large Language Models (LLMs) encode words and other terms into vectors according to their context in sentences, as learned from a massive training corpus. Following the distributional hypothesis, which states that “Words which frequently appear in similar contexts have similar meaning”, words with similar meaning tend to have a similar ‘position’ in the vector space.

With this technique, we can turn the question “do those two words or two sentences have the same meaning?” into computing a vector similarity!
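
As a minimal sketch of what "computing a vector similarity" means, the snippet below computes the cosine similarity of small hand-made vectors. The vectors and the names `cat`, `kitten` and `invoice` are illustrative assumptions, not real embeddings, which have far more dimensions.

```javascript
// Cosine similarity between two vectors of the same length.
function cosineSimilarity(v, w) {
  if (v.length !== w.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normV = 0;
  let normW = 0;
  for (let i = 0; i < v.length; i++) {
    dot += v[i] * w[i];
    normV += v[i] * v[i];
    normW += w[i] * w[i];
  }
  return dot / (Math.sqrt(normV) * Math.sqrt(normW));
}

// Hypothetical 3-dimensional vectors (assumption: made up for illustration).
const cat = [0.9, 0.1, 0.0];
const kitten = [0.8, 0.2, 0.1];
const invoice = [0.0, 0.1, 0.9];

// Vectors pointing in a similar direction score close to 1.
console.log(cosineSimilarity(cat, kitten) > cosineSimilarity(cat, invoice)); // true
```

A value close to 1 means the two vectors point in nearly the same direction, which is read as "similar meaning".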

GPT models have democratized the usage of Large Language Models (LLMs): they are pre-trained (the P in Generative Pre-trained Transformer), so you can use the models without going through the tedious process of creating them. Moreover, ‘using a model’ can be as simple as invoking a REST service.

Note: when using a model you have not trained yourself, always have a look at the model quality, what it is supposed to be good at, and the potential training bias. For example, you should be aware of the social bias of OpenAI embedding models.

Classification use case

Let’s consider the data model we have used in the Rapid Application development blog.

In this use case about donations to projects issued by public schools in the US, projects such as “Photography and Memories….Yearbook in the Works” or “Fund a Much Needed Acid Cabinet & Save Us from Corrosion!” have categories (“Music & The Arts”, “Math & Science”, …).

The project category is usually selected by the teacher when creating the project. We want to remove this step and have the category inferred automatically when a project is created.

Use Dgraph, GraphQL API, and Lambda webhook to implement auto classification

Hands on

If you want to try this out yourself, the easiest way to get Dgraph up and running is to signup for Dgraph Cloud and launch a free backend.

Data model

We will focus on Project and Category so we can work on a simplified model to experiment with auto-classification.

We will tell Dgraph that we want to do some specific logic when a Project or a Category is added. We are using the custom directive @lambdaOnMutate:

type Project @lambdaOnMutate(add: true, update: false, delete: false) {
  id: ID!
  title: String!  @search(by: [term])
  grade: String @search(by: [hash])
  category: Category
}
type Category @lambdaOnMutate(add: true, update: false, delete: false) {
  id: ID!
  name: String!
}

Copy this schema in the Dgraph Cloud dashboard and deploy it.

Dgraph automatically created a GraphQL API to update and query Projects and Categories, and the @lambdaOnMutate directive means you can add additional code to run whenever there is an update (mutation). You will soon add the LLM integration in the “Dgraph Lambda” step below.

The API is up and running! We can test it without the auto-classification.

Let’s create a “test” project by running a GraphQL mutation. Copy the request into the GraphQL explorer and run it.

mutation MyMutation {
  addProject(input: {title: "Dgraph & OpenAI integration"}) {
    project {
      id
    }
  }
}

Access the Data Studio view to verify that we have one project created. It has a title but no category.

Project created without category

OpenAI integration

OpenAI API

  • Go to OpenAI’s Platform website and sign in.
  • Click your profile icon at the top-right corner of the page and select “View API Keys.”
  • Click “Create New Secret Key” to generate a new API key.

Keep a copy of the key, as we will use it to invoke the OpenAI REST API from Dgraph.

Dgraph lambda

Dgraph lambda provides a way to write custom logic in JavaScript, integrate it with your GraphQL schema, and execute it using the GraphQL API.

We are using webhooks, which are a specific type of Lambda executed asynchronously after an add, delete, or update operation. Using the @lambdaOnMutate directive, we have already declared a lambda webhook on add operations for Category and Project in the GraphQL Schema.

Now we need to write the JS code, and add it to the GraphQL resolvers.

Auto classification logic

The auto classification logic is simple:

  • Every time a Category is created, compute an embedding for the category name and associate it with the Category node using the embedding predicate.
  • Every time a Project is created,
    • compute an embedding for the project title
    • retrieve all category’s embeddings and compute the cosine similarity between the title’s embedding and each category’s embedding.
    • Store each cosine similarity as a relationship called similarity between the project and the category, with a cosine property added as a facet on the relationship indicating how similar it is.
    • Use the most similar category to create a relationship Project.category as expected in the GraphQL Schema.
Finding the matching category
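
The selection step described above can be sketched in isolation as follows. This is a simplified illustration, not the lambda itself: the 2-dimensional unit vectors, the uids and the helper name `rankCategories` are assumptions made for the example.

```javascript
// Dot product; equals cosine similarity when both vectors have unit length.
function dotProduct(v, w) {
  return v.reduce((sum, x, i) => sum + x * w[i], 0);
}

// Returns every category with its similarity score, most similar first.
// (rankCategories is a hypothetical helper, not part of the lambda code.)
function rankCategories(titleVector, categories) {
  return categories
    .map((c) => ({ uid: c.uid, name: c.name, cosine: dotProduct(titleVector, c.vector) }))
    .sort((a, b) => b.cosine - a.cosine);
}

// Hypothetical unit-length vectors standing in for real embeddings.
const categories = [
  { uid: "0x1", name: "Math & Science", vector: [1, 0] },
  { uid: "0x2", name: "Music & The Arts", vector: [0, 1] },
];
const titleVector = [0.8, 0.6]; // unit length: 0.64 + 0.36 = 1

const ranked = rankCategories(titleVector, categories);
console.log(ranked[0].name); // "Math & Science"
```

Keeping the full ranked list, rather than only the best match, corresponds to storing one similarity relationship per category.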

We are using interesting Dgraph features here:

  • The logic in a webhook can update the graph.
  • The embedding and similarity relationships are not declared in the GraphQL schema. We are creating those relationships with Dgraph Query Language. That means, in Dgraph, you can easily add “meta-data” or any type of information “on-top” of a graph generated by the GraphQL API.
  • We are saving the similarity with all Categories for a project. This is done to show that you can also save information to help in your logic: if a Category is removed, we can find the ’next’ closest Category without redoing queries to OpenAI, saving time and money. The case of deletion is not covered in this blog.
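
To illustrate the last point, here is a minimal sketch of how stored similarities could be reused when a Category is removed. The array shape (one entry per similarity relationship, with its cosine facet) and the helper name `nextClosestCategory` are assumptions; reading the relationships back from Dgraph is not shown.

```javascript
// Picks the best remaining category from pre-computed similarities,
// ignoring the category that was deleted. No call to OpenAI is needed.
function nextClosestCategory(similarities, deletedUid) {
  let best = null;
  for (const s of similarities) {
    if (s.uid === deletedUid) continue;
    if (best === null || s.cosine > best.cosine) best = s;
  }
  return best; // null when no other category is left
}

// Hypothetical stored similarities for one project.
const stored = [
  { uid: "0x1", cosine: 0.83 },
  { uid: "0x2", cosine: 0.79 },
  { uid: "0x3", cosine: 0.75 },
];

console.log(nextClosestCategory(stored, "0x1").uid); // "0x2"
```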

DQL predicates

In our logic we are adding some information in the graph in the form of predicates embedding and similarity.

Refer to the DQL Schema section of the documentation.

We need to declare those two predicates: access the DQL Schema tab in your Dgraph Cloud dashboard, and click Add predicate. Add the predicate embedding of type String:

adding the embedding predicate

Do the same for the predicate similarity of type uid and declare it as a list:

adding the similarity predicate

Lambda code

Here is the complete code of our lambda.

Copy this code and set your OpenAI API key in the "Authorization": "Bearer " line.

Paste the code with your OpenAI API key in the Script section of the Dgraph lambda configuration and save.

function dotProduct(v,w) {
   return v.reduce((l,r,i)=>l+r*w[i],0)
   // as OpenAI embedding vectors are normalized
   // dot product = cosine similarity
}
async function mutateRDF(dql,rdfs) {
  // run a set mutation only when there is at least one triple to save
  if (rdfs !== "") {
        return dql.mutate(`{
                set {
                    ${rdfs}
                }
            }`)
    }
}
async function embedding(text) {
  let url = `https://api.openai.com/v1/embeddings`;
  let response = await fetch(url,{
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer <--- replace by your OpenAI API key ----->"
    },
    // JSON.stringify escapes quotes and newlines that may appear in the text
    body: JSON.stringify({ input: text, model: "text-embedding-ada-002" })
  })
  let data = await response.json();
  console.log(`embedding = ${data.data[0].embedding}`)
  return data.data[0].embedding;
}
async function addProjectWebhook({event, dql, graphql, authHeader}) {
  
  const categoriesData = await dql.query(`{ 
        categories(func:type(Category))   {
          uid 
          name:Category.name
          embedding
        }
      }`)
  for (let c of categoriesData.data.categories ) {
       c.vector = JSON.parse(c.embedding);
  }
  var rdfs = "";
  for (let i = 0; i < event.add.rootUIDs.length; ++i ) {
    console.log(`adding embedding for Project ${event.add.rootUIDs[i]} ${event.add.input[i]['title']}`)
    var uid = event.add.rootUIDs[i];
    const v1 = await embedding(event.add.input[i].title);
    const serialized = JSON.stringify(v1);
    if  (event.add.input[i]['category'] == undefined) { 
       
      let category="";
      let max = 0.0;
      let similarityMutation = "";
      for (let c of categoriesData.data.categories ) {
        const similarity = dotProduct(v1,c.vector);
        similarityMutation += `<${uid}>  <similarity> <${c.uid}> (cosine=${similarity}) .\n`;
        if (similarity > max) {
          category = c.uid;
          max = similarity;
        }
      }
      console.log(`set closest category`) 
      rdfs += `${similarityMutation}
              <${uid}>  <embedding> "${serialized}" .
              <${uid}> <Project.category> <${category}> .
                `;
    } else {
      console.log(`Project ${event.add.rootUIDs[i]} added with category ${event.add.input[i]['category'].name}`)
      rdfs += `<${uid}>  <embedding> "${serialized}" .
                `;
    }
  }
  await mutateRDF(dql,rdfs);  
  
}
async function addCategoryWebhook({event, dql, graphql, authHeader}) {
    var rdfs = "";
    // webhook may receive an array of UIDs
    // apply the same logic for each node
    for (let i = 0; i < event.add.rootUIDs.length; ++i ) {
        console.log(`adding embedding for ${event.add.rootUIDs[i]} ${event.add.input[i]['name']}`)
        const uid = event.add.rootUIDs[i];
        // retrieve the embedding for the category name
        const data = await embedding(event.add.input[i]['name']);
        const serialized = JSON.stringify(data);
        // create a triple to associate the embedding to the category using the predicate <embedding>
        rdfs += `<${uid}>  <embedding> "${serialized}" .
        `;
    }
    // use a single mutation to save all the embeddings 
    await mutateRDF(dql,rdfs); 
}
self.addWebHookResolvers({
   "Project.add": addProjectWebhook,
   "Category.add": addCategoryWebhook
})
   

Let’s examine what this code is doing:

self.addWebHookResolvers({
   "Project.add": addProjectWebhook,
   "Category.add": addCategoryWebhook
})

Registers the JS functions for each operation declared in the @lambdaOnMutate directives.

function dotProduct(v,w) {
   return v.reduce((l,r,i)=>l+r*w[i],0)
   // as OpenAI embedding vectors are normalized
   // dot product = cosine similarity
}

An elegant way to compute a dot product!

async function embedding(text) {
  let url = `https://api.openai.com/v1/embeddings`;
  let response = await fetch(url,{
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer <--- replace by your OpenAI API key ----->"
    },
    // JSON.stringify escapes quotes and newlines that may appear in the text
    body: JSON.stringify({ input: text, model: "text-embedding-ada-002" })
  })
  let data = await response.json();
  console.log(`embedding = ${data.data[0].embedding}`)
  return data.data[0].embedding;
}

Retrieves the embedding for a given text using the OpenAI /v1/embeddings API and the text-embedding-ada-002 model.

That’s where you have to set your OpenAI API key.
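
A possible hardening of this step is sketched below: the request is built with JSON.stringify so that titles containing quotes or newlines still produce valid JSON, and the response shape is checked before it is read. The helper names `buildEmbeddingRequest` and `extractEmbedding` and the placeholder key are assumptions for illustration; the URL, model name and `data.data[0].embedding` path come from the code above.

```javascript
// Builds the URL and fetch options for an embeddings call (hypothetical helper).
function buildEmbeddingRequest(text, apiKey) {
  return {
    url: "https://api.openai.com/v1/embeddings",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${apiKey}`,
      },
      // JSON.stringify escapes quotes, backslashes and newlines in the text
      body: JSON.stringify({ input: text, model: "text-embedding-ada-002" }),
    },
  };
}

// Reads the embedding from a parsed response, failing loudly on an
// unexpected payload (for example an error object) instead of a TypeError.
function extractEmbedding(data) {
  if (!data || !Array.isArray(data.data) || data.data.length === 0) {
    throw new Error(`unexpected embeddings response: ${JSON.stringify(data)}`);
  }
  return data.data[0].embedding;
}

const req = buildEmbeddingRequest('Fund a "Much Needed" Acid Cabinet', "sk-placeholder");
console.log(JSON.parse(req.options.body).input); // Fund a "Much Needed" Acid Cabinet

// Inside the lambda, the two helpers would be combined roughly like this:
//   const response = await fetch(req.url, req.options);
//   const vector = extractEmbedding(await response.json());
```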

async function mutateRDF(dql,rdfs) {
  if (rdfs !== "") {
        return dql.mutate(`{
                set {
                    ${rdfs}
                }
            }`)
    }
}

This is a helper function that executes a mutation and saves the provided RDFs. We are using dql, which is a helper object provided to the webhook.
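
To make the RDF format concrete, the sketch below builds the two kinds of triples that the webhooks pass to mutateRDF. The helper names `embeddingTriple` and `similarityTriple` and the uids are assumptions for illustration; the triple layout follows the lambda code above.

```javascript
// Triple storing a vector as a JSON string in the <embedding> predicate.
function embeddingTriple(uid, vector) {
  return `<${uid}> <embedding> "${JSON.stringify(vector)}" .`;
}

// Triple for a similarity relationship carrying a cosine facet.
function similarityTriple(projectUid, categoryUid, cosine) {
  return `<${projectUid}> <similarity> <${categoryUid}> (cosine=${cosine}) .`;
}

// Hypothetical uids and values.
const rdfs = [
  embeddingTriple("0x2a", [0.1, -0.2]),
  similarityTriple("0x2a", "0x1", 0.83),
].join("\n");

console.log(rdfs);
// <0x2a> <embedding> "[0.1,-0.2]" .
// <0x2a> <similarity> <0x1> (cosine=0.83) .
```

Because the vector is stored as a JSON string, reading it back only requires JSON.parse, as addProjectWebhook does for each category.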

async function addCategoryWebhook({event, dql, graphql, authHeader}) {
    var rdfs = "";
    // webhook may receive an array of UIDs
    // apply the same logic for each node
    for (let i = 0; i < event.add.rootUIDs.length; ++i ) {
        console.log(`adding embedding for ${event.add.rootUIDs[i]} ${event.add.input[i]['name']}`)
        const uid = event.add.rootUIDs[i];
        // retrieve the embedding for the category name
        const data = await embedding(event.add.input[i]['name']);
        const serialized = JSON.stringify(data);
        // create a triple to associate the embedding to the category using the predicate <embedding>
        rdfs += `<${uid}>  <embedding> "${serialized}" .
        `;
    }
    // use a single mutation to save all the embeddings 
    await mutateRDF(dql,rdfs); 
}

addCategoryWebhook applies our logic when a new Category is added. The Webhook may be invoked with an array of add events. We simply compute the embedding for each Category name added and create an RDF to save this information.

addProjectWebhook computes the embedding of the project title and its similarity to every category, then sets the project’s category to the most similar one.

Testing

We can now add some categories using the GraphQL API generated by Dgraph from the GraphQL schema.

Doing so, Dgraph will automatically associate a semantic representation (the embedding) to the new categories.

You can use any GraphQL client with the GraphQL endpoint found on the Cloud dashboard.

Here we simply use the GraphQL explorer. Paste the following mutation and run it:

mutation addCategory($name: String!) {
  addCategory(input: {name: $name}) {
    category {
      id
      name
    }
  }
}

Paste the following JSON in the variables section

{"name":"Math & Science"}
adding a category

Re-run the mutation for different category names

  • Music & The Arts
  • Health & Sports
  • History & Civics
  • Literacy & Language

We can now verify that Dgraph has added embedding information to every category. embedding is not exposed through our GraphQL schema, so we must use Dgraph Query Language (DQL) to directly read the database.

Copy-paste the following DQL query in the DQL section of the dashboard, and execute it.

{ 
   categories(func:type(Category)) {
      uid
      name:Category.name
      embedding
   }
}
categories have an embedding

You can see that each category has an embedding.

Returning to the GraphQL explorer, paste the following mutation and run it to create a Project

mutation AddProject($title: String!) {
  addProject(input: {title: $title}) {
    project {
      id
    }
  }
}

with the variables

{"title":"Fund a Much Needed Acid Cabinet & Save Us from Corrosion!"}
add a project

In the Data Studio you can see that your project has been created and that it has a Category.

The Category automatically selected for this project is “Math & Science” in our case.

You can also run a GraphQL query

query MyQuery {
  queryProject(first: 10) {
    title
    category {
      name
    }
  }
}
projects with automatically associated category

Conclusion

With the ease of use of GraphQL API generation and the power of JavaScript custom resolvers (Dgraph lambda), boosting your graph data with AI is an easy task in Dgraph.

In this blog, we showed how to use the OpenAI API to compute word embeddings when data is added to Dgraph, how to use the embeddings to evaluate the semantic similarity between a project’s title and a category’s name, and finally how to automatically create the correct relationships between projects and categories.

Photo by Pixabay


This is a companion discussion topic for the original entry at https://dgraph.io/blog/post/20230602-ai-classification/

Experiment yourself with a Python notebook and a video showing the Jupyter notebook in action.

Continue with the post Dgraph and Vector database - the best of two worlds.


Hi,

I really appreciate this demonstration! It’s been quite helpful.

However, I do have a couple of questions:

  1. I noticed that you set a facet for similarity in this example, but it doesn’t appear to be used. Would it be possible to remove it from the example and simplify the case?

  2. I’m curious about why you chose RDF mutations instead of GraphQL mutations for setting the embedding and category. Is there a specific reason behind this decision?

Thank you.

Thanks for the feedback.
I included facets to emphasize the fact that you can use attributes on relationships, which is not always known as some other graph DBs do not support this capability.
I’m not using the facets right now in this example, but I think it’s interesting to know that you can store the score or weight of the relationship. One of the use cases I have in mind is handling a “delete” event on a category. On a delete we could use the pre-calculated similarities with their scores to re-establish the new category for projects.

I’m using RDF mutation because the embedding is not exposed as a GraphQL attribute. It’s kind of a lower level information. I wanted to show that you can have a simple API exposed to your clients and manage more complex related data in DQL. The GraphQL - DQL interoperability is opening interesting use cases. We still have to set some safeguards though. We are working on it.

Thank you, that’s what I thought.

Could the facet also be used with a sort query in order to get the “most similar” categories of an item directly?

Facets are better suited to metadata that is not really needed much for filtering. They are not first-class citizens of the graph, so you lose performance if you try to use them for filtering.

Do you mean that it would be faster to store, let’s say, the weights directly as predicates of some edges and sort on those weight predicates instead of sorting using facets?

This conversation relates to this recent post. See my answer for an example of a sort filter using facets: List predicate order

Probably like everything else with Dgraph, “it depends on your dataset” for specifically what will be better. But in terms of generality with Dgraph, predicates will usually always be more performant than facets.
