Dgraph 24.0.0-alpha is now available on Github and DockerHub

Dgraph v24.0.0-alpha is now available for the community to try out support for the new vector data type, which enables semantic search.

Dgraph is adding vector support to combine graph data with embeddings, enhancing Graph-based applications and unlocking new AI capabilities. Core graph use cases like fraud detection, recommendations, and master data management can all be supercharged by vectors and embeddings. Graph+Vector is also a key technique used to reduce hallucinations within AI-augmented applications.

This release also includes some performance enhancements and maintenance bug fixes to improve the stability of the database engine.

Key highlights of the release include:

  • Support for a native vector type at the DQL level
  • Extend Liveloader to work with the vector type (Bulkloader will be available in GA)
  • Community contributed PRs:
    • #9030: Add support for Polish Language
    • #9047: Reduce x.ParsedKey memory allocation from 72 to 56 bytes
  • Dgraph/Badger fixes:
    • #9007: Fix deadlock occurring due to time-out
    • #2018: Reduce resource consumption on empty write transaction
  • Update to Golang v1.22 - performance and monitoring improvements
  • Upgraded Golang client
  • A number of CVE fixes

We are working towards a GA release candidate and expect it to be out in May. Dgraph v24 GA will also include GraphQL support for the vector data type and semantic search, a new caching approach that will boost performance of all applications, and a number of community PRs and maintenance fixes.

Note that this (alpha) release is not available on Dgraph Cloud, but the GA release will be released for both on-premise and Dgraph Cloud options. The release binaries and release notes are now available on GitHub. The docker images for dgraph/dgraph and dgraph/standalone are available on DockerHub.

A simple example of using vector embeddings and similarity search queries is shown below. More examples will follow in blog posts and docs in the coming weeks. This example uses Ratel for the schema update, mutations and queries, but you can use any approach.

Set up and install Dgraph and Ratel

Get a Dgraph docker container for the v24 alpha version

docker pull dgraph/standalone:v24.0.0-alpha2

Run a docker container, storing data on your local machine

mkdir ~/dgraph

docker run -d --name dgraph-v24alpha2 -p "8080:8080" -p "9080:9080" -v ~/dgraph:/dgraph dgraph/standalone:v24.0.0-alpha2

Then get and start the ratel tool

docker pull dgraph/ratel
docker run -d --name ratel -p "8000:8000" dgraph/ratel:latest

Ratel will now be running on localhost:8000

Add a schema, data and test queries

Define a DQL Schema. You can set this via the Ratel schema tab using the bulk edit option.

<Issue.description>: string .
<Issue.vector_embedding>: float32vector @index(hnsw(metric:"cosine")) .

type <Issue> {
    Issue.description
    Issue.vector_embedding
}

Notice that the new float32vector type is used, with a new index type of hnsw. The hnsw index can use a distance metric of cosine, euclidean, or dotproduct. Here we use cosine similarity, which works well even if your vectors are not normalized, because it compares the direction of vectors rather than their magnitude.
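
To make the difference between the three metrics concrete, here is a minimal sketch in plain Python. It does not call Dgraph; the helper names (dot, euclidean, cosine_similarity) are illustrative and only mirror the metric names accepted by the index. It shows that scaling a vector changes its dot product and euclidean distance but leaves its cosine similarity unchanged.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [0.8, 0.8, 0.5, 0.0]
b = [0.7, 0.8, 0.5, 0.0]
b_scaled = [10 * x for x in b]  # same direction as b, ten times the length

# Cosine similarity ignores the change in length ...
same_cosine = math.isclose(cosine_similarity(a, b), cosine_similarity(a, b_scaled))
# ... while the other two metrics do not.
dot_grew = dot(a, b_scaled) > dot(a, b)
distance_grew = euclidean(a, b_scaled) > euclidean(a, b)
```

If your embedding model already returns unit-length vectors, cosine and dot product produce the same ordering, so the choice matters mostly for unnormalized data.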

At this point, the database will accept and index float vectors.

Insert some data containing short, test-only embeddings using this DQL Mutation

You can paste this into Ratel as a mutation, or use curl, pydgraph or similar:

{
"set": [
{
    "dgraph.type": "Issue",
    "Issue.vector_embedding": "[0.8, 0.8, 0.5, 0]",
    "Issue.description": "Intermittent timeouts. Logs show no such host error."
},
{
    "dgraph.type": "Issue",
    "Issue.vector_embedding": "[0, 0, 0, 0.7]",
    "Issue.description": "Bug when user adds record with blank surName. Field is required so should be checked in web page."
},
{
    "dgraph.type": "Issue",
    "Issue.vector_embedding": "[0.8, 0, 0.7, 0]",
    "Issue.description": "Delays on responses every 30 minutes with high network latency in backplane"
},
{
    "dgraph.type": "Issue",
    "Issue.vector_embedding": "[0.7, 0.8, 0.5, 0]",
    "Issue.description": "Slow queries intermittently. The host is not found according to logs."
},
{
    "dgraph.type": "Issue",
    "Issue.vector_embedding": "[0.6, 0.3, 1.0, 0]",
    "Issue.description": "Some timeouts. It seems to be a DNS host lookup issue. Seeing No Such Host message."
},
{
    "dgraph.type": "Issue",
    "Issue.vector_embedding": "[0.5, 0.1, 0.7, 0.7]",
    "Issue.description": "Host and DNS issues are causing timeouts in the User Details web page"
}
]
}
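
If you would rather script the mutation than paste it into Ratel, the following is a rough sketch using only the Python standard library. It assumes the standalone container from the setup step is listening on localhost:8080 and uses Dgraph's HTTP /mutate endpoint with commitNow=true. The function names build_mutation_request and send are made up for this example, and the vector is sent as a JSON-encoded string, as in the mutation above.

```python
import json
import urllib.request

# Assumption: the standalone container from the setup step, on the default HTTP port.
DGRAPH_URL = "http://localhost:8080"


def build_mutation_request(issues, base_url=DGRAPH_URL):
    """Build (but do not send) a JSON mutation for (description, embedding) pairs."""
    payload = {
        "set": [
            {
                "dgraph.type": "Issue",
                "Issue.description": description,
                # The vector is passed as a string such as "[0.8, 0.8, 0.5, 0]".
                "Issue.vector_embedding": json.dumps(embedding),
            }
            for description, embedding in issues
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/mutate?commitNow=true",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


request = build_mutation_request([
    ("Intermittent timeouts. Logs show no such host error.", [0.8, 0.8, 0.5, 0]),
])
# Call send(request) once the Dgraph container is running.
```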

A simple query that finds similar issues

You are ready to do similarity queries, to find Issues based on semantic similarity to a new Issue description!

For simplicity, we are not computing large vectors from an LLM. The embeddings above simply represent four concepts, one per vector dimension. In order, these are:

  • Slowness or delays
  • Logging or messages
  • Networks
  • GUIs or web pages

Use case and query

Let’s say a new issue comes in, and you want to use the text description to find other, similar issues you have seen in the past. Use the similarity query below:

If the new issue description is “Slow response and delay in my network!”, we represent this new issue as the vector [0.9, 0.8, 0, 0]. The first “slowness” dimension is high because the description mentions both “slow response” and “delay.” “Logs” is mentioned once, so set dimension two to 0.8. Neither networks nor GUIs are mentioned, so leave those at 0.

Note that the first parameter to similar_to is the DQL field name, the second parameter is the number of results to return, and the third parameter is the vector to look for.

query slownessWithLogs() {
simVec(func: similar_to(
    Issue.vector_embedding,
    3,
    "[0.9, 0.8, 0, 0]")) {
    uid
    Issue.description
}
}
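
As a sanity check on what to expect, here is a small brute-force sketch in plain Python that ranks the six sample embeddings against the query vector by exact cosine similarity. It is not how Dgraph computes results: the hnsw index is an approximate nearest-neighbour structure, so its output could in principle differ, although on a data set this small the two should agree. The names EMBEDDINGS and top_k are hypothetical.

```python
import math

# The six test embeddings from the mutation, in insertion order.
EMBEDDINGS = [
    [0.8, 0.8, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.7],
    [0.8, 0.0, 0.7, 0.0],
    [0.7, 0.8, 0.5, 0.0],
    [0.6, 0.3, 1.0, 0.0],
    [0.5, 0.1, 0.7, 0.7],
]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


nearest = top_k([0.9, 0.8, 0.0, 0.0], EMBEDDINGS, 3)
# nearest == [0, 3, 2]: the two "timeouts / host not found in logs" issues,
# followed by the "delays with high network latency" issue.
```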

If you want to send in data using parameters, rewrite this as

query test($vec: float32vector) {
simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
}
}

And make a request (again using Ratel) with a variable named “vec” set to a JSON array of float values:

vec: [0.9, 0.8, 0, 0]

Curl alternative

Finally, for those who prefer not to use Ratel, you can do all these steps via HTTP tools, such as curl:

curl --location 'http://localhost:8080/query' \
    --header 'Content-Type: application/json' \
    --data ' {
        "query": "query test($vec: float32vector) { simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) { uid Issue.description } }",
        "variables":{"$vec":"[1,0,0,0]"}
    }
'
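
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above (same /query endpoint, same variables shape); the function name build_query_request is invented for the example, and the host and port assume the standalone container from the setup step.

```python
import json
import urllib.request

QUERY = """
query test($vec: float32vector) {
  simVec(func: similar_to(Issue.vector_embedding, 3, $vec)) {
    uid
    Issue.description
  }
}
"""


def build_query_request(vector, base_url="http://localhost:8080"):
    """Build (but do not send) a parameterized similarity query."""
    payload = {
        "query": QUERY,
        # As in the curl example, the vector travels as a string like "[1, 0, 0, 0]".
        "variables": {"$vec": json.dumps(vector)},
    }
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request([1, 0, 0, 0])
# To run it against a live instance:
#   with urllib.request.urlopen(request) as response:
#       print(json.loads(response.read()))
```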

Summing it up

This end-to-end example shows how to insert data with vector embeddings that conform to a schema using the new vector type and an hnsw index with the cosine metric, and how to run a semantic search for Issues via the new similar_to() function in Dgraph.

4 Likes

@gajanan - You should probably edit this to use markdown so that it is clear. Surround the code with three backticks, then add the graphql keyword after the opening backticks.

```graphql
function foo(bar){
return bar+1;
}
```

J

1 Like

Or… go here: Dgraph 24.0.0-alpha is now available on Github and DockerHub - Dgraph Blog

Thanks @jdgamble555. Fixed it.

1 Like

When I skimmed the v24 alpha changelog, I was wondering whether all the performance boosts alluded to in the Q1 Update post are included at this point, or whether they’re forthcoming in future alpha/beta/release candidates.

Does this mean additional updates are forthcoming beyond what is listed now in the v24 alpha changelog?

Yes, that is correct: there are forthcoming changes that will be included in the release candidate and the 24.0.0 GA. Is there any specific improvement that you are waiting on?

I’m not waiting on anything super-specific at this time, however we’re closely monitoring the performance related improvements and will certainly test those out when they become available. Thanks for asking!

This probably isn’t the right venue to ask, but if you’re able to check out the badger commands to verify they work with a maxLevels of 8 (instead of 7), we’d appreciate that. We had to increase levels to accommodate the data volume. We outlined in this post how, if you change Dgraph to run with 8 levels, you cannot set the badger utility commands to also run with 8 levels. It throws index out of range exceptions instead (Unable to reach leader in group 1 - dir structures help - #22 by rahst12)

Excellent! How are the vector similarity results sorted in the response? Most similar to least similar? Is there a “relevancy” score that could be offered?

Yes, and yes.

The results are sorted by similarity according to the metric used (cosine, euclidean, dotproduct), and by using the new dot operator you can also compute the score in results (the distance according to the metric, so lower distance = higher score). We may simplify this so including custom math functions in the result is not required. You can also return the vectors in a query, and compute custom scores or re-ordering in a client.

I see this as a likely replacement for any keyword or text search, so that relevance-based semantic results come back instead of pure keyword or term matches.
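
For the client-side option, a rough sketch in Python might look like the following. The response shape (a list of nodes, each carrying its vector as a list of floats under Issue.vector_embedding), the uids, and the rerank helper are all assumptions made for illustration. The score uses the common convention cosine distance = 1 - cosine similarity, which may not match exactly what Dgraph reports.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so a lower value means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rerank(nodes, query_vector, vector_field="Issue.vector_embedding"):
    """Attach a distance to each returned node and sort closest first."""
    scored = [
        dict(node, distance=cosine_distance(node[vector_field], query_vector))
        for node in nodes
    ]
    return sorted(scored, key=lambda node: node["distance"])


# Hypothetical nodes, as they might come back from a query that also
# requests the vector predicate. The uids are invented.
sample_nodes = [
    {"uid": "0x3", "Issue.vector_embedding": [0.8, 0.0, 0.7, 0.0]},
    {"uid": "0x1", "Issue.vector_embedding": [0.8, 0.8, 0.5, 0.0]},
]

ranked = rerank(sample_nodes, [0.9, 0.8, 0.0, 0.0])
# ranked[0] is the node with uid "0x1", the closer of the two.
```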

1 Like

@RickSalmon to provide more details.
At the lowest level (DQL) we are introducing:

  • a predicate type float32vector
  • a math function dot
  • a query function similar_to
  • an index type hnsw.

With those new features, you can retrieve similar nodes, obtain their vector predicates, and compute a distance or similarity score (dot, euclidean, or cosine). Using Dgraph variables, you can order results by distance as you like.

In the GraphQL layer (coming soon), we go a step further and automatically generate query functions for similarity search that return the distance out of the box. We will update the documentation accordingly.

1 Like

To support true vectors I would assume you would have to support repeating and sorted arrays. Is this a correct assumption? Will we see these things coming to the regular “array” types as well?

3 Likes

You are correct. We have introduced a new type float32vector that holds a vector.
The term “list” or “array” is a misnomer in our doc as Dgraph is handling lists as “sets” (we’ll double check if it is clearly explained).
A set of strings makes sense but a set of floats does not. We didn’t want to introduce breaking changes so we have not changed the current behavior of lists, including list of floats.
No plans to introduce a regular “array” for other types at this stage.
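
To illustrate why set semantics do not work for embeddings, here is a tiny Python sketch. It uses Python's built-in set as a stand-in for the behaviour described above and does not touch Dgraph itself: duplicates and ordering are lost, so two different vectors can collapse to the same set.

```python
# One of the sample embeddings: the value 0.8 appears twice.
embedding = [0.8, 0.8, 0.5, 0.0]

# Stored with set semantics, the duplicate 0.8 disappears and the
# position of each value is no longer defined.
as_set = set(embedding)

# Two different vectors that contain the same values in other positions ...
first = [0.8, 0.0, 0.7, 0.0]
second = [0.0, 0.8, 0.7, 0.0]

# ... are indistinguishable once they are treated as sets.
collide = set(first) == set(second)
```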

1 Like

If only sets and not arrays, then Dgraph is not GraphQL compliant, just FYI.

2 Likes

Adding examples to compute vector distances and similarity scores in DQL using dot product function.

2 Likes

Can you also add examples with GraphQL?

@iyinoluwaayoola Dgraph 24.0.0-alpha3 is now available on Github and DockerHub - Dgraph Blog describes the GraphQL implementation.
We made it easier in GraphQL: two new queries are auto-generated for each type containing at least one vector with index: querySimilar<Thing>ByEmbedding and querySimilar<Thing>ById. For each query, you can request the vector_distance field which is internally computed using the metric defined on the hnsw index. Doc will be updated with the content of those blog posts.

1 Like

Thanks for the ref @Raphael but what is the rationale for this api approach? Why not add the filter to the existing queryX api and keep things consistent? I.e,

queryProject(filter: { title: { title_v: { topK: …} }}){
title
}

@iyinoluwaayoola, we explored the use of filters. Filters express conditions on the value of predicates and use graph traversal (relationships). Similarity based on vector embeddings introduces a new concept: the similarity is not a materialized relationship, so supporting it would require a lot of changes to filters. We found it a good first step to introduce similarity search at the root level with the generated queries. I can see valuable use cases for having similarity in filters at nested levels. Please share details when you face those use cases, and we will see what can be done.
The two GraphQL queries using vector search open a lot of interesting use cases already. (RAG on Graph, multi-modal search as we support multiple embeddings for the same node, recommendation, semantic search, etc …).

1 Like

Thanks for explaining @Raphael. I was thinking along the lines of nested queries, as my primary use case, which is the reason for my question. Having it as part of filters would certainly cover the use cases you’ve mentioned and is therefore most desirable.