Postgres as a Vector Database using Python

Postgres is not a dedicated vector database; it is an advanced, enterprise-grade open-source relational database queried with SQL. However, PostgreSQL supports a rich ecosystem of modules and extensions, and with a vector extension it can store vector embeddings and serve as a vector database. Here, let's discuss how we can use PostgreSQL as a vector database with Python.


Vector databases store vector embeddings for fast retrieval, similarity search, and standard CRUD operations. An embedding is simply a numerical array that encodes a large number of features, so a vector database lets us perform many useful operations on that numerical representation.

In traditional databases like MySQL, PostgreSQL, and SQL Server, we usually query for rows whose values exactly match the input query. In a vector database, we instead apply a similarity metric to find the vectors most similar to the query. There are many dedicated vector databases out there, such as Pinecone, Qdrant, and Chroma DB.
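The idea behind a similarity metric can be sketched in plain NumPy. Here is a minimal cosine-similarity example using small hypothetical vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes;
    # 1.0 means identical direction, -1.0 means opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0, 1.0])
doc_a = np.array([1.0, 0.1, 0.9])   # points in nearly the same direction
doc_b = np.array([-1.0, 0.5, 0.0])  # points elsewhere

print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

A similarity search simply ranks all stored vectors by such a score and returns the top matches.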

So, let’s learn how we can use Postgres as a vector database.

Establish Postgres connection with Python

To establish a connection between a PostgreSQL database and Python, we have to install the psycopg2 module (the prebuilt psycopg2-binary package is the easiest way to get started). We can install it from the terminal:

pip install psycopg2-binary

Then we can connect to the database from Python:

import psycopg2

connection = psycopg2.connect(
    database="name of the database you want to connect to",
    user="username used to authenticate",
    host="database server address, e.g. localhost or an IP address",
    password="password used to authenticate",
    port="port number (the default is 5432)",
)

Set the autocommit property to True. This means each SQL statement we execute is treated as its own transaction and committed to the database immediately; otherwise, we would have to call connection.commit() manually after each transaction.

connection.autocommit = True

Then, let's create a cursor object, which manages the context of fetch operations on the database.

cur = connection.cursor()

Enable Vector Extension

Now let's enable the 'vector' extension so we can work with vector embeddings in the Postgres database (run this once in each database where you want to use it):

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")

Initialize Embedding Technique

We can use any embedding method to generate embeddings for sentences. Here, I choose a Sentence Transformers model.

from sentence_transformers import SentenceTransformer
embedding_model_name = 'all-MiniLM-L6-v2'
embedding_length = 384
model = SentenceTransformer(embedding_model_name)

The Sentence Transformers documentation lists the pre-trained models that can be used for this task. Each embedding model has its own embedding length (384 for all-MiniLM-L6-v2).

Create Collection

Let’s create a simple collection with id, text, and embedding.

# Create a collection
collection_name = 'collection1'
cur.execute(
    "CREATE TABLE IF NOT EXISTS {} (id SERIAL PRIMARY KEY, text TEXT, embedding vector({}))".format(
        collection_name, embedding_length
    )
)

Get All Collections

# retrieve all collections (i.e. tables in the public schema)
cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';")
print(cur.fetchall())

Insert Data

Let’s insert sample data into our created collection.

# perform embedding
texts = ["Language Models are designed to understand and generate human-like text based on patterns learned from vast amounts of data",
"Blockchain is a decentralized, distributed ledger technology that records transactions across a network of computers",
"Unsupervised learning algorithms explore and extract patterns from unlabeled data, discovering inherent structures."]

embeddings = model.encode(texts)

# insert data into the collection (parameterized so quotes in the text cannot break the SQL)
collection_name = 'collection1'
for text, embedding in zip(texts, embeddings):
    cur.execute(
        "INSERT INTO {} (text, embedding) VALUES (%s, %s)".format(collection_name),
        (text, str(embedding.tolist())),
    )

Update Data

id = 1
text = "Evaluating the quality of text generated by LLMs involves assessing factors like coherence, relevance, grammaticality, and avoidance of biases"

embedding = model.encode(text)

collection_name = 'collection1'
cur.execute(
    "UPDATE {} SET text = %s, embedding = %s WHERE id = %s;".format(collection_name),
    (text, str(embedding.tolist()), id),
)

Indexing

Indexing techniques enable faster (approximate) search, using hashing, quantization, or graph-based approaches. The vector extension supports a few index types in Postgres.

Here we use the IVFFlat index type, which divides the vectors into lists and then searches only the subset of lists closest to the query vector. See the pgvector documentation to read more about it.
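To build intuition for how IVFFlat narrows the search, here is a toy NumPy sketch of the same idea (illustrative only; pgvector builds its lists with k-means clustering and its own storage format, and the `probes` count below mirrors pgvector's ivfflat.probes setting):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))                      # 100 stored embeddings
centroids = vectors[rng.choice(100, 4, replace=False)]   # 4 "lists"

# Assign each vector to its nearest centroid: these are the inverted lists.
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

def ivf_search(query, probes=1):
    # Probe only the `probes` closest lists instead of scanning every vector.
    dists_to_centroids = np.linalg.norm(centroids - query, axis=1)
    probed = np.argsort(dists_to_centroids)[:probes]
    candidates = np.where(np.isin(assignments, probed))[0]
    best = candidates[np.argmin(np.linalg.norm(vectors[candidates] - query, axis=1))]
    return int(best)

query = rng.normal(size=8)
print(ivf_search(query, probes=2))
```

With more probes, the search scans more lists and the result approaches the exact nearest neighbor, at the cost of speed; that is the recall/latency trade-off IVFFlat exposes.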

# create indexing on cosine distance
collection_name = 'collection1'
cur.execute("CREATE INDEX ON {} USING ivfflat (embedding vector_cosine_ops);".format(collection_name))

Query

When a vector database receives a query, it compares the indexed vectors to the query vector to determine the nearest neighbors, using some distance or similarity measure. The following example uses cosine distance; alternatively, we can use Euclidean distance or the inner product.
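pgvector exposes these three measures as SQL operators: `<->` (Euclidean/L2 distance), `<#>` (negative inner product), and `<=>` (cosine distance). Their behavior can be mimicked in NumPy with small hypothetical vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])

l2 = np.linalg.norm(a - b)        # <->  Euclidean (L2) distance
neg_inner = -np.dot(a, b)         # <#>  negative inner product
cos_dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # <=>  cosine distance

print(l2, neg_inner, cos_dist)
```

All three are distances, so ORDER BY puts the closest vectors first; that is also why `<#>` negates the inner product (a larger inner product means more similar).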

collection_name = 'collection1'

new_text = 'Fine-tuning a Language Model involves tailoring its learned knowledge to a specific domain or task, enhancing its performance in specialized applications'

new_embedding = model.encode(new_text)

cur.execute(
    "SELECT * FROM {} ORDER BY embedding <=> %s LIMIT 1;".format(collection_name),
    (str(new_embedding.tolist()),),
)

result = cur.fetchall()
print(result)

To use another distance measure while querying, swap the operator in ORDER BY: `<->` for Euclidean distance or `<#>` for negative inner product instead of `<=>`.

Distance

Using the following code sample, we can compute the actual distance score (here, cosine similarity) between the query embedding and every stored embedding; the same pattern works with any of the distance operators.

collection_name = 'collection1'

new_text = 'Fine-tuning a Language Model involves tailoring its learned knowledge to a specific domain or task, enhancing its performance in specialized applications'

new_embedding = model.encode(new_text)

cur.execute(
    "SELECT 1 - (embedding <=> %s) AS cosine_similarity FROM {};".format(collection_name),
    (str(new_embedding.tolist()),),
)

result = cur.fetchall()
print(result)

Delete Collection

# drop a collection
collection_name = 'collection1'
cur.execute("DROP TABLE {};".format(collection_name))
