Postgres is not a dedicated vector database. It is an advanced, enterprise-grade open-source relational database queried with SQL. However, PostgreSQL supports a variety of modules and extensions, and with the pgvector extension it can store vectors and act as a vector database. Here, let's discuss how we can use PostgreSQL as a vector database from Python.

Vector databases store vector embeddings for fast retrieval, similarity search, and the usual CRUD operations. Simply put, an embedding is a numerical array that encodes a huge number of features, so a vector database lets us perform many useful operations on that numerical representation. In traditional databases like MySQL, PostgreSQL, and SQL Server, we usually query for rows whose values exactly match our input query. In a vector database, we instead apply a similarity metric to find the vectors most similar to our query. There are many dedicated vector databases out there, such as Pinecone, Qdrant, and Chroma DB. So, let's learn how we can use Postgres as a vector database.

## Establish a Postgres Connection with Python

To establish the connection between the PostgreSQL database and Python, we have to install the psycopg2 module. We can use the following command from the terminal to install it:

```shell
pip install psycopg2
```

Then we can connect to a database using Python:

```python
import psycopg2

connection = psycopg2.connect(
    database="database name that you want to connect",
    user="username used to authenticate",
    host="database server address: localhost or an IP address",
    password="password used to authenticate",
    port="port number; the default is 5432",
)
```

Next, set the auto-commit property to True. This means that each SQL statement we execute will be treated as a transaction and automatically committed to the database.
Otherwise, we would have to commit manually after each transaction.

```python
connection.autocommit = True
```

Then, let's create a cursor object, which is used to manage the context of a fetch operation in the database:

```python
cur = connection.cursor()
```

## Enable the Vector Extension

Now let's enable the `vector` extension (pgvector) so we can work with vector embeddings in the Postgres database. This only needs to be done once in each database where you want to use it:

```python
cur.execute("CREATE EXTENSION vector;")
```

## Initialize an Embedding Technique

We can use any embedding method to generate embeddings for sentences. Here, I chose a sentence-transformer model:

```python
from sentence_transformers import SentenceTransformer

embedding_model_name = 'all-MiniLM-L6-v2'
embedding_length = 384
model = SentenceTransformer(embedding_model_name)
```

Click here to see a list of pre-trained models that can be used for our task. Note that each embedding model has its own embedding length.

## Create a Collection

Let's create a simple collection with id, text, and embedding columns:

```python
# create a collection
collection_name = 'collection1'
cur.execute(
    "CREATE TABLE IF NOT EXISTS {} ("
    "id SERIAL PRIMARY KEY, "
    "text TEXT, "
    "embedding vector({}))".format(collection_name, embedding_length)
)
```

## Get All Collections

```python
# retrieve all collections
cur.execute(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"
)
cur.fetchall()
```

## Insert Data

Let's insert sample data into our created collection:

```python
# perform embedding
texts = [
    "Language Models are designed to understand and generate human-like text based on patterns learned from vast amounts of data",
    "Blockchain is a decentralized, distributed ledger technology that records transactions across a network of computers",
    "Unsupervised learning algorithms explore and extract patterns from unlabeled data, discovering inherent structures.",
]
embeddings = model.encode(texts)

# insert data into a collection
collection_name = 'collection1'
for text, embedding in zip(texts, embeddings):
    text = f"'{text}'"
    embedding = f"'{embedding.tolist()}'"  # pgvector accepts a '[x, y, ...]' text literal
    query = "INSERT INTO {} (text, embedding) VALUES ({}, {})"
    cur.execute(query.format(collection_name, text, embedding))
```

## Update Data

```python
id = 1
text = "Evaluating the quality of text generated by LLMs involves assessing factors like coherence, relevance, grammaticality, and avoidance of biases"
embedding = model.encode(text)
text = f"'{text}'"
embedding = f"'{embedding.tolist()}'"
query = "UPDATE {} SET text = {}, embedding = {} WHERE id = {};"
cur.execute(query.format(collection_name, text, embedding, id))
```

## Indexing

Indexing techniques enable a faster search. For that, we can use hashing, quantization, or graph-based techniques. Postgres supports a few index types. Here we use the IVFFlat type, which divides the vectors into lists and then searches only the subset of those lists closest to the query vector. Click here to read more about it.

```python
# create an index on cosine distance
collection_name = 'collection1'
cur.execute(
    "CREATE INDEX ON {} USING ivfflat (embedding vector_cosine_ops);".format(collection_name)
)
```

## Query

When a vector database receives a query, it compares the indexed vectors to the query vector to determine the nearest neighbors. To do that, we can use different similarity measures. The following example uses cosine distance (the `<=>` operator). Other than that, we can use Euclidean distance or the inner product.

```python
collection_name = 'collection1'
query = "SELECT * FROM {} ORDER BY embedding <=> {} LIMIT 1;"
new_text = 'Fine-tuning a Language Model involves tailoring its learned knowledge to a specific domain or task, enhancing its performance in specialized applications'
new_embedding = f"'{model.encode(new_text).tolist()}'"
cur.execute(query.format(collection_name, new_embedding))
result = cur.fetchall()
print(result)
```

Click here to see how we can use other distance measures while querying data.
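To make the operator semantics concrete, here is a small pure-Python sketch (my own illustration, not part of pgvector) of what each pgvector distance operator computes: `<->` is Euclidean (L2) distance, `<#>` is the negative inner product, and `<=>` is cosine distance.

```python
import math

def l2_distance(a, b):
    # pgvector's <-> operator: Euclidean (L2) distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    # pgvector's <#> operator: the dot product, negated
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # pgvector's <=> operator: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

a, b = [1.0, 2.0, 3.0], [2.0, 0.0, 1.0]
print(l2_distance(a, b))             # 3.0
print(negative_inner_product(a, b))  # -5.0
print(cosine_distance(a, b))         # ~0.4024
```

Because the index above was created with `vector_cosine_ops`, it accelerates only `<=>` queries; an index for `<->` would use `vector_l2_ops` instead.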
## Distance

Using the following code sample, we can get similarity scores with any of the distance measures. Here, cosine similarity is derived from the cosine distance operator:

```python
collection_name = 'collection1'
new_text = 'Fine-tuning a Language Model involves tailoring its learned knowledge to a specific domain or task, enhancing its performance in specialized applications'
new_embedding = f"'{model.encode(new_text).tolist()}'"
query = "SELECT 1 - (embedding <=> {}) AS cosine_similarity FROM {};"
cur.execute(query.format(new_embedding, collection_name))
result = cur.fetchall()
print(result)
```

## Delete a Collection

```python
# drop a collection
collection_name = 'collection1'
cur.execute("DROP TABLE {};".format(collection_name))
```

Jupyter Notebook
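One closing caveat: with no vector type adapter registered, psycopg2 returns the embedding column from `fetchall()` as a plain text literal such as `'[0.1,0.2,0.3]'`. A tiny helper (a hypothetical name of my own, assuming pgvector's default `[x,y,z]` text output) can convert it back into Python floats; alternatively, the `pgvector` Python package can register a proper adapter for you.

```python
def parse_vector(literal):
    """Parse a pgvector text literal like '[0.1,0.2,0.3]' into a list of floats.

    Hypothetical helper; assumes pgvector's default text representation.
    """
    return [float(part) for part in literal.strip().strip("[]").split(",") if part]

print(parse_vector("[0.1,0.2,0.3]"))  # [0.1, 0.2, 0.3]
```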