Saturday, July 15, 2023

Unleashing the Power of OpenAI Embeddings in Vector Databases

Embeddings and vector databases play a vital role in AI product development. In this article, I will explore what they are, how they are used, and how they integrate with OpenAI APIs. By the end, you'll be able to create long-term memory for a chatbot using GPT or run semantic searches over a large PDF collection connected to an AI system. Let's begin by understanding embeddings and vector databases, two closely related terms.
OpenAI Embeddings and Vector Databases

First, let's delve into embeddings. Simply put, an embedding refers to data, such as words, that has been transformed into a numeric array called a vector. These vectors capture patterns and relationships within the data. Each number within the vector represents a different aspect, forming a multidimensional map that enables measurement and analysis.

Similarly, images can also be converted into vectors. Google utilizes this technique for similar image searches, where images are transformed into arrays of numbers. This allows for the identification of patterns of similarity among images with closely resembling vectors. Once an embedding is generated for a set of data points, such as words or sentences, it can be stored in a database.

A vector database is created by collecting and organizing these embeddings, allowing for efficient storage and retrieval of information. For instance, it can be used for searching, where results are ranked based on their relevance to a query string. It can also be employed for clustering, grouping text strings based on their similarity, or for recommendations, suggesting items with related text strings. Furthermore, vector databases can facilitate text classification, where text strings are categorized based on their most similar label.
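To make the idea of ranking by relevance concrete, here is a minimal sketch in JavaScript. The three-number vectors and the function names are invented for illustration; real embeddings have many more dimensions, and cosine similarity is only one common way to compare them.

```javascript
// Toy 3-dimensional "embeddings"; the numbers are made up for illustration.
const documents = [
  { text: "dog", vector: [0.9, 0.1, 0.0] },
  { text: "cat", vector: [0.8, 0.2, 0.1] },
  { text: "car", vector: [0.1, 0.9, 0.3] },
];

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored vector against the query and sort, best match first.
function rankBySimilarity(queryVector, docs) {
  return docs
    .map((doc) => ({ text: doc.text, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankBySimilarity([1, 0, 0], documents);
console.log(ranked.map((r) => r.text)); // order: dog, cat, car
```

A vector database does the same kind of scoring, only over many stored rows and with far longer vectors.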

For this article, the focus will be on searching since it's commonly used. OpenAI provides an excellent AI model specifically designed for creating embeddings. However, it does not offer a built-in method for storing them. Hence, a cloud database will be utilized later in the article.

Now, let's begin creating an embedding by accessing OpenAI's resources.

Setting Up OpenAI Embeddings


To begin working with OpenAI embeddings, you can follow the steps below:

Step 1: OpenAI Website

Open your web browser and go to www.openai.com. There you can either create a new account or log in with your existing credentials. Creating an account is free.

Step 2: Accessing the API Page

After logging in, access the API page within the OpenAI website. This page will provide you with a dashboard where you can explore various OpenAI APIs, including embeddings. Look for the relevant section or link that leads to embeddings.

Step 3: Documentation on Embeddings

Within the dashboard, locate the documentation specifically related to embeddings. This documentation will provide comprehensive information on how to work with embeddings. Click on the documentation link to access detailed instructions, guidelines, and examples.

Creating API Requests for Embeddings


To create API requests for generating embeddings, follow these steps:

Step 1: API References

Within the embedding documentation, you will find the API references section. This section contains all the necessary details, including the required request format, endpoints, and parameters, to construct a successful API request for generating embeddings.

Step 2: Request Format

Refer to the API reference to understand the required format and structure of the request. Typically, this means building a POST request that contains the required inputs, such as the text you want to embed, and sending it to the designated API endpoint.
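As a rough sketch of that request shape, the snippet below assembles the pieces of a POST request without sending it. The endpoint and the "model" and "input" fields follow OpenAI's public API reference; the function name, the placeholder key, and the default model are assumptions made for this example.

```javascript
// Builds the parts of an embeddings request; nothing is sent over the network.
// buildEmbeddingRequest is a hypothetical helper name used only in this sketch.
function buildEmbeddingRequest(text, apiKey, model = "text-embedding-ada-002") {
  return {
    url: "https://api.openai.com/v1/embeddings",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      // The body carries the model name and the text to embed.
      body: JSON.stringify({ model, input: text }),
    },
  };
}

const request = buildEmbeddingRequest("Hello world", "sk-placeholder");
console.log(request.options.body);
// prints {"model":"text-embedding-ada-002","input":"Hello world"}
```

The same URL, headers, and body are what you would enter by hand in a tool such as cURL or Postman.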

Step 3: Implementation Options

You have several implementation options to execute the API request for embeddings. You have the freedom to write code in any programming language to handle the request programmatically. Another option is to execute the request using tools like cURL or HTTP libraries in a terminal. Additionally, OpenAI provides an interactive interface on their website where you can directly input your request and receive the response.

To set up API requests for creating embeddings using Postman, follow these steps:

1. Postman Overview: Postman is an API platform available as both a desktop application and a web app. It offers a wide range of features for building and sending API requests.

2. Obtain Postman: If you don't have Postman yet, visit the Postman website, download the version for your operating system, install it, and launch it.

3. Establish a Workspace: In Postman, create a new workspace for your API work. Give it a relevant name, such as "OpenAI Vector Database," and choose the appropriate workspace type, personal or team.

4. Create a Request: Open a new tab in Postman, much like opening a new tab in a web browser. In this tab you will configure the request that generates an embedding.

5. Request Type: Select the appropriate request type. In this case, choose a "POST" request.

6. Specify the URL: Enter the URL for the embeddings API, which can be found in the OpenAI documentation.

7. Authorization: Postman will indicate that you require authorization to use the API endpoint. Follow the instructions provided to generate an API key on the OpenAI website. Copy the generated API key and paste it into Postman's authorization section.

8. Configure the Request Body: The OpenAI website provides all the necessary information to create the request. Specify the model and input for the embedding you want to create. For example, you can use OpenAI's text embedding model (e.g., text-embedding-ada-002) and provide any text input you like.

9. Send the Request: Once the request configuration is complete, click the "Send" button in Postman to send the POST request to OpenAI's API.

10. Review the Response: Postman will display the response received from OpenAI's API. This response confirms that the embedding creation was successful.
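For orientation, here is a hedged sketch of what the response looks like and how the vector can be pulled out of it. The overall shape (a "data" array whose first item holds an "embedding" array) follows OpenAI's API reference; the sample numbers are invented and shortened, and extractEmbedding is a hypothetical helper name.

```javascript
// A trimmed-down sample of the response shape. A real embedding from
// text-embedding-ada-002 contains 1536 numbers, not three.
const sampleResponse = {
  object: "list",
  data: [{ object: "embedding", index: 0, embedding: [0.0023, -0.0093, 0.0157] }],
  model: "text-embedding-ada-002",
  usage: { prompt_tokens: 2, total_tokens: 2 },
};

// Returns the vector for the first input, or throws if none is present.
function extractEmbedding(response) {
  if (!response || !Array.isArray(response.data) || response.data.length === 0) {
    throw new Error("Response contains no embedding data");
  }
  return response.data[0].embedding;
}

console.log(extractEmbedding(sampleResponse).length); // prints 3
```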

Our first embedding is quite straightforward, and if you preview it, you'll see a vector representation consisting of numerous numbers.

Different examples of embeddings that can be created:

1. Single-Word Embeddings: These are embeddings generated for individual words, such as "dog" or "cat." The embedding vector captures the semantic information associated with the word. Single-word embeddings are useful for tasks like search operations.

2. Multi-Word Embeddings: It is also possible to create embeddings for short sentences or phrases, like "OpenAI vectors and embeddings are easy." These embeddings capture the meaning of the phrase or sentence as a whole, giving a more nuanced representation, although to a human reader the resulting vector looks much like any other: a long list of numbers.

3. Chunking Large Texts: The true power of embeddings lies in chunking together significant amounts of information, such as paragraphs or entire document sections. By creating embeddings for larger text chunks, like entire sections of documents, you can generate embeddings that serve as a valuable resource for searching large databases.

To illustrate the scale of embedding large texts, imagine copying an entire page from a non-disclosure agreement (NDA) and pasting it into the input field in Postman, with line breaks and extra spaces removed. That is a substantial amount of text, yet generating an embedding for it takes about as long as generating one for a single word.
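If a document is too long to embed in one request, one simple approach is to clean up the whitespace and split the text into chunks. The sketch below is only one possible strategy; the function names and the character limit are assumptions, and the model's actual input limit is measured in tokens rather than characters.

```javascript
// Collapse line breaks and repeated spaces into single spaces.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Split text into chunks of at most maxChars characters, breaking on spaces.
// A single word longer than maxChars is kept whole as its own chunk.
function chunkText(text, maxChars = 1000) {
  const words = normalizeWhitespace(text).split(" ");
  const chunks = [];
  let current = "";
  for (const word of words) {
    if (word === "") continue; // only happens when the input was empty
    const candidate = current === "" ? word : `${current} ${word}`;
    if (candidate.length > maxChars && current !== "") {
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

console.log(chunkText("one  two\nthree four", 9)); // [ 'one two', 'three', 'four' ]
```

Each chunk can then be embedded separately and stored as its own row.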

Now that we can generate embeddings, the next step is to store them. OpenAI does not provide a database for this purpose, so we need to set up our own. A database filled with embeddings is commonly referred to as a vector database.

Vector Database


Setting Up a Vector Database with SingleStore

To set up a vector database with SingleStore, follow these steps:

Provider Introduction: SingleStore is a provider that offers a real-time unified distributed SQL database. It is cloud-based and user-friendly. One of its advantages is the ability to incorporate vector databases seamlessly.

Account Setup: Start by creating an account with SingleStore. Account registration is free and includes additional credits and the ability to set up unlimited databases. If you already have an account, sign in using your Google credentials on the main dashboard.

Workspace Creation: After signing in, create a workspace to manage your database. Give it a suitable name. This workspace serves as the environment for your vector database. You can create a new workspace from the dashboard.

1. Vector Database - Creating a Database


Next, you will create a database for your workspace.

Selecting the Cloud Platform:

Choose AWS, Google, or Microsoft Azure as the cloud platform. I prefer AWS, but the choice is yours. For the region, pick the one closest to you or keep the default, US West.

Configuring Workspace Settings:

Next, you can fine-tune workspace parameters such as size, CPU allocation, and RAM allocation. For this tutorial, the basics are enough, so proceed with the default advanced settings.

Workspace Setup:

The workspace is set up in the background. Once setup completes, you can create your database.

Creating a Database:

Opening the Workspace and Creating the Database:

Dismiss any informational panels and open your workspace. The workspace interface appears on the left. On the right side, select the "Create a Database" option.

Naming the Database:

Give the database a name, such as "Open AI Database," or any other name you prefer. Then click the create button to start the creation process.

Viewing the Database:

Once creation finishes, your database appears in the interface, where you can inspect its details. At this stage, the database does not yet contain any tables or data.

Working with the SQL Editor:

Next, open the SQL editor, where you can run commands to create tables and insert data into your database.

2. Vector Database - Creating a Table


This section focuses on creating a table within the Vector Database.

Selecting the Database:

Begin by selecting the database you are currently using, specifically the "Open AI Database."

Writing the SQL Query:

Construct a simple SQL query using the "CREATE TABLE IF NOT EXISTS" syntax. The table will be named "MyVectorTable," followed by the column definitions enclosed in parentheses.

Defining Column Types:

Specify the column types for the table. For simplicity in this demonstration, we will use just two columns: a "text" column of type TEXT for the original text and a "vector" column of type BLOB for the embedding.

Executing the Command:

Execute the SQL query to create the table and confirm its successful creation in the log.
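Put together, the statement described above might look like the following. It is held in a JavaScript string here so that it can be pasted into the SQL editor or sent from code later; the exact statement the author ran is not shown in the article, so treat this as a reconstruction from the description.

```javascript
// Reconstructed from the description: one TEXT column for the original text
// and one BLOB column for the packed embedding.
const createTableSql = `
CREATE TABLE IF NOT EXISTS MyVectorTable (
  text TEXT,
  vector BLOB
);`.trim();

console.log(createTableSql);
```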

Viewing the Table:

Access the database interface to view the created table's structure, including the defined columns and their respective data types.

Additional Column Types:

While more complex tables can include column types such as integers, numbers, decimals, and attributes like ID or URL, for this demonstration, we will maintain a straightforward structure.

Examining Sample Data:

Explore the sample data section to observe the absence of any existing rows in the newly created table.

Creating the First Row:

Proceed with creating the initial row and inserting a vector into the database.

Vector Database - Inserting an Embedding Row

In the SQL editor, we will now write a new statement to insert a row into the database.

Inserting into the Table:

Use the "INSERT INTO" syntax followed by the table name, which in this case is "MyVectorTable." In parentheses, list the columns we want to fill, which are "text" and "vector," and then provide the corresponding values.

Obtaining Input:

Copy the input from Postman, for example, "Hello world," and paste it into the values section.

Handling the Embedding:

For the embedding, use the "JSON_ARRAY_PACK" function to convert the array of numbers into the required blob structure. In Postman, locate the embedding value and copy it, then paste it into the query as the function's argument.

Executing the Command:

Run the command to insert the embedding into the vector database. The result should indicate the successful addition of one row.
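When the insert is issued from code rather than typed into the editor, the statement can be built with placeholders instead of pasted values. This is a sketch under assumptions: the "?" placeholder style is the one used by common MySQL-compatible Node.js clients, and buildInsert is a hypothetical helper name. JSON_ARRAY_PACK receives the embedding as a JSON array string.

```javascript
// Builds a parameterized INSERT; the caller passes sql and params to a
// MySQL-compatible client of their choice (not shown here).
function buildInsert(text, embedding) {
  return {
    sql: "INSERT INTO MyVectorTable (text, vector) VALUES (?, JSON_ARRAY_PACK(?))",
    // The embedding is serialized to a JSON array string such as "[0.1,0.2]".
    params: [text, JSON.stringify(embedding)],
  };
}

const insert = buildInsert("Hello world", [0.1, 0.2]);
console.log(insert.params[1]); // prints [0.1,0.2]
```

Using placeholders also avoids quoting problems when the original text contains apostrophes.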

Viewing the Result:

Check the database interface, specifically the "MyVectorTable" under the "Sample Data" section. The newly added row should be visible.

Adding More Data:

To expand the database, retrieve additional examples from Postman. Copy the input and embedding values, and insert them as rows into the database. This process can be repeated for multiple examples.

Reviewing the Sample Data:

Return to the database and examine the "Sample Data" section. You should now observe multiple rows with simple yet sufficient data for further operations.

Performing Searches:

Searching a vector database for embeddings is relatively straightforward. The first step involves identifying the desired search term. Next, create an embedding for the search term. Finally, perform a search against the existing embeddings in the database, which will return a list ranked by similarity, with the closest matches at the top.

Vector Database - Searching Embeddings


Proceeding to the SQL editor, we will now construct a query to search for embeddings.

Writing the Query:

Use the "SELECT" statement to retrieve the "text" column. Use the "DOT_PRODUCT" function to calculate a score, passing in the "vector" column and the search vector. Wrap the search vector's array of values in the "JSON_ARRAY_PACK" function so that it is in the same packed format as the stored vectors. Query "MyVectorTable," order the results by score in descending order, and limit them to five.

Creating an Embedding for Searching:

In Postman, create an embedding for the desired search term, such as "open AI." Copy the corresponding vector and paste it into the SQL query. This vector will be utilized as a reference for the scoring system.

Executing the Command:

Run the query and observe the results. The search will return vectors that are similar to the search term, with scores indicating the ranking of similarity.
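The search query described in this section could be assembled along these lines. As before, this is a reconstruction rather than the author's exact query, and buildSearch and the "?" placeholder style are assumptions for illustration.

```javascript
// Builds the similarity search: score each row with DOT_PRODUCT against the
// packed query vector, then return the highest-scoring rows first.
function buildSearch(queryEmbedding, limit = 5) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  return {
    sql:
      "SELECT text, DOT_PRODUCT(vector, JSON_ARRAY_PACK(?)) AS score " +
      `FROM MyVectorTable ORDER BY score DESC LIMIT ${limit}`,
    params: [JSON.stringify(queryEmbedding)],
  };
}

const search = buildSearch([0.1, 0.2]);
console.log(search.sql);
```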

Performing Multiple Searches:

To further explore, create another embedding, such as "hello Earth," and use its corresponding vector for the search. Execute the query to observe any changes in the results. The scores determine the similarity ranking, with higher scores indicating closer matches to the content already in the database.
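Why does a higher dot product mean a closer match? OpenAI's documentation describes its embeddings as normalized to length 1, and for unit-length vectors the dot product equals the cosine similarity, so it can never exceed 1 and is largest for vectors pointing in nearly the same direction. The sketch below demonstrates this with invented three-number vectors standing in for real embeddings.

```javascript
function dotProduct(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Scale a vector to length 1 so that dot products behave like cosine similarity.
function normalize(v) {
  const length = Math.sqrt(dotProduct(v, v));
  return v.map((x) => x / length);
}

// Made-up stand-ins for stored embeddings.
const stored = [
  { text: "Hello world", vector: normalize([3, 4, 0]) },
  { text: "OpenAI vectors and embeddings are easy", vector: normalize([0, 1, 5]) },
];

// Made-up stand-in for the embedding of the search term "hello Earth".
const query = normalize([3, 3, 1]);

const scores = stored
  .map((row) => ({ text: row.text, score: dotProduct(row.vector, query) }))
  .sort((x, y) => y.score - x.score);

console.log(scores[0].text); // prints Hello world
```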

Creating a Function with JavaScript:

This section will involve using JavaScript on Node.js to interact with embeddings. Begin by creating a new folder named "openai-vectors-embeddings."

Embedding Function with JavaScript and Node.js

To integrate with the OpenAI API, we will write a function using JavaScript and Node.js. In the "index.js" file, we first set up the headers needed to connect to the OpenAI API.

Setting Up Headers:

Begin by defining the required headers in the code. Set the request method to "POST" and the content type to "application/json." Add the authorization token as a bearer token; for simplicity it can be passed directly in this example. In real use, however, store the token securely, ideally in an environment variable.
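One way to keep the token out of the source file is to read it from the environment when the headers are built. In this sketch, the variable name OPENAI_API_KEY and the helper name buildHeaders are assumptions; any name works as long as it matches what you set in your shell.

```javascript
// Reads the API key from an environment object (process.env by default)
// and returns the headers for the embeddings request.
function buildHeaders(env = process.env) {
  const apiKey = env.OPENAI_API_KEY; // assumed variable name
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`,
  };
}

// Example with an explicit stand-in environment instead of the real one.
console.log(buildHeaders({ OPENAI_API_KEY: "sk-placeholder" }).Authorization);
// prints Bearer sk-placeholder
```

Failing early when the key is missing gives a clearer error than an authorization failure from the API.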

Creating the Function:


Write an asynchronous function named "createEmbedding" that accepts a single parameter named "text." This parameter is the text to embed. The function handles the API request to OpenAI.

Making the API Request:

Within the function, use fetch to send a POST request to version 1 of the OpenAI embeddings API ("/v1/embeddings"). Specify the request method as "POST," the headers defined earlier, and the request body. Use JSON.stringify() to convert the post data into a JSON string. The post data contains the model (for example, text-embedding-ada-002) and the input (the text to embed).

Handling the Response:

Check the response from the server to confirm that it succeeded. If it did, use the response.json() method to parse the response into a JSON object, extract the embedding data from that object, log it, and return it as the function's output.
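Putting the pieces together, a minimal version of the function might look like this. It assumes Node.js 18 or later (for the built-in fetch) and an OPENAI_API_KEY environment variable; the options parameter with fetchImpl is an addition for this sketch so that the function can be exercised without a network call, and is not part of the original walkthrough.

```javascript
// Sends the text to the embeddings endpoint and resolves with the vector.
async function createEmbedding(text, options = {}) {
  const apiKey = options.apiKey || process.env.OPENAI_API_KEY; // assumed variable name
  const fetchImpl = options.fetchImpl || globalThis.fetch; // built in from Node.js 18

  const response = await fetchImpl("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
  });

  // Stop early on HTTP errors such as an invalid key or a rate limit.
  if (!response.ok) {
    throw new Error(`OpenAI API request failed with status ${response.status}`);
  }

  const json = await response.json();
  return json.data[0].embedding;
}

// Usage (makes a real API call, so it is left commented out here):
// createEmbedding("Hello world").then((embedding) => console.log(embedding));
```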

Testing the Function:


To test the function, call "createEmbedding" with the desired text in quotation marks (e.g., "Hello world") and run "node index.js" in the terminal. The resulting embedding is returned, just like the one obtained through Postman. This embedding data can then be used for many purposes, including storage in a database.

Conclusion


OpenAI Embeddings and Vector Databases represent a substantial breakthrough in natural language processing, elevating the accuracy and efficiency of text-related tasks and creating new avenues for intelligent systems. It is crucial to remain mindful of data quality, privacy concerns, and ethical implications while utilizing these technologies. With continued research and development, OpenAI Embeddings and Vector Databases have the potential to reshape how we interpret and make sense of textual data, marking a significant advancement in the field of natural language processing.
