Introduction

As a web developer, you know that natural language processing (NLP) is becoming increasingly important for enhancing the user experience of your website. Whether it's analyzing customer reviews, generating product descriptions, or providing personalized recommendations, NLP can help you extract valuable insights from text data and create more engaging interactions with your customers.

In this blog post, we'll walk through how to use LangChain and GPT-3 with data from an e-commerce database to create a Q&A assistant. LangChain is a powerful library that helps you connect your data with Large Language Models (LLMs), while GPT-3 is a state-of-the-art language model that can generate high-quality text. Together, they let you turn raw product records into a conversational assistant for your customers.

Whether you're an experienced data scientist or a developer just getting started with NLP, this post will provide you with a practical guide to using LangChain and GPT-3 with database data. So, let's get started!

Setting Up Your Environment

Before we can get started with using LangChain and GPT-3, we need to make sure that we have all the necessary libraries and dependencies installed. In this step, we'll walk you through the process of setting up your environment.

  1. First, you'll need to make sure that you have Conda installed on your machine. Conda is a package manager that allows you to create virtual environments and manage dependencies. If you don't have Conda installed, you can download it from the official website: https://docs.conda.io/en/latest/miniconda.html

  2. Once you have Conda installed, open a terminal or command prompt and create a new environment for this project. You can do this by running the following command:

    conda create --name gpt-3-assistant python=3.10

    This will create a new environment called "gpt-3-assistant" with Python 3.10 installed.

  3. Activate your new environment by running the following command:

    conda activate gpt-3-assistant
  4. Next, we need to install the necessary libraries and dependencies. You can install them all at once by running the following command:

    pip install langchain transformers openai mysql-connector-python faiss-cpu pandas jupyterlab
  5. Finally, launch JupyterLab by running the following command:

    jupyter lab

    By following these steps, you should now have a working environment set up.
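Before moving on, it's worth confirming that the key packages import cleanly inside the new environment. This quick check is optional and not part of the setup steps above:

# run in a notebook cell (or the Python REPL) inside the "gpt-3-assistant" environment
import langchain
import openai
import faiss
import pandas
import mysql.connector

print("Environment ready, LangChain version:", langchain.__version__)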

Creating a MySQL Database and Inserting Dummy Data

Before we can fetch the product data from a database, we need a database with tables that contain the product information. In this step, we'll show you how to create a MySQL database and tables for clothing products, and insert some dummy data into those tables.

First, let's create a new MySQL database called 'my_shop'. You can use any MySQL client to create the database, or you can use the following code to create it using Python:

import mysql.connector

mydb = mysql.connector.connect(
  host="your_host",
  user="your_username",
  password="your_password"
)

mycursor = mydb.cursor()
mycursor.execute("CREATE DATABASE my_shop")

Now that we have a database, let's create two tables: 'products' and 'product_properties'. The 'products' table will contain the product names, descriptions, prices, URLs, and categories. The 'product_properties' table will contain the properties of each product, such as size and color. Here's the code to create the tables:

mycursor.execute("""CREATE TABLE my_shop.products (
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  product_name VARCHAR(255) NOT NULL,
  description TEXT,
  price DECIMAL(10,2) NOT NULL,
  category VARCHAR(255) NOT NULL,
  url VARCHAR(255) NOT NULL
);
""")

mycursor.execute("""CREATE TABLE my_shop.product_properties (
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  product_id INT(11) NOT NULL,
  property_name VARCHAR(255) NOT NULL,
  property_value VARCHAR(255) NOT NULL,
  FOREIGN KEY (product_id) REFERENCES my_shop.products(id) ON DELETE CASCADE
);
""")

Now that we have the tables, let's insert some dummy data into them. Here's some sample Python code to insert some products and their properties:

import mysql.connector

mydb = mysql.connector.connect(
  host="your_host",
  user="your_username",
  password="your_password",
  database="my_shop"
)

mycursor = mydb.cursor()

# Insert some products
sql = "INSERT INTO products (product_name, description, price, category, url) VALUES (%s, %s, %s, %s, %s)"
values = [
  ('T-Shirt', "Looking for a comfortable and stylish t-shirt to add to your wardrobe? Check out our large red cotton t-shirt! Made from soft and breathable 100% cotton, this shirt is perfect for everyday wear or for dressing up for a special occasion.<br>The vibrant red color is eye-catching and bold, making it a great choice for making a statement or standing out in a crowd. And because it's machine-washable, it's easy to care for and keep looking great.<br>Whether you're running errands, hanging out with friends, or hitting the town for a night out, this t-shirt is a versatile and comfortable option that you'll love wearing again and again.", 19.99, 'Clothing', '/red-t-shirt'),
  ('Jeans', "Looking for a timeless and versatile addition to your wardrobe? Check out our classic blue denim jeans in size 32x32! Made from high-quality denim, these jeans are comfortable, durable, and designed to last.<br>The deep blue color is perfect for any occasion, from casual outings to more formal events. And with a size of 32x32, they're the perfect fit for many body types.<br>These jeans feature a classic straight leg design and a mid-rise waistline, making them a flattering and comfortable option for everyday wear. And with their sturdy construction and reinforced stitching, they're built to withstand even the toughest wear and tear.<br>Whether you're dressing up or dressing down, these blue denim jeans are a versatile and stylish choice that will never go out of fashion.", 49.99, 'Clothing', '/blue-jeans'),
  ('Hoodie', "Stay warm and stylish this season with our black hoodie in size medium! Made from soft and cozy fabric, this hoodie is the perfect addition to your casual wardrobe.<br>The classic black color and versatile design make it easy to pair with any outfit, while the warm and cozy material ensures you'll stay comfortable even on the chilliest days.<br>This hoodie features a comfortable and flattering fit in size medium, with a roomy hood and front pocket for added convenience. Whether you're running errands, heading to the gym, or just lounging at home, this hoodie is the perfect choice for a cozy and comfortable look.", 29.99, 'Clothing', '/black-hoodie')
]
mycursor.executemany(sql, values)

# Insert some product properties
sql = "INSERT INTO product_properties (product_id, property_name, property_value) VALUES (%s, %s, %s)"
values = [
  (1, 'Size', 'Large'),
  (1, 'Color', 'Red'),
  (2, 'Size', '32x32'),
  (2, 'Color', 'Blue'),
  (3, 'Size', 'Medium'),
  (3, 'Color', 'Black')
]
mycursor.executemany(sql, values)

mydb.commit()

By following these steps, you should now have a MySQL database with tables for clothing products, populated with some dummy data.
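If you'd like to confirm the inserts worked before moving on, a quick count query is enough (this check is optional and not part of the walkthrough itself):

mycursor.execute("SELECT COUNT(*) FROM products")
print("Products inserted:", mycursor.fetchone()[0])  # expected: 3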

Fetch Data from MySQL Database and Create DataFrame

Now that we have established a connection to the MySQL database and created the necessary tables, we can fetch the product data and create a pandas DataFrame. This will allow us to manipulate and analyze the data.

Here is the code to fetch the data and create a DataFrame:

import pandas as pd

# create a cursor object to execute SQL queries
cursor = mydb.cursor()

# fetch all products and their properties from the database
query = """
SELECT p.*, pp.property_name, pp.property_value
FROM my_shop.products p
LEFT JOIN my_shop.product_properties pp ON p.id = pp.product_id
"""
cursor.execute(query)

# create a pandas DataFrame from the fetched data
columns = [col[0] for col in cursor.description]
data = cursor.fetchall()
df = pd.DataFrame(data, columns=columns)

# close cursor and database connection
cursor.close()
mydb.close()
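Because of the LEFT JOIN, each product appears once per property row (the T-Shirt, for example, shows up twice: once for Size and once for Color). Purely as an optional sanity check, a quick look at the DataFrame confirms this shape before we reshape it in the next step:

# 3 dummy products x 2 properties each = 6 rows, 8 columns
print(df.shape)
print(df.head())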

Transforming the DataFrame into a Usable Format

We will now combine each product's rows into a single string so it matches the LangChain Document format. By doing so, we can more easily feed the data into LangChain and extract valuable insights from it. We will also keep the URL of each product, which will serve as metadata in the LangChain Document format and let us trace answers back to their source.

product_data = df.groupby('id').agg({
    'product_name': lambda x: x.iloc[0],
    'description': lambda x: x.iloc[0],
    'price': lambda x: x.iloc[0],
    'category': lambda x: x.iloc[0],
    'property_name': lambda x: list(x),
    'property_value': lambda x: list(x),
    'url': lambda x: x.iloc[0]
}).reset_index()

product_data['content'] = product_data.apply(
    lambda x: f"Item is {x['product_name']}. {x['description']}. Price is ${x['price']}. Category is {x['category']}. " + 
              " ".join([f"{prop_name} is {prop_val}." for prop_name, prop_val in zip(x['property_name'], x['property_value'])]),
    axis=1
)

product_data = product_data[['content', 'url']]

Now that we have a DataFrame that contains the product data in a usable format, we can create a new Document object for each row of the DataFrame. We will then append each Document object to an array, which we can use later for analysis.

Here's the code to create a new Document object for each row of the DataFrame and append it to an array:

from langchain.docstore.document import Document

sources = []
# iterate over each row of the dataframe
for index, row in product_data.iterrows():
    doc = Document(
        page_content=row["content"],
        metadata={"source": row["url"]},
    )
    # append the Document object to the list
    sources.append(doc)

As we work with our documents, it's important to keep in mind that they may end up quite lengthy. Since the OpenAI API has a token limit of approximately 4,000 tokens, we need to be prepared to handle documents that exceed this limit. Fortunately, LangChain offers a set of helpers that allow us to easily chunk our content using delimiters.
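If you want to see how close a given document is to that limit, you can count its tokens with OpenAI's tiktoken package. This is an optional extra (it wasn't included in the pip install above), shown here only as a rough illustration:

import tiktoken

# the tokenizer used by gpt-3.5-turbo
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

for source in sources:
    token_count = len(encoding.encode(source.page_content))
    print(source.metadata["source"], "->", token_count, "tokens")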

We will be using the RecursiveCharacterTextSplitter.

from langchain.text_splitter import RecursiveCharacterTextSplitter

chunks = []
splitter = RecursiveCharacterTextSplitter(
    separators=["\n", ".", "!", "?", ",", " ", "<br>"],
    chunk_size=1024,
    chunk_overlap=0
)
for source in sources:
    for chunk in splitter.split_text(source.page_content):
        chunks.append(Document(page_content=chunk, metadata=source.metadata))

Creating the Embeddings

First, let's create embeddings with OpenAI. Embeddings are vector representations of text that capture the semantic meaning of words and phrases, which allows us to compare and measure the similarity between different pieces of text.

For example, if we have two product descriptions that use different words to describe similar features, embeddings can still recognize that they are talking about the same thing. This is because the embeddings are based on the meaning of the words and phrases, rather than just their surface-level representation.

We then store these embeddings in a vector database because it allows us to perform similarity searches efficiently. With the embeddings stored as vectors, we can use measures like cosine similarity to find similar items in our database. This is particularly useful when we want to find products that match a given query, or when we want to recommend similar items to a user.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS

index = FAISS.from_documents(chunks, OpenAIEmbeddings(openai_api_key='xxx'))
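To sanity-check the index, you can run a quick similarity search directly against it. The example query below is our own illustration, not part of the original flow:

# return the two chunks closest in meaning to the query
results = index.similarity_search("breathable cotton shirt for warm weather", k=2)
for doc in results:
    print(doc.metadata["source"], "->", doc.page_content[:80])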

Prompt

Prompting is a crucial aspect of using GPT-3, as it determines the nature of the response that will be generated. However, crafting effective prompts can be a challenge in and of itself, requiring a deep understanding of language and the specific task at hand. Fortunately, LangChain provides default prompts that can be used as a starting point for generating responses to various inputs. While these prompts may not be perfectly suited to every use case, they are a useful tool for getting started with GPT-3, and can be fine-tuned or customized as needed. In this blog post, we will use LangChain's default prompt with one small tweak: we ask the model to answer like a fashion sales representative.

from langchain.prompts import PromptTemplate

template = """Act like you are a fashion sale representative. Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). 
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.

QUESTION: {question}
=========
{summaries}
=========
FINAL ANSWER:"""
PROMPT = PromptTemplate(template=template, input_variables=["summaries", "question"])

Generating Answers using GPT-3

Now that we have completed all the necessary preparations, we can finally feed our data to LangChain and prompt the AI with questions related to our product data. Let's see how GPT-3 responds.

from langchain.llms import OpenAIChat
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

chain = load_qa_with_sources_chain(OpenAIChat(openai_api_key='xxx', temperature=0, model_name="gpt-3.5-turbo"), prompt=PROMPT)

def answer(question):
    print(
        chain(
            {
                "input_documents": index.similarity_search(question, k=4),
                "question": question,
            },
            return_only_outputs=True,
        )["output_text"]
    )

answer("It's really hot outside, what product would you recommend?")
If it's really hot outside, we recommend our large red cotton t-shirt. It's made from soft and breathable 100% cotton, which will keep you cool and comfortable. The vibrant red color is eye-catching and bold, making it a great choice for making a statement or standing out in a crowd. And because it's machine-washable, it's easy to care for and keep looking great. You can find it in the Clothing category for $19.99. Source: /red-t-shirt

After providing our data to GPT-3, we can see how it selects the most relevant chunks based on the question and generates a coherent response, complete with a source reference. This showcases how well the model can turn the structured data we supplied into a helpful, customer-facing answer.
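You can keep probing the assistant with other product-related questions in exactly the same way. The examples below are our own suggestions; outputs are omitted since the exact wording will vary from run to run:

answer("Do you have any jeans in size 32x32, and how much do they cost?")
answer("What colors and sizes does the hoodie come in?")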

Conclusion

In this blog post, we explored how to use LangChain to perform similarity searches on a MySQL database containing product data, and then use OpenAI's GPT-3 to generate natural language responses to user queries.

We started by connecting to the database, retrieving the data, and transforming it into LangChain's Documents. We then used LangChain to convert the text into embeddings, and stored them in a vector database for efficient similarity search.

Next, we used OpenAI's GPT-3 to generate natural language responses to user queries. We showed how to start from the default prompt provided by LangChain and customize it for our use case.

Finally, we combined everything and showed how to prompt the model with a user question and retrieve a relevant, source-grounded answer from our data. Keep in mind that the approach has limitations, including the need to tune the prompt and retrieval settings for better performance, and the size of the database you can reasonably embed and search.

Overall, LangChain and OpenAI's GPT-3 offer powerful tools for processing and generating natural language responses to user queries. By following the steps outlined in this blog post, you can leverage these technologies to build sophisticated natural language applications that understand and respond to user queries in real time.