Build a smart product data generator from image with GPT-4o and Langchain

Jun 25, 2024 · 8 min read

When listing new products in an online store, owners and marketers often find it too time-consuming to fill in essential information such as the title, description, and tags for each product from scratch. Much of this information can be derived from the product image itself. With the right combination of LLM tooling, such as Langchain and OpenAI, we can automate writing a product's information from an input image, which is the focus of today's post.

Brief introduction about Langchain and OpenAI

Langchain is a framework for building and running LLM-powered functions. It provides a common interface for integrating with the APIs and services of different LLM (large language model) providers such as OpenAI and Hugging Face. It also offers an extensible architecture for creating and managing custom chains (pipelines), agents, and workflows tailored to your specific needs.

OpenAI is a leading AI research lab that has developed several powerful models, including the GPT-3 and GPT-4 LLMs and the DALL-E image generator. These models can generate human-like text and media from an input prompt, making them suitable for a wide range of applications, from chatbots to content and image generation.

Setting up Langchain and OpenAI

In this post, we will use OpenAI's GPT-4o model for image analysis and text completion, along with the following Langchain Python packages:

  • langchain-openai - A package that provides a simple interface to interact with the OpenAI API.
  • langchain-core - The core package of Langchain that provides the necessary tools to build your AI functions.

To install these packages, use the following command:

python -m pip install langchain-openai langchain-core
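
To confirm that the installation worked, a quick check like the sketch below can help. It relies only on the standard library's importlib.metadata module; the installed_version helper name is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of the two packages used in this post.
for name in ("langchain-openai", "langchain-core"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```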

Next, let's define the flow of how we generate product information based on a given image.

The flow of generating product data

Our tool will perform the following steps upon receiving an image path from the user:

  1. Load the given product image into base64 data URI text format.
  2. Ask GPT to analyze and generate the required product's metadata based on such data.
  3. Extract the result from GPT in a structured Product format.

The diagram below shows what our workflow looks like:

Diagram flow of generating product data

With this flow in mind, let's walk through each step's implementation in detail.

Step 1: Load a product image into base64 format

Before we can ask GPT to generate a product's metadata from a given image, we need to convert the image into a format that GPT can accept, which is a base64 data URI. The code in this post reads the image from a local file path. To do so, we will create an image.py file with the following code:

import base64

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

The above encode_image function takes an image_path, opens the file and reads it as bytes, and then returns the base64-encoded text version.
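
As a quick sanity check, the sketch below writes a few placeholder bytes to a temporary file, encodes them with the same helper, and decodes the text again. The bytes are not a real image; they only stand in for one so the round trip can be verified without any assets.

```python
import base64
import os
import tempfile


def encode_image(image_path):
    """Same helper as above: read a file and return its base64 text."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Placeholder content standing in for image data (an assumption for this demo).
sample_bytes = b"\xff\xd8\xff\xe0 placeholder bytes"
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(sample_bytes)
    tmp_path = tmp.name

try:
    encoded = encode_image(tmp_path)
finally:
    os.remove(tmp_path)  # clean up the temporary file

# Decoding the text gives back exactly the bytes that were written.
decoded = base64.b64decode(encoded)
print(decoded == sample_bytes)
```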

We then write a load_image function, which performs the following:

  • Receives inputs as a dictionary containing an image_path key with the path to the image file.
  • Encodes the file at inputs["image_path"] into base64 text using the encode_image function.
  • Assigns the result to the image key of the returned dictionary.

The code is as follows:

def load_image(inputs: dict) -> dict:
    """Load image from file and encode it as base64."""
    image_file = inputs["image_path"]
    image_base64 = encode_image(image_file)
    return {
        "image": image_base64
    }

Now we have the image processing step implemented. Next, we will create a function to communicate with GPT for the information desired based on this image data.

Step 2: Ask GPT to generate a product's metadata

In this step, since we are going to send requests to the GPT API, we need to set up an API key for the Langchain OpenAI package to pick up when it initializes the service.

Setting up OpenAI API key

The most straightforward way is to create a .env file with an OPENAI_API_KEY variable, whose value can be found under the Settings panel, as shown below:

Screenshot of how to retrieve API key in OpenAI Panel
OPENAI_API_KEY=your-open-ai-api-key

Then, we install the python-dotenv package using the command below:

python -m pip install python-dotenv

And in our generate.py file, we add the following code to load the key from the .env file into our project for usage:

import os
from dotenv import load_dotenv

load_dotenv()
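
Since a missing key only surfaces later as an authentication error, it can be useful to fail early. The sketch below is one way to do that after load_dotenv() has run; require_api_key is a hypothetical helper for this post, not part of python-dotenv or Langchain.

```python
import os


def require_api_key(env=None) -> str:
    """Return the OpenAI API key, or raise a clear error if it is not set."""
    # Accept a custom mapping so the check is easy to test; default to os.environ.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
    return key
```

Calling require_api_key() right after load_dotenv() stops the script before any request is sent if the variable is absent.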

And with that, we can implement the function that will invoke the GPT model for answers.

Creating a model to process the image and prompt

In generate.py, we create a function image_model that takes inputs as a dictionary containing two fields, image and prompt, where image is the base64-encoded text from step 1.

def image_model(inputs: dict):
    """Invoke model with image and prompt."""

    image = inputs["image"]
    prompt = inputs["prompt"]

From the given inputs, we build a user message to pass to the model. To do so, we use the HumanMessage class from the langchain_core.messages package:

from langchain_core.messages import HumanMessage

message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url", 
            "image_url": {
                "url": f"data:image/jpeg;base64,{ image }"
            }
        },
    ]
)

In the above code, we pass to HumanMessage an array of content containing:

  • A text object with the prompt text
  • An image_url object with the base64-encoded image data as the URL
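
The message above hard-codes image/jpeg as the MIME type. If your product images may also be PNG files, a sketch like the following could pick the type from the file extension using the standard mimetypes module. The to_data_uri helper is an assumption for illustration, not part of Langchain or the OpenAI SDK.

```python
import mimetypes


def to_data_uri(image_base64: str, image_path: str) -> str:
    """Build a base64 data URI, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        # Fall back to the type used in the original message.
        mime_type = "image/jpeg"
    return f"data:{mime_type};base64,{image_base64}"


print(to_data_uri("QUJD", "photo.png"))
```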

Once we have the message ready, we initialize a ChatOpenAI model instance using gpt-4o, a temperature of 0.5, and a maximum of 1024 tokens:

from langchain_openai import ChatOpenAI

def image_model(inputs: dict):
    """Invoke model with image and prompt."""

    #... previous code
    model = ChatOpenAI(temperature=0.5, model="gpt-4o", max_tokens=1024)

And invoke the model with the message and return the content of the response, as follows:

def image_model(inputs: dict):
    #... previous code
    result = model.invoke(message)
    return result.content

At this stage, we have the content of the response from GPT. In the next step, we will extract that content in a structured Product format.

Step 3: Extract the result from GPT in a structured Product format

The response from GPT is plain text, so we need to parse it and extract the relevant information into a structured Product format. Doing this by hand is not straightforward. Fortunately, Langchain provides several tools to help with this task, starting with defining the output structure.
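
To see why hand-rolled parsing gets awkward, consider the rough sketch below. It pulls a JSON object out of a reply that also contains surrounding chatter. The sample_reply text is invented for this example and is not real model output, and extract_json is a hypothetical helper.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the JSON object between the first '{' and the last '}' in a reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the reply")
    return json.loads(reply[start : end + 1])


# Invented reply text, used only to exercise the helper.
sample_reply = (
    "Here is the product data:\n"
    '{"title": "Sample mug", "description": "A placeholder.", "tags": ["mug"]}\n'
    "Let me know if you need anything else."
)
print(extract_json(sample_reply)["title"])
```

This approach breaks as soon as the reply contains more than one object or malformed JSON, which is where a dedicated output parser becomes the more comfortable option.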

Define the Product structure

We will define a Product class as a Pydantic model using BaseModel and Field from the langchain_core.pydantic_v1 package, as shown below:

# Product.py
from langchain_core.pydantic_v1 import BaseModel, Field

class Product(BaseModel):
    '''Product description'''
    title: str = Field(..., title="Product Title", description="Title of the product")
    description: str = Field(..., title="Product Description", description="Description of the product")
    tags: list = Field([], title="Product Tags", description="Tags for SEO")

The above class defines a Product model with the following fields:

  • title - The title of the product
  • description - The description of the product
  • tags - The tags for SEO

Next, we create a parser that will extract the GPT response into the Product structure.

Create a function to extract the product information

We can use the JsonOutputParser class to create a custom parser by passing our Product structure as its pydantic_object, as follows:

from langchain_core.output_parsers import JsonOutputParser

#... previous code
parser = JsonOutputParser(pydantic_object=Product)

Great. All that is left is to modify our content array from Step 2 to include the parser's format instructions by adding the following element to the array:

content = [
    #... previous code
    {"type": "text", "text": parser.get_format_instructions()},
    {
        "type": "image_url", 
       # ... code
    },
]
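
Put together, the full content array might look like the sketch below. The build_content helper is a hypothetical name used here to show the final shape in one place; its format_instructions argument stands for the text returned by parser.get_format_instructions().

```python
def build_content(prompt: str, image_base64: str, format_instructions: str) -> list:
    """Assemble the message content: prompt, format instructions, then the image."""
    return [
        {"type": "text", "text": prompt},
        {"type": "text", "text": format_instructions},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
        },
    ]


# Placeholder values, only to show the resulting structure.
content = build_content("Describe this product.", "QUJD", "Return a JSON object.")
print([part["type"] for part in content])
```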

And with that, all the components of the flow are ready. It's time to chain them together.

Chaining all the steps together using Langchain

A chain is similar to a train of carriages, where each carriage is a step such as an LLM call, a data transformation, or any other tool, all connected together and supporting streaming, async, and batch processing out of the box. In our case, we will use TransformChain to transform our image_path input into base64 data as a pre-processing step of the main flow.

from langchain.chains import TransformChain

load_image_chain = TransformChain(
    input_variables=['image_path'],
    output_variables=["image"],
    transform=load_image
)

From there, we create another chain, generate_product_chain, that connects all the flow components using the | operator. It starts by loading and transforming the image path into base64 text, then passes that output as the input to our image model to generate the desired data, and finally parses the result into our target Product format:

generate_product_chain = load_image_chain | image_model | parser
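
To build intuition for what the | operator does here, the toy sketch below composes steps in plain Python. It is only an illustration of the idea: the Step class is invented for this post and is not how Langchain implements its runnables.

```python
class Step:
    """A tiny stand-in for a chainable step (not Langchain's Runnable)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def describe(inputs: dict) -> str:
    """Plain function standing in for the model call."""
    return f"{inputs['prompt']} ({inputs['image']})"


# Stand-ins for the image loader and the output parser.
load = Step(lambda inputs: {**inputs, "image": "<base64 text>"})
shout = Step(str.upper)

pipeline = load | describe | shout
print(pipeline.invoke({"image_path": "mug.jpg", "prompt": "describe"}))
# → DESCRIBE (<BASE64 TEXT>)
```

Note how the plain describe function is wrapped automatically when it appears on the right-hand side of |, which mirrors how image_model can sit in the middle of our chain.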

Finally, we define get_product_info function to invoke the chain with the initial input image_path and prompt as follows:

def get_product_info(image_path: str) -> dict:
    generate_product_chain = load_image_chain | image_model | parser

    prompt = """
    Given the image of a product, provide the following information:
    - Product Title
    - Product Description
    - At least 13 Product Tags for SEO purposes
    """

    return generate_product_chain.invoke({
        'image_path': image_path,
        'prompt': prompt
    })

And that's it! We have successfully built a smart product information generator. You can now use the get_product_info function to generate product information by giving it a valid image path:

product_info = get_product_info("path/to/image.jpg")
print(product_info)
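
As far as I know, JsonOutputParser returns a plain dictionary and uses the Pydantic model mainly to produce format instructions, so the result is not guaranteed to match the Product shape. A light check like the sketch below can catch incomplete answers before they reach your store. The check_product helper and the example dictionary are assumptions for illustration.

```python
def check_product(data: dict) -> dict:
    """Verify that a parsed result has a title, a description, and string tags."""
    for key in ("title", "description"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"Missing or empty field: {key}")
    tags = data.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(tag, str) for tag in tags):
        raise ValueError("tags must be a list of strings")
    return {
        "title": data["title"].strip(),
        "description": data["description"].strip(),
        "tags": tags,
    }


# Placeholder dictionary standing in for the value returned by get_product_info.
example = {"title": " Mug ", "description": "A mug.", "tags": ["mug"]}
print(check_product(example))
```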

Summary

In this post, we have explored how to generate essential product data such as the title, description, and tags from a given image using Langchain and OpenAI's GPT-4o. We have walked through the flow, including loading an image into base64 text format, asking GPT to generate a product's metadata, and extracting the result from GPT in a structured Product format. We have also seen how to chain all the steps together using Langchain to create a working product information generator.

In the next post, we will explore how to deploy this tool as a web service API using Flask. Until then, happy coding!
