Building Multimodal AI in TypeScript

January 24, 2024 · 12 min read

Developer Advocate

Cover, Building Multimodal AI in TypeScript

For a lot of people, Multimodal AI and its derivatives are strong contenders for technology of the year. The promises of Multimodal applications of artificial intelligence are exciting and speak to behavioral changes in media consumption and general interaction with the internet. I'm here to show you how you can get a piece of the pie. In this article, we’ll look at how to build Multimodal applications in TypeScript and dive into everything that needs to happen in between.

What are modalities in AI

A modality is a particular mode in which something exists or is experienced or expressed.

For machine learning programs, we use modality to describe the type of input or output that a machine learning model can process or interact with to run.

Single Modality

Historically, search has been over text, emphasizing the relevance of keyword search. This type of search was focused on matching text search terms to large datasets of text and returning the most relevant. This was the single modality; text search through data and get back text.

Even with the onset of multimedia; images, videos and audio. We leveraged metadata and attached text to these more complex modalities and essentially ran keyword searches. Still a single-modality search.

Early versions of large language models followed suit. With most being single modality models, with a single model inference.

Single Modality Architecture

Multi-Modality

In today's world, not only are people optimizing for TikTok searches but the video hosting service is the search engine of choice for a majority of Gen Z. Multimodality enables search queries with multiple media types over multiple media types. Enabling you to text search over datasets of video or mixed media.

Mostly enabled by vector and hybrid search, Vector databases by design are uniquely adapted to storing, indexing and enabling efficient retrieval for multimodal use cases. On the machine learning model side of things, we’re seeing the rise of multimodal models that can infer numerous modalities.

This means we can interact with an LLM via chat and have the LLMs generate text, audio or video. By definition, this would enable interaction with similar models via text or even video.

Multimodal Architecture

Building our own Multimodal Search

Now that we have a better understanding of multi-modal search, let’s build a search application with Weaviate and Next.js

Specwise, we want to be able to run a text search through image, video and audio data without leveraging the file's metadata.

Requirements

You need the following to go through with this tutorial.

An LTS version of Node.js
Docker
Git

The project that this tutorial is based on is available on GitHub if you’d like to give it a try before going through with the tutorial.

Create your Next.js application that comes with TypeScript, Tailwind CSS and the App Router with the following command.

create-next-app <project-name> –ts –tailwind –app

Step 1: Getting Weaviate Running

To get started with Weaviate, we'll create a Weaviate instance on Weaviate Cloud Services as described in this guide. Weaviate is an AI-Native database. It gives you the flexibility to pick what embedding models you use. Embedding models come in all shapes and sizes, for this project, you will be using Vertex's multimodal embedding models.

Once set up add your Weaviate URL, Admin API key and Vertex API key to a .env file in the root of your project.

Step 2: Importing Data

Next, we will create a Weaviate collection and import data into it. In your ./public folder, create three folders called image, audio, and video. These will store media corresponding to their folder names respectively. You can then add your data or use the data I added to build out the original application.

We then install the Weaviate TypeScript client and other project dependencies with the following command.

yarn add weaviate-client use-debounce dotenv

To import our data, we need to create a folder called import where our import scripts will go.

To start, create a file called client.ts and paste the following code in it to initialize the Weaviate client.

import weaviate, { type WeaviateClient } from 'weaviate-client';
import 'dotenv/config'

let client: WeaviateClient;

export const getWeaviateClient = async () => {
  if (!client) {
    client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL || '',{
    authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY || ''),
    headers: {
      'X-Palm-Api-Key': process.env.GOOGLE_API_KEY || '',  // Replace with your inference API key
    }
  }
)
  };
  
 return client;
}

Next, still in ./src we create a file called util.ts and paste the following code in it.

import { readdirSync, readFileSync } from 'fs'

export interface FileInfo {
    name: string;
    path: string;
}

export const listFiles = (path: string): FileInfo[] => {
    return readdirSync(path).map((name) => {
        return {
            name: name,
            path: `${path}${name}`,
        }
    });
}

export const getBase64 = (file: string) => {
    return readFileSync(file, { encoding: 'base64' });
}

This file defines base64 as the encoding for the files we are importing and lists all the files in a directory.

We then create a file called collection.ts, where we define our Weaviate collection and its properties.

import weaviate, { type WeaviateClient } from 'weaviate-client';
import { getWeaviateClient } from './client';

const client: WeaviateClient = await getWeaviateClient();

const collectionExists = async (name: string) => {
  return client.collections.exists(name);
}

export const createCollection = async (name: string) => {
  if(await collectionExists(name)) {
    console.log(`The collection [${name}] already exists. No need to create it.`);
    return;
  }
  
  console.log(`Creating collection [${name}].`);

  const newCollection = await client.collections.create({
    name: name,
    vectorizers: weaviate.configure.vectorizer.multi2VecPalm({
      projectId: 'semi-random-dev',
      location: 'us-central1',
      imageFields: ['image'],
    }),
    generative: weaviate.configure.generative.openAI(),
    properties: [
      {
        name: 'name',
        dataType: 'text',
      },
      {
        name: 'image', 
        dataType: 'blob',
      },
    ]
  })
  
  console.log(JSON.stringify(newCollection, null, 2));
}

export const deleteCollection = async (name: string) => {
  console.log(`Deleting collection ${name}...`);
  await client.collections.delete(name);

  console.log(`Deleted collection ${name}.`);
}

This file contains a function to create a collection, a function to delete a collection, and a function to check if a collection exists. If there are any details we want to change that would define our collections, we can do so in this file.

Notably, we create a new collection and define media types of image, and video and to our multi2vecPalm module. We also add the following properties to our collection: name, image, and optionally you can add audio. Think of these as properties we will be able to query against when we start to make queries or import data to Weaviate.

Now we create a file called import.ts that uses a batching system to bulk import files.

import { WeaviateClient, generateUuid5, type Collection } from 'weaviate-client';
import { getWeaviateClient } from './client.ts';
import { getBase64, listFiles } from './util.ts';

const sourceBase = 'public';
const sourceImages = sourceBase + '/image/';
const sourceAudio = sourceBase + '/audio/'; // for Audio 
const sourceVideo = sourceBase + '/video/';

const client: WeaviateClient = await getWeaviateClient();


export const importMediaFiles = async (collectionName: string) => {

    const vertexCollection = client.collections.get(collectionName); 

    await insertImages(vertexCollection);
    // await insertAudio(vertexCollection);  # Uncomment to import Audio
    await insertVideo(vertexCollection);
}

const insertImages = async (myCollection: Collection) => {

    const batchSize = 10;
    let dataObject = [];

    const files = listFiles(sourceImages);
    console.log(`Importing ${files.length} images.`);

    let counter = 0

    for (let file of files) {
        console.log(`Adding ${file.name}`);
        
        const item = {
            name: file.name,
            extension: file.name.split('.')[1],
            image: getBase64(file.path),
            media: 'image',
        };

        dataObject.push(item);
        counter++

        if (counter % batchSize == 0) {
            await myCollection.data.insertMany(dataObject);

            dataObject = []
        }
    }

    if (counter % batchSize !== 0)
        await myCollection.data.insertMany(dataObject);
}
// Uncomment if you are using a model with an audio encoder
// const insertAudio = async (myCollection: Collection) => {

//     const batchSize = 5;
//     let dataObject = [];

//     const files = listFiles(sourceAudio);
//     console.log(`Importing ${files.length} audios.`);

//     let counter = 0;
//     for (let file of files) {
//         console.log(`Adding ${file.name}`);

//         const item = {
//             name: file.name,
//             extension: file.name.split('.')[1],
//             audio: getBase64(file.path),
//             media: 'audio',
//         };

//         dataObject.push(item);
//         counter++;

//         if (counter % batchSize == 0) {
//             await myCollection.data.insertMany(dataObject);
//             // Clear the dataObject array
//             dataObject = [];
//         }
//     }

//     if (counter % batchSize !== 0)
//         await myCollection.data.insertMany(dataObject);
// }

const insertVideo = async (myCollection: Collection) => {

    const batchSize = 1;

    const files = listFiles(sourceVideo);
    console.log(`Importing ${files.length} videos.`);

    console.log('meta', await myCollection.config.get())

    for (let file of files) {
        console.log(`Adding ${file.name}`);

        const item = {
            name: file.name,
            extension: file.name.split('.')[1],
            video: getBase64(file.path),
            media: 'video',
        };

        await myCollection.data.insert(item)
    }
}

This file reads out images, and video directories for media and then encodes them before sending them to Weaviate to be stored.

We then create a file called index.ts where we run all the files we created above.

import { createCollection, deleteCollection } from './collection';
import { importMediaFiles } from './import';

const collectionName = 'PalmMultimodalSearch';

const run = async () => {
  await deleteCollection(collectionName);
  await createCollection(collectionName);
  await importMediaFiles(collectionName);
}

run();

In index.ts we can change our collection name to our liking. When this file is run, we delete any existing collection that exists with the name we specify, create a new collection with the same name and then import all the media files in our .public folder.

To run our import process, we need to add the following scripts to our package.json file.

  …
"scripts": {
      …
    "import": "npx tsc && node import/index.js"
      …
  },
  …

We can now run yarn run import to start the import process. Depending on how much data you're importing, it can take a little time. While that runs, let's create our Search interface.

Step 3: Building Search Functionality

Now we need to create a couple of components that will go in our web application.

Create a ./components folder and add a file called search.tsx. Paste the following code in it.

'use client'

import { useSearchParams, usePathname, useRouter } from "next/navigation"
import { useDebouncedCallback } from "use-debounce"

export default function Search(
    { placeholder }: { placeholder: string},
    
    ) {
    const searchParams = useSearchParams()
    const pathname = usePathname()
    const { replace } = useRouter()

    const handleSearch = useDebouncedCallback((term: string) => {
        const params = new URLSearchParams(searchParams.toString())
        if (term) {
            params.set("search", term)
        } else {
            params.delete("search")
        }
        replace(pathname + "?" + params)
    }, 300);
    return (
        <div className="pt-20">
            <div className="relative">
                <label htmlFor="search" className="sr-only"> Search </label>
                <input
                    type="search"
                    id="search"
                    placeholder={placeholder}
                    onChange={(e) => handleSearch(e.target.value)}
                    defaultValue={searchParams.get("search")?.toString()}
                    className="w-[400px] h-12 rounded-md bg-gray-200 p-2 shadow-sm sm:text-sm"
                />
            </div>
        </div>
    )
}

We have an input tag that takes search terms from users. We then pass these search terms to our URL params and then use the replace method from next/navigation to update the URL. This will update the URL and trigger a new search when the user stops typing.

Optionally, you can import the footer and navigation components. We don’t really need them to demonstrate Multimodal search but they make the app look nicer!

Next in ./utils/action.ts, paste the following code.

"use server";

import weaviate from "weaviate-client";


export async function vectorSearch(searchTerm: string) {
  const client = await weaviate.connectToWeaviateCloud(process.env.WEAVIATE_URL || '',{
    authCredentials: new weaviate.ApiKey(process.env.WEAVIATE_API_KEY || ''),
    headers: {
      'X-Palm-Api-Key': process.env.GOOGLE_API_KEY || ''
    }
  },
);
    
const myCollection = client.collections.get('PalmMultimodalSearch');


const response = await myCollection.query.nearText(searchTerm, {
limit: 8,
returnMetadata: ['distance'],
})

return response
   
  }

Here we define a function vectorSearch(), which makes a call to Weaviate to search for the given search term. Weaviate lets us run a vector search from our PalmMultimodalSearch collection with nearText() and pass the user-entered search term as a query. We also define metadata we want to access in the response with returnMetadata and the number of results we want to return with limit. We can now call this server action from the client side of our application.

In the ./app folder, paste the following code in page.tsx,

import Footer from '@/components/footer.tsx';
import Search from '../components/search.tsx'
import { vectorSearch } from '@/utils/action.ts';

import Navigation from '@/components/navigation.tsx';

export default async function Home({
  searchParams
}: {
  searchParams?: {
    search?: string;
  }
}) {
  const search = searchParams?.search || "people";
  const data = await vectorSearch(search);

  return (
    <html lang="en">
      <body>
        <div className="fixed h-screen w-full bg-gradient-to-br from-lime-100 via-teal-50 to-amber-100" />
        
        <Navigation /> 

        <main className="flex min-h-screen w-full flex-col items-center justify-center py-32">

          <Search placeholder="Search for a word" />
          
          <div className="relative flex grid grid-cols-1 gap-4 lg:grid-cols-4 lg:gap-8 p-20">

            {data.objects.map((result) => (
              <div key={result.uuid} className="">

                <div className="h-40 w-50">
                  <div className="flex justify-between">
                  <p className="w-16 h-6 mt-2 ml-2 block text-center whitespace-nowrap items-center justify-center rounded-lg translate-y-8 transform  bg-white px-2.5 py-0.5 text-sm text-black">
                    { result?.properties.media }
                  </p>
                  <p className="w-24 h-6 mt-2 mr-2 block text-center whitespace-nowrap items-center justify-end rounded-lg translate-y-8 transform  bg-white px-2.5 py-0.5 text-sm text-black">
                    dist: { result.metadata?.distance?.toString().slice(0,6) }
                  </p>
                  </div>
                  {result?.properties.media == 'image' &&
                    <img
                      alt="Certainty: "
                      className='block object-cover w-full h-full rounded-lg'
                      src={
                        '/' + result.properties.media + '/' + result.properties.name
                      }
                    />
                  }

                  {result.properties.media == 'video' &&
                    <video controls src={
                      '/' + result.properties.media + '/' + result.properties.name
                    } className='block object-fit w-full h-full rounded-lg'>
                      Your browser does not support the video element.
                    </video>
                  }
                </div>
              </div>
            ))}
          </div>
        </main>

        <Footer />

      </body>
    </html>
  )
}

Here was call the vectorSearch() server action we created above and pass the query response to be displayed on the UI. Now we have our search working!

Final Result

Then you should be able to search like this

Conclusion

We just saw how to leverage a Multimodal AI model to power our Multimodal search in TypeScript with Next.js. A lot is happening in the space and this only just scratches the surface of Multimodal use cases. One could take this a step further and enable search via image, video or audio uploads. Another avenue worth exploring is Generative Multimodal models to either enhance results or interact with existing datasets. If this is something you find interesting or would love to continue a conversation on, find me online at malgamves.

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

Slack

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

What are modalities in AI​

Single Modality​

Multi-Modality​

Building our own Multimodal Search​

Requirements​

Step 1: Getting Weaviate Running​

Step 2: Importing Data​

Step 3: Building Search Functionality​

Final Result​

Conclusion​

Ready to start building?​