
Building Multimodal AI in TypeScript

Daniel Phiri


For a lot of people, Multimodal AI and its derivatives are strong contenders for technology of the year. The promise of Multimodal applications of artificial intelligence is exciting and speaks to behavioural changes in media consumption and in how people interact with the internet. I'm here to show you how you can get a piece of the pie. In this article, we’ll look at how to build Multimodal applications in TypeScript and dive into everything that needs to happen in between.

Modalities

A modality is a particular mode in which something exists or is experienced or expressed.

Single Modality

Historically, search has been over text, emphasizing the relevance of keyword search. This type of search focused on matching text search terms against large datasets of text and returning the most relevant results. This was a single modality: you searched through text data with text and got text back.

Even with the onset of multimedia (images, videos and audio), we leveraged metadata, attached text to these more complex modalities and essentially ran keyword searches. This was still single-modality search.

Early versions of large language models followed suit. Most were single-modality models, running inference over a single type of data.

Figure: Single Modality Architecture

Multi-Modality

In today's world, not only are people optimizing for TikTok searches, but the video hosting service is also the search engine of choice for a majority of Gen Z. Multimodality enables search queries with multiple media types over multiple media types, so you can, for example, run a text search over datasets of video or mixed media.

This is mostly enabled by vector and hybrid search. Vector databases are by design well suited to storing, indexing and efficiently retrieving data for multimodal use cases. On the machine learning side of things, we’re seeing the rise of multimodal models that can run inference across numerous modalities.

This means we can interact with an LLM via chat and have the LLM generate text, audio or video. It also means we can prompt similar models with text or even video.
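To make the retrieval idea above concrete, here is a minimal sketch of ranking items from different modalities against a text query by cosine similarity. It assumes every item has already been embedded into one shared vector space. The three-dimensional vectors, the file names and the EmbeddedItem type are made up for illustration, and a real vector database uses an approximate index rather than this brute-force sort.

```typescript
// Toy illustration: items of different media types share one embedding space,
// so a text query vector can be compared against all of them directly.
// The vectors below are invented; a real multimodal model would produce them.

interface EmbeddedItem {
  name: string;
  media: 'text' | 'image' | 'audio' | 'video';
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same dimension');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank items by similarity to the query vector, most similar first.
function rankBySimilarity(
  query: number[],
  items: EmbeddedItem[],
  limit: number,
): EmbeddedItem[] {
  return [...items]
    .sort((x, y) => cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector))
    .slice(0, limit);
}

const items: EmbeddedItem[] = [
  { name: 'dog.jpg', media: 'image', vector: [0.9, 0.1, 0.0] },
  { name: 'bark.wav', media: 'audio', vector: [0.8, 0.2, 0.1] },
  { name: 'traffic.mp4', media: 'video', vector: [0.0, 0.1, 0.9] },
];

// Hypothetical embedding of the text query "dog".
const queryVector = [1, 0, 0];
const top = rankBySimilarity(queryVector, items, 2);
```

Here the image and the audio clip rank above the video because their vectors point in nearly the same direction as the query, regardless of media type.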

Figure: Multimodal Architecture

Now that we have a better understanding of multimodal search, let’s build a search application with Weaviate and Next.js.

Spec-wise, we want to be able to run a text search through image, video and audio data without leveraging the files' metadata.

Requirements

You need the following to go through with this tutorial.

The project that this tutorial is based on is available on GitHub if you’d like to give it a try before going through with the tutorial.

Create your Next.js application that comes with TypeScript, Tailwind CSS and the App Router with the following command.

npx create-next-app <project-name> --ts --tailwind --app

Getting Weaviate Running

In the newly created folder, create a docker-compose.yml file and paste the following code into it.

version: '3.4'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:||site.weaviate_version||
    restart: on-failure:0
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'multi2vec-bind'
      ENABLE_MODULES: 'multi2vec-bind'
      BIND_INFERENCE_API: 'http://multi2vec-bind:8080'
      CLUSTER_HOSTNAME: 'node1'
  multi2vec-bind:
    image: cr.weaviate.io/semitechnologies/multi2vec-bind:imagebind
    environment:
      ENABLE_CUDA: '0'

Then run docker-compose up -d in your terminal to start an instance of Weaviate with ImageBind, our Multimodal model. This also enables us to use the multi2vec-bind module, the tool that enables us to store vector embeddings of video, text, audio and image data.
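The model container can take a while to load, so it may help to wait until Weaviate reports that it is ready before importing anything. The helper below is a sketch and not part of the original project: waitUntilReady and its defaults are hypothetical, and the probe shown in the comment assumes Weaviate's readiness endpoint at /v1/.well-known/ready on the port mapped above.

```typescript
// Poll a readiness probe until it succeeds or the attempts run out.
// `probe` is injected so the helper does not depend on a particular HTTP client.
async function waitUntilReady(
  probe: () => Promise<boolean>,
  maxAttempts = 30,
  delayMs = 2000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if (await probe()) return true;
    } catch {
      // The connection is typically refused while the container starts; keep retrying.
    }
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe for the instance started above (assumes a global fetch, Node 18+):
// const ready = await waitUntilReady(async () =>
//   (await fetch('http://localhost:8080/v1/.well-known/ready')).ok);
```

Passing the probe in as a function keeps the retry logic separate from the HTTP call, so either part can be changed without touching the other.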

Memory requirements

Make sure that you have enough memory allocated to Docker. The multi2vec-bind module loads a very large model, and you may need to allocate more memory (e.g. 12GB or higher) to Docker to avoid out-of-memory errors.

Importing Data

Next, we will create a Weaviate collection and import data into it. In your ./public folder, create three folders called image, audio, and video. Each will store the media type its name describes. You can then add your own data or use the data I added to build out the original application.

We then install the Weaviate TypeScript client and other project dependencies with the following command.

yarn add weaviate-ts-client use-debounce rimraf

To import our data, we need to create a folder called src where our import scripts will go.

To start, create a file called client.ts and paste the following code in it to initialize the Weaviate client.

import weaviate, { WeaviateClient } from 'weaviate-ts-client';

let client: WeaviateClient;

export const getWeaviateClient = () => {
  if (!client) {
    client = weaviate.client({
      scheme: 'http',
      host: 'localhost:8080',
    });
  }

  return client;
}

Next, still in ./src we create a file called util.ts and paste the following code in it.

import { readdirSync, readFileSync } from 'fs';

export interface FileInfo {
  name: string;
  path: string;
}

export const listFiles = (path: string): FileInfo[] => {
  return readdirSync(path).map((name) => {
    return {
      name: name,
      path: `${path}${name}`,
    }
  });
}

export const getBase64 = (file: string) => {
  return readFileSync(file, { encoding: 'base64' });
}

This file lists all the files in a directory and reads a file's contents as a base64-encoded string, the encoding we use for the media we import.
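One thing listFiles does not do is filter what it finds, so a stray file such as .DS_Store would be encoded and sent along with the media. A small guard like the following could be applied to the listing first. The helper name and the extension lists are assumptions for illustration, not a statement of what the multi2vec-bind module accepts.

```typescript
// Hypothetical allow-list of extensions per media folder (adjust to your data).
const allowedExtensions: Record<string, string[]> = {
  image: ['.jpg', '.jpeg', '.png'],
  audio: ['.wav', '.mp3'],
  video: ['.mp4'],
};

// Return true when the file name ends with an extension allowed for the media type.
function hasAllowedExtension(fileName: string, media: string): boolean {
  const extensions = allowedExtensions[media] ?? [];
  const lower = fileName.toLowerCase();
  return extensions.some((ext) => lower.endsWith(ext));
}

// Example directory listing containing two files that should be skipped.
const listing = ['beach.JPG', '.DS_Store', 'notes.txt', 'cat.png'];
const images = listing.filter((name) => hasAllowedExtension(name, 'image'));
```

With the sample listing, only beach.JPG and cat.png survive the filter, since the comparison ignores case.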

We then create a file called collection.ts, where we define our Weaviate collection and its properties.

import { WeaviateClient } from 'weaviate-ts-client';
import { getWeaviateClient } from './client';

const client: WeaviateClient = getWeaviateClient();

const collectionExists = async (name: string) => {
  return client.schema.exists(name);
}

export const createBindCollection = async (name: string) => {
  if (await collectionExists(name)) {
    console.log(`The collection [${name}] already exists. No need to create it.`);
    return;
  }

  console.log(`Creating collection [${name}].`);

  const bindSchema = {
    class: name,
    moduleConfig: {
      'multi2vec-bind': {
        imageFields: ['image'],
        audioFields: ['audio'],
        videoFields: ['video'],
      }
    },
    properties: [
      {
        name: 'name',
        dataType: ['text'],
        moduleConfig: {
          'multi2vec-bind': { skip: true },
        },
      },
      {
        name: 'media',
        dataType: ['text'],
        moduleConfig: {
          'multi2vec-bind': { skip: true },
        },
      },
      {
        name: 'image',
        dataType: ['blob'],
      },
      {
        name: 'audio',
        dataType: ['blob'],
      },
      {
        name: 'video',
        dataType: ['blob'],
      }
    ],
    vectorIndexType: 'hnsw',
    vectorizer: 'multi2vec-bind'
  }

  await client
    .schema.classCreator()
    .withClass(bindSchema)
    .do();
}

export const deleteCollection = async (name: string) => {
  console.log(`Deleting collection ${name}...`);
  await client.schema
    .classDeleter()
    .withClassName(name)
    .do();

  console.log(`Deleted collection ${name}.`);
}

This file contains a function to create a collection, a function to delete a collection, and a function to check if a collection exists. If there are any details we want to change that would define our collections, we can do so in this file.

Notably, we create a new class and tell the multi2vec-bind module which properties hold image, audio and video data. We also add the following properties to our collection: media, name, image, audio, and video. Think of these as the properties we will be able to query against when we start to make queries or import data to Weaviate.
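If you want to experiment with fewer media types, the class definition can also be derived from a list of fields. The sketch below is a hypothetical helper (buildBindSchema is not part of the original project) that produces an object with the same shape as the bindSchema in collection.ts.

```typescript
type MediaField = 'image' | 'audio' | 'video';

interface BindProperty {
  name: string;
  dataType: string[];
  moduleConfig?: Record<string, { skip: boolean }>;
}

// Build a class definition shaped like bindSchema from a list of media fields.
function buildBindSchema(name: string, mediaFields: MediaField[]) {
  // Text properties are stored but skipped by the vectorizer.
  const textProperties: BindProperty[] = ['name', 'media'].map((prop) => ({
    name: prop,
    dataType: ['text'],
    moduleConfig: { 'multi2vec-bind': { skip: true } },
  }));

  // Each media field becomes a blob property holding base64 data.
  const blobProperties: BindProperty[] = mediaFields.map((field) => ({
    name: field,
    dataType: ['blob'],
  }));

  // The module config maps, for example, imageFields to the 'image' property.
  const bindConfig: Record<string, string[]> = {};
  for (const field of mediaFields) {
    bindConfig[`${field}Fields`] = [field];
  }

  return {
    class: name,
    vectorizer: 'multi2vec-bind',
    vectorIndexType: 'hnsw',
    moduleConfig: { 'multi2vec-bind': bindConfig },
    properties: [...textProperties, ...blobProperties],
  };
}

// A collection that only vectorizes images and audio.
const schema = buildBindSchema('BindExample', ['image', 'audio']);
```

The returned object could be passed to withClass() in place of the hand-written bindSchema.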

Now we create a file called import.ts that uses Weaviate's ObjectsBatcher to batch import files.

import { ObjectsBatcher, WeaviateClient, generateUuid5 } from 'weaviate-ts-client';
import { getWeaviateClient } from './client';
import { FileInfo, getBase64, listFiles } from './util';

const sourceBase = 'public';
const sourceImages = sourceBase + '/image/';
const sourceAudio = sourceBase + '/audio/';
const sourceVideo = sourceBase + '/video/';

const client: WeaviateClient = getWeaviateClient();

export const importMediaFiles = async (collectionName: string) => {
  await insertImages(collectionName);
  await insertAudio(collectionName);
  await insertVideo(collectionName);
}

const insertImages = async (collectionName: string) => {
  let batcher: ObjectsBatcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 5;

  const files = listFiles(sourceImages);
  console.log(`Importing ${files.length} images.`)

  for (const file of files) {
    const item = {
      name: file.name,
      image: getBase64(file.path),
      media: 'image'
    };

    console.log(`Adding [${item.media}]: ${item.name}`);

    batcher = batcher.withObject({
      class: collectionName,
      properties: item,
      id: generateUuid5(file.name)
    });

    if (++counter == batchSize) {
      console.log(`Flushing ${counter} items.`)

      // flush the batch queue
      await batcher.do();

      // restart the batch queue
      counter = 0;
      batcher = client.batch.objectsBatcher();
    }
  }

  if (counter > 0) {
    console.log(`Flushing remaining ${counter} item(s).`)
    await batcher.do();
  }
}

const insertAudio = async (collectionName: string) => {
  let batcher: ObjectsBatcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 3;

  const files = listFiles(sourceAudio);
  console.log(`Importing ${files.length} audio files.`)

  for (const file of files) {
    const item = {
      name: file.name,
      audio: getBase64(file.path),
      media: 'audio'
    };

    console.log(`Adding [${item.media}]: ${item.name}`);

    batcher = batcher.withObject({
      class: collectionName,
      properties: item,
      id: generateUuid5(file.name)
    });

    if (++counter == batchSize) {
      console.log(`Flushing ${counter} items.`)
      // flush the batch queue
      await batcher.do();

      // restart the batch queue
      counter = 0;
      batcher = client.batch.objectsBatcher();
    }
  }

  if (counter > 0) {
    console.log(`Flushing remaining ${counter} item(s).`)
    await batcher.do();
  }
}

const insertVideo = async (collectionName: string) => {
  let batcher: ObjectsBatcher = client.batch.objectsBatcher();
  let counter = 0;
  const batchSize = 1;

  const files = listFiles(sourceVideo);
  console.log(`Importing ${files.length} video files.`)

  for (const file of files) {
    const item = {
      name: file.name,
      video: getBase64(file.path),
      media: 'video'
    };

    console.log(`Adding [${item.media}]: ${item.name}`);

    batcher = batcher.withObject({
      class: collectionName,
      properties: item,
      id: generateUuid5(file.name)
    });

    if (++counter == batchSize) {
      console.log(`Flushing ${counter} items.`)
      // flush the batch queue
      await batcher.do();

      // restart the batch queue
      counter = 0;
      batcher = client.batch.objectsBatcher();
    }
  }

  if (counter > 0) {
    console.log(`Flushing remaining ${counter} item(s).`)
    await batcher.do();
  }
}

This file reads our image, video and audio directories for media, encodes each file as base64 and sends it to Weaviate to be stored.
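The three insert functions share one pattern: collect objects, flush every batchSize items, then flush whatever is left. As a rough sketch, that pattern amounts to splitting the file list into chunks. The chunk helper and the sample file names below are hypothetical; the batch sizes are the ones used in import.ts.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('Batch size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}

// Batch sizes mirror import.ts: smaller batches for heavier media.
const batchSizes = { image: 5, audio: 3, video: 1 };

// Seven sample file names split with the image batch size.
const fileNames = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg', 'f.jpg', 'g.jpg'];
const imageBatches = chunk(fileNames, batchSizes.image);
// imageBatches holds two batches: the first five names, then the remaining two.
```

Each chunk corresponds to one objectsBatcher that is filled with withObject() and flushed with do(), which is what the counter logic in import.ts does by hand.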

We then create a file called index.ts where we run all the files we created above.

import { createBindCollection, deleteCollection } from './collection';
import { importMediaFiles } from './import';

const collectionName = 'BindExample';

const run = async () => {
  await deleteCollection(collectionName);
  await createBindCollection(collectionName);
  await importMediaFiles(collectionName);
}

run();

In index.ts we can change our collection name to our liking. When this file is run, we delete any existing collection with the name we specify, create a new collection with the same name and then import all the media files in our ./public folder.

It's important to note that we need our Weaviate instance running with Docker before going to the next step.

To run our import process, we need to add the following scripts to our package.json file.

"scripts": {
  "preimport": "rimraf ./build && tsc src/*.ts --outDir build",
  "import": "npm run preimport && node build/index.js"
},

We can now run yarn run import to start the import process. Depending on how much data you're importing, it can take a little time. While that runs, let's create our Search interface.

Building Search Functionality

Now we need to create a couple of components that will go in our web application.

Create a ./components folder and add a file called search.tsx. Paste the following code in it.

'use client'

import { useSearchParams, usePathname, useRouter } from "next/navigation"
import { useDebouncedCallback } from "use-debounce"

export default function Search({ placeholder }: { placeholder: string }) {
  const searchParams = useSearchParams()
  const pathname = usePathname()
  const { replace } = useRouter()

  const handleSearch = useDebouncedCallback((term: string) => {
    const params = new URLSearchParams(searchParams.toString())
    if (term) {
      params.set("search", term)
    } else {
      params.delete("search")
    }
    replace(pathname + "?" + params)
  }, 300);

  return (
    <div className="pt-20">
      <div className="relative">
        <label htmlFor="search" className="sr-only"> Search </label>
        <input
          type="search"
          id="search"
          placeholder={placeholder}
          onChange={(e) => handleSearch(e.target.value)}
          defaultValue={searchParams.get("search")?.toString()}
          className="w-[400px] h-12 rounded-md bg-gray-200 p-2 shadow-sm sm:text-sm"
        />
      </div>
    </div>
  )
}

We have an input tag that takes search terms from users. We write these search terms to the URL's search parameter and use the replace method from next/navigation to update the URL. Because the handler is debounced by 300 ms, the URL updates, and a new search is triggered, only once the user stops typing.
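The URL handling inside handleSearch can be pulled out into a pure function, which makes it easy to check in isolation. This is a sketch under a couple of assumptions: buildSearchUrl is a hypothetical name, and unlike the component it trims the term and avoids a trailing "?" when no parameters remain.

```typescript
// Return the URL the search box should navigate to for a given term.
function buildSearchUrl(pathname: string, currentQuery: string, term: string): string {
  const params = new URLSearchParams(currentQuery);
  const trimmed = term.trim();
  if (trimmed) {
    // Set or overwrite the "search" parameter, keeping any other parameters.
    params.set('search', trimmed);
  } else {
    // An empty term clears the parameter.
    params.delete('search');
  }
  const query = params.toString();
  // Avoid a dangling "?" when no parameters remain.
  return query ? `${pathname}?${query}` : pathname;
}

const withTerm = buildSearchUrl('/', '', 'dogs playing');
const cleared = buildSearchUrl('/', 'search=dogs', '');
const replaced = buildSearchUrl('/', 'page=2&search=x', 'cat');
```

Inside the component, the result of a function like this could be handed straight to replace().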

Optionally, you can import the footer and navigation components. We don’t really need them to demonstrate Multimodal search but they make the app look nicer!

In the ./app folder, paste the following code in page.tsx.

// import Footer from '@/components/footer';
// import Navigation from '@/components/navigation';
import Search from '../components/search'
import Link from 'next/link'

import weaviate, {
  WeaviateClient,
  WeaviateObject,
} from "weaviate-ts-client";

const client: WeaviateClient = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

export default async function Home({
  searchParams
}: {
  searchParams?: {
    search?: string;
  }
}) {
  const search = searchParams?.search || "";
  const data = await searchDB(search);

  return (
    <html lang="en">
      <body>
        <div className="fixed h-screen w-full bg-gradient-to-br from-lime-100 via-teal-50 to-amber-100" />

        {/* <Navigation /> */}

        <main className="flex min-h-screen w-full flex-col items-center justify-center py-32">

          <Search placeholder="Search for a word" />

          <div className="relative flex grid grid-cols-1 gap-4 lg:grid-cols-4 lg:gap-8 p-20">

            {data.map((result: WeaviateObject) => (
              <div key={result?._additional.id} className="">

                <div className="h-40 w-50">
                  <p className=" w-16 h-6 mt-2 ml-2 block text-center whitespace-nowrap items-center justify-center rounded-lg translate-y-8 transform bg-white px-2.5 py-0.5 text-sm text-black">
                    {result.media}
                  </p>

                  {result?.media == 'image' &&
                    <img
                      alt="Certainty: "
                      className='block object-cover w-full h-full rounded-lg'
                      src={
                        '/' + result.media + '/' + result.name
                      }
                    />
                  }

                  {result?.media == 'audio' &&
                    <audio controls src={
                      '/' + result.media + '/' + result.name
                    } className='block object-none w-full h-full rounded-lg'>
                      Your browser does not support the audio element.
                    </audio>
                  }

                  {result.media == 'video' &&
                    <video controls src={
                      '/' + result.media + '/' + result.name
                    } className='block object-none w-full h-full rounded-lg'>
                      Your browser does not support the video element.
                    </video>
                  }

                </div>

              </div>
            ))}
          </div>
        </main>
        {/* <Footer /> */}
      </body>
    </html>
  )
}

async function searchDB(search: string) {
  const res = await client.graphql
    .get()
    .withClassName("BindExample")
    .withFields("media name _additional{ certainty id }")
    .withNearText({ concepts: [`${search}`] })
    .withLimit(8)
    .do();

  return res.data.Get.BindExample;
}

Here we define a function, searchDB(), which makes a call to Weaviate to search for the given search term. Weaviate lets us run a vector search over our BindExample collection with withNearText(), passing the user-entered search term as the query. We choose which properties to return with withFields() and the number of results to return with withLimit(). We then map the query response to the UI.
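The query asks for certainty, but the page does not use it yet. One possible use, sketched below, is to hide weak matches and to label each result. The SearchResult interface mirrors the fields requested in withFields(); the helper names and the 0.6 cut-off are assumptions to tune against your own data.

```typescript
// Shape of one result as requested by searchDB():
// "media name _additional{ certainty id }".
interface SearchResult {
  media: string;
  name: string;
  _additional: { certainty: number; id: string };
}

// Drop weak matches; minCertainty is an assumed cut-off, not a recommended value.
function filterByCertainty(results: SearchResult[], minCertainty: number): SearchResult[] {
  return results.filter((result) => result._additional.certainty >= minCertainty);
}

// Build a label such as "Certainty: 87%" that could serve as alt text.
function certaintyLabel(result: SearchResult): string {
  return `Certainty: ${Math.round(result._additional.certainty * 100)}%`;
}

// Made-up sample results for illustration.
const sample: SearchResult[] = [
  { media: 'image', name: 'dog.jpg', _additional: { certainty: 0.87, id: 'a' } },
  { media: 'audio', name: 'rain.wav', _additional: { certainty: 0.52, id: 'b' } },
];
const strong = filterByCertainty(sample, 0.6);
```

Typing the results with a local interface like this also makes property access such as result.media explicit, rather than relying on the client's generic object type.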

Final Result

You should then be able to search like this:

Figure: the final multimodal search project

Conclusion

We just saw how to leverage a Multimodal AI model to power our Multimodal search in TypeScript with Next.js. A lot is happening in the space, and this only scratches the surface of Multimodal use cases. One could take this a step further and enable search via image, video or audio uploads. Another avenue worth exploring is Generative Multimodal models to either enhance results or interact with existing datasets. If this is something you find interesting or would love to continue a conversation on, find me online at malgamves.
