Skip to main content

Scaling limits with collections

Added in v1.30
Have you hit the collections count limit?

If you see an error that the collection count limit has been reached, it means you can't create any more collections. This limit aims to ensure Weaviate remains performant. If your instance already exceeds the limit, Weaviate will not allow the creation of any new collections. Existing collections will not be deleted.

Instead of simply raising the limit, consider rethinking your architecture. If you really need to change the limit, use the MAXIMUM_ALLOWED_COLLECTIONS_COUNT environment variable.

This guide offers an overview of the available architectural choices of using multi-tenancy or defining a dedicated collection for each subset of data.

Consider a scenario where a developer is creating a SaaS platform for product recommendations that allows end users (merchants) to recommend products to their shoppers. In this scenario, each merchant will upload and work only with their own data.

One option is for the developer to create a dedicated collection for each merchant's dataset. However, as the number of merchants grows, so does the number of collections, potentially leading to performance bottlenecks and increased operational complexity. This leads us to an important architectural question: Should you use "Multi-tenancy" or "One collection per dataset"?

Choosing the right architecture

Multi-tenancy vs "One collection per dataset" architecture

When designing a vector database collection definition (data schema) in Weaviate, you must decide between multi-tenancy (storing data for multiple tenants in a single collection) or creating separate collections for each dataset (the "One collection per dataset" strategy). Each approach has its own advantages and trade-offs, especially in terms of performance, scalability, and management.

This guide aims to clarify these concepts and highlight the implications of each approach, focusing on the benefits and drawbacks:

"One collection per dataset" architecture

In this approach, each dataset is assigned a dedicated collection to ensure the separation of data between them. Before the implementation of multi-tenancy in Weaviate, this was the best approach for managing multiple datasets.

When a book store registers on our platform, we create the collection BookStoreProducts. This allows the store to customize the collection and add properties that are specific to their e-commerce platform, like author, title, genre, etc.

Creating a new collection per dataset (ShoeStoreProducts, GameStoreProducts, etc.) might seem like a simple and effective way to maintain data isolation. However, as the platform scales, this approach quickly encounters significant challenges.

"One collection per dataset" architecture example
"One collection per dataset" architecture example.

Advantages

  • Customizability: Collection definition changes or optimizations can be tailored to individual collections without affecting others.
  • Data isolation: Datasets are completely separated through the use of dedicated collections.

Challenges

  • Resource overhead: Each collection requires its own definition, indexes, and storage, leading to increased memory and disk usage. Managing managing millions or even thousands of collections becomes nearly impossible.
  • Operational complexity: Collection definition changes must be applied individually to each collection. Every collection must be updated separately and this takes a lot of time and computational effort.
tip

If you are creating more than 20 collections, take a moment to consider if multi-tenancy might be utilized.

Multi-tenancy architecture

Multi-tenancy refers to the practice of dividing a single collection to multiple datasets (tenants). Each tenant’s data is logically isolated through the use of tenant names. Multi-tenancy is especially useful when you want to store data for multiple customers or when you want to store similarly structured data for multiple projects.


Each tenant is identified by their name, ensuring that their products remain logically separated within the same collection. When a "book store" registers on our platform, we can create a new tenant called BookStore in the collection Products.

Queries can also be filtered based on the name to retrieve only the relevant data.

Multi-tenancy architecture example
Multi-tenancy architecture example.

Advantages

Use multi-tenancy when you need to support a large number of tenants and prioritize resource efficiency and scalability.

  • Easier collection definition management: Definition updates apply universally to all tenants. For example, adding a new property to all products is now much easier.
  • Index scalability: Indexes can be optimized for a single collection rather than fragmented across multiple collections. Each tenant has a dedicated, high-performance vector index, which results in faster query speeds. Instead of searching a shared index space, each tenant responds as if it were the only user on the cluster.
  • Data isolation: Each tenant’s data is completely segregated. This also means that data deletion is much easier and faster.

Challenges

  • Access control complexity: Fine-grained access control must be implemented to ensure data isolation between tenants.
  • Uniform collection definition: All tenants must share the same collection schema and configuration.
Reduce storage costs

You can change the state of a tenant into inactive (stored locally on disk) or offloaded (stored on cloud storage) in order to save resources.


Find out more about managing multi-tenancy in How to: Multi-tenancy operations.

Conclusion

While the choice between multi-tenancy and dedicated collections depends on your specific use case, the substantial performance benefits of multi-tenancy make it the preferred approach for most scenarios. With multi-tenancy, you gain significant resource efficiency by reducing indexing overhead and streamlining collection definition updates across all tenants.

Although dedicated collections can offer enhanced data isolation and flexibility in certain cases, their operational complexity and increased resource demands often outweigh these benefits. Regularly monitoring query performance, index size, and resource utilization is crucial to fine-tune your architecture, ensuring it meets both current and future needs.

Further resources

To find out more about multi-tenancy, visit the following pages:

Questions and feedback

If you have any questions or feedback, let us know in the user forum.