
Navigating the landscape

Access mode selection

When deploying AI models in your applications, a good first step is to choose between commercial inference providers and local inference; this choice immediately narrows your range of options.

Each approach has distinct advantages and trade-offs that should be considered based on your specific use case.

Inference service providers

Examples: AWS, Cohere, Google, Microsoft Azure, OpenAI, etc.

Advantages:

  • No infrastructure management or hardware investment required
  • Access to proprietary, provider-specific models
    • Regular model updates & improvements
  • Automatic scaling to handle varying workloads

Disadvantages:

  • Generally higher costs than local or self-managed inference
  • Data privacy considerations (data leaves your environment)
  • Dependency on third-party service availability
    • What happens if the provider ceases to operate?
  • Limited customization options
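
To make the provider route concrete, here is a minimal sketch of calling a hosted API, using OpenAI's Python client as one example from the list above. The model name, prompt, and the presence of an `OPENAI_API_KEY` environment variable are illustrative assumptions, not recommendations.

```python
# Minimal sketch: hosted inference via a commercial provider (OpenAI's client as an example).
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; substitute whatever your provider offers
    messages=[{"role": "user", "content": "In one sentence, what is hosted inference?"}],
)

print(response.choices[0].message.content)
```

The provider handles hardware, scaling and model updates; the trade-off, as noted above, is that your prompt and the response leave your environment.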

Local Inference

Examples: Hugging Face Transformers / Accelerate, llama.cpp, Ollama, TorchServe, TensorFlow Serving

Advantages:

  • Full control over data transmission (data can stay entirely within your environment)
  • No internet dependency for inference operations
  • Ability to deploy custom trained models

Disadvantages:

  • Higher upfront costs for hardware procurement
  • Unlikely to be able to access the latest, most powerful models
  • Responsibility for model updates and maintenance
  • Potential performance constraints based on available hardware
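
For comparison, here is a minimal local-inference sketch using the Hugging Face Transformers pipeline, one of the tools listed above. The model name is an illustrative assumption; any small model you have downloaded would do.

```python
# Minimal sketch: local inference with Hugging Face Transformers.
# Assumes `pip install transformers torch`; weights are downloaded once, then run locally.
from transformers import pipeline

# distilgpt2 is used here only because it is small enough to run on modest hardware.
generator = pipeline("text-generation", model="distilgpt2")

result = generator("Local inference keeps data on your own machine because", max_new_tokens=40)
print(result[0]["generated_text"])
```

Once the weights are cached, nothing leaves your machine, but hardware, updates and scaling are now your responsibility.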

Decision Factors

Consider the following questions when making your selection:

  1. Data sensitivity: Does your application process confidential or regulated data? If your data cannot leave your premises, your options may be limited to particular inference providers or to local models.
  2. Inference volume: How many requests do you expect to process daily or monthly? At low volumes, the up-front cost of local hardware can be hard to justify; at sustained high volumes, it may pay for itself.
  3. Latency requirements: How time-sensitive are your model responses? If responses must be fast, local inference on modest hardware may struggle to keep up, which can rule it out for small-scale users.
  4. Budget constraints: Is your priority upfront savings or long-term cost optimization? (See the rough break-even sketch after this list.)
  5. Technical resources: Do you have the expertise to manage local model deployment, and potentially training?
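
To make the budget question (point 4) concrete, here is a rough break-even sketch. Every figure in it is a hypothetical placeholder; substitute your provider's actual pricing, your expected volume and your own hardware quotes.

```python
# Rough break-even sketch: hosted API cost vs. one-off local hardware cost.
# All numbers are hypothetical placeholders, not real prices.
api_cost_per_1m_tokens = 0.50          # USD per million tokens (check your provider's price list)
tokens_per_month = 2_000_000_000       # expected monthly volume
hardware_cost = 8_000.00               # one-off GPU workstation / server purchase
local_running_cost_per_month = 150.00  # power, hosting and maintenance estimate

api_monthly = api_cost_per_1m_tokens * tokens_per_month / 1_000_000

if api_monthly <= local_running_cost_per_month:
    print("At this volume the hosted API is cheaper; local hardware never breaks even.")
else:
    months = hardware_cost / (api_monthly - local_running_cost_per_month)
    print(f"Hosted API: ~${api_monthly:,.0f}/month; local hardware breaks even after ~{months:.1f} months.")
```

With these placeholder numbers the hardware pays for itself in under a year, but the conclusion flips at lower volumes, which is why the volume and budget questions belong together.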

Note that there isn’t a one-size-fits-all solution.

That said, for getting started, a commercial inference provider is often the easier, lower-friction choice.

How to read model cards

Model “cards” are to AI models what product labels or specification sheets are to regular products.

Model cards are supplied by the model provider to help you understand what the model is and how to best use it.

Examples of model cards

Model cards come in a variety of formats. See, for example, the following cards for embedding models:

And some cards for generative AI models:

Even from this small selection, you can see that these cards vary according to the hosting site and the model provider. This volume of information can be overwhelming, especially at first.

Model Cards

We will get into these in more detail in later courses. But here are some key parameters to look out for.

  1. Basic model information
    • Model name and version
    • Model type (generative, embedding, etc.)
    • Model size (parameter count) and architecture
    • Training data overview
    • Costs, if accessed through an API, or hardware requirements
  2. Technical specifications
    • Dimensionality (for embedding models)
    • Context length (for generative models)
    • Supported languages or modalities
  3. Performance metrics
    • Benchmark results
    • Known strengths and limitations
    • Downstream performance
  4. Usage information
    • Intended use cases
    • Implementation guidelines or code snippets
  5. Legal and ethical considerations
    • License type and usage restrictions
    • Potential biases or fairness considerations

Just reviewing these parameters will take you a long way towards identifying models that will suit your needs.
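
If a model is hosted on the Hugging Face Hub, you can also pull its card programmatically, which is handy when comparing several candidates. A minimal sketch follows; the repository id is only an example, and the metadata fields actually present vary from card to card.

```python
# Minimal sketch: loading a model card from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; the repo id is only an example.
from huggingface_hub import ModelCard

card = ModelCard.load("sentence-transformers/all-MiniLM-L6-v2")

# card.data holds the structured metadata (license, language, tags, ...);
# card.text holds the free-form description. Not every card fills in every field.
print(card.data.license)
print(card.data.tags)
print(card.text[:500])
```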

In the absence of other information, a good starting point would be to filter for:

  • The right model type
  • Suitable modality, language and context length or dimensionality
  • Ability to access model (inference provider / local inference)
  • License suitability

Then, select candidate models from a reputable model provider, or based on their benchmark performance.

Ultimately, you may wish to evaluate both the model’s and your application’s performance yourself.

But following these simple heuristics will get you a long way towards selecting a good baseline model or a set of baseline models.
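
As a small illustration of those heuristics, the sketch below filters Hub models by a task tag and a license tag and ranks them by downloads as a rough popularity proxy. The specific tags, sort key and limit are assumptions to adapt to your own constraints.

```python
# Sketch: applying the filtering heuristics above on the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; the tag choices here are illustrative.
from huggingface_hub import HfApi

api = HfApi()

# Filter by task tag and license tag, then rank by downloads (descending).
candidates = api.list_models(
    filter=["text-generation", "license:apache-2.0"],
    sort="downloads",
    direction=-1,
    limit=5,
)

for model in candidates:
    print(model.id)
```

This only produces a shortlist; the evaluation step mentioned above is still where the final choice gets made.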

Questions and feedback

If you have any questions or feedback, let us know in the user forum.