Where On-Device and Cloud LLMs Meet

By ShiroKatsuya

Overview

What is Calista?

Calista is a research project focused on enabling collaboration between on-device language models and cloud-based models through the Model Context Protocol (MCP). It aims to reduce cloud costs while maintaining quality by allowing local models to handle long contexts.

How do I use Calista?

To use Calista, clone the repository, set up a local model server (either ollama or tokasaurus), and configure your API keys for cloud LLM providers. You can then run the demo application to test the protocol.

What are the key features of Calista?

  • Cost-efficient collaboration between on-device and cloud language models.
  • Support for multiple local model servers.
  • Real-time streaming of responses.
  • Automated architecture improvement through Neural Architecture Search (NAS).

What are the use cases of Calista?

  1. Reducing cloud costs for language model applications.
  2. Enhancing the performance of on-device models with cloud capabilities.
  3. Automating the improvement of language model architectures.

FAQ about Calista

  • What is the Model Context Protocol (MCP)?

MCP is a communication protocol that allows on-device models to collaborate with cloud models efficiently.

  • Is Calista free to use?

Yes! Calista is open-source and free to use.

  • What programming language is Calista implemented in?

Calista is implemented in Python.

Content

[Minions logo]

Where On-Device and Cloud LLMs Meet

What is this? Minions is a communication protocol that enables small on-device models to collaborate with frontier models in the cloud. By only reading long contexts locally, we can reduce cloud costs with minimal or no quality degradation. This repository provides a demonstration of the protocol. Get started below, or see our paper and blogpost for more information.

Paper: Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models

Blogpost: https://hazyresearch.stanford.edu/blog/2025-02-24-minions

Setup

We have tested the following setup on Mac and Ubuntu with Python 3.10-3.11 (note: Python 3.13 is not supported).

Optional: Create a virtual environment with your favorite package manager (e.g. conda, venv, uv)
conda create -n minions python=3.11

Step 1: Clone the repository and install the Python package.

git clone https://github.com/HazyResearch/minions.git
cd minions
pip install -e .  # installs the minions package in editable mode

Step 2: Install a server for running the local model.

We support two servers for running local models: ollama and tokasaurus. You need to install at least one of these.

  • You should use ollama if you do not have access to NVIDIA GPUs. Install ollama following the instructions at https://ollama.com/download. To enable Flash Attention, run launchctl setenv OLLAMA_FLASH_ATTENTION 1 and, if on a Mac, restart the ollama app.
  • You should use tokasaurus if you have access to NVIDIA GPUs and you are running the Minions protocol, which benefits from tokasaurus's high throughput. Install it with the following command:
uv pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ tokasaurus==0.0.1.post1

Step 3: Set your API key for at least one of the following cloud LLM providers.

If needed, create an OpenAI API key or a Together AI API key for the cloud model.

export OPENAI_API_KEY=<your-openai-api-key>
export TOGETHER_API_KEY=<your-together-api-key>
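
The cloud clients pick these keys up from the environment. As a quick sanity check before launching the demo, you can confirm the variables are visible to Python; this snippet only assumes the standard variable names exported above:

import os

# Confirm that at least one cloud provider key is available.
# Uses the OPENAI_API_KEY / TOGETHER_API_KEY names exported above.
for var in ("OPENAI_API_KEY", "TOGETHER_API_KEY"):
    status = "set" if os.environ.get(var) else "NOT set"
    print(f"{var} is {status}")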

Minions Demo Application

[Demo video]

To try the Minion or Minions protocol, run the following command:

streamlit run app.py

If you see an error about the ollama client, such as:

An error occurred: Failed to connect to Ollama. Please check that Ollama is downloaded, running and accessible. https://ollama.com/download

try running the following command:

OLLAMA_FLASH_ATTENTION=1 ollama serve

Example code: Minion (singular)

The following example is for an ollama local client and an openai remote client. The protocol is minion.

from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minion import Minion

local_client = OllamaClient(
    model_name="llama3.2",
)

remote_client = OpenAIClient(
    model_name="gpt-4o",
)

# Instantiate the Minion object with both clients
minion = Minion(local_client, remote_client)


context = """
Patient John Doe is a 60-year-old male with a history of hypertension. In his latest checkup, his blood pressure was recorded at 160/100 mmHg, and he reported occasional chest discomfort during physical activity.
Recent laboratory results show that his LDL cholesterol level is elevated at 170 mg/dL, while his HDL remains within the normal range at 45 mg/dL. Other metabolic indicators, including fasting glucose and renal function, are unremarkable.
"""

task = "Based on the patient's blood pressure and LDL cholesterol readings in the context, evaluate whether these factors together suggest an increased risk for cardiovascular complications."

# Execute the minion protocol for up to two communication rounds
output = minion(
    task=task,
    context=[context],
    max_rounds=2
)
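
The call returns the protocol's final result; the exact structure of the returned object may vary between versions, but it can be inspected directly:

# Inspect whatever the protocol returned, e.g. when experimenting
# with max_rounds or different local models.
print(output)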

Streaming Output

To enable real-time streaming of response output as it's being generated, you can use the stream_output parameter:

from minions.clients.openai import OpenAIClient
from minions.minion import Minion

# Enable streaming in the client
openai_client = OpenAIClient(
    model_name="gpt-4o",
    stream=True  # Enable streaming in the client
)

# Create the minion with streaming enabled
minion = Minion(
    local_client=openai_client,
    remote_client=openai_client,
    stream_output=True  # Enable streaming in the Minion
)

# Now responses will print incrementally in real-time
result = minion(
    task="Generate a short poem about artificial intelligence",
    context=["Make it thoughtful but accessible to general audiences"]
)

You can also provide a custom callback function for more control over how streaming content is displayed:

def custom_callback(agent_type, chunk, is_streaming=False, is_final=False):
    """Custom callback to handle streaming output."""
    if is_streaming:
        # Process streaming chunks in real-time
        # For example, you could format or colorize the output
        pass
    elif chunk:
        # Handle complete responses
        print(f"\n[{agent_type}] COMPLETE RESPONSE: {chunk}\n")

# Create the minion with custom callback
minion = Minion(
    local_client=openai_client,
    remote_client=openai_client,
    callback=custom_callback,
    stream_output=True
)

See the examples/stream_example.py file for a complete example.

Example Code: Minions (plural)

The following example is for an ollama local client and an openai remote client. The protocol is minions.

from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minions import Minions
from pydantic import BaseModel

class StructuredLocalOutput(BaseModel):
    explanation: str
    citation: str | None
    answer: str | None

local_client = OllamaClient(
    model_name="llama3.2",
    temperature=0.0,
    structured_output_schema=StructuredLocalOutput,
)

remote_client = OpenAIClient(
    model_name="gpt-4o",
)


# Instantiate the Minions object with both clients
minion = Minions(local_client, remote_client)


context = """
Patient John Doe is a 60-year-old male with a history of hypertension. In his latest checkup, his blood pressure was recorded at 160/100 mmHg, and he reported occasional chest discomfort during physical activity.
Recent laboratory results show that his LDL cholesterol level is elevated at 170 mg/dL, while his HDL remains within the normal range at 45 mg/dL. Other metabolic indicators, including fasting glucose and renal function, are unremarkable.
"""

task = "Based on the patient's blood pressure and LDL cholesterol readings in the context, evaluate whether these factors together suggest an increased risk for cardiovascular complications."

# Execute the minions protocol for up to two communication rounds
output = minion(
    task=task,
    doc_metadata="Medical Report",
    context=[context],
    max_rounds=2
)

Python Notebook

To run Minion/Minions in a notebook, check out minions.ipynb.

CLI

To run Minion/Minions from the CLI, check out minions_cli.py.

minions --help
minions --context <path_to_context> --protocol <minion|minions>

Autonomous LLM Improvement through Neural Architecture Search

This project implements a Neural Architecture Search (NAS) system that enables a Large Language Model to autonomously improve its own architecture through evolutionary techniques.

Overview

The system uses an evolutionary algorithm to search for optimal hyperparameters and architecture configurations for fine-tuning LLMs. It automatically performs the following steps (a minimal sketch of the loop appears after this list):

  1. Generates a population of diverse architecture configurations
  2. Trains and evaluates each architecture
  3. Selects the best performers
  4. Creates a new generation through mutation
  5. Repeats the process for multiple generations to find the optimal configuration
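
A minimal sketch of this loop is shown below. The helper functions passed in (sample_fn, evaluate_fn, mutate_fn) are stand-ins for the corresponding routines in nas.py, whose actual names and signatures may differ:

import random

def evolutionary_search(sample_fn, evaluate_fn, mutate_fn,
                        population_size=8, generations=5, top_k=2):
    """Generic generate -> evaluate -> select -> mutate loop.

    sample_fn()      returns a random architecture configuration
    evaluate_fn(cfg) trains the candidate and returns a score
                     (lower is better, e.g. validation perplexity)
    mutate_fn(cfg)   returns a randomly perturbed copy of cfg
    """
    # 1. Generate an initial population of random configurations
    population = [sample_fn() for _ in range(population_size)]
    best = None
    for _ in range(generations):
        # 2.-3. Train and evaluate every candidate in this generation
        scored = sorted(((evaluate_fn(cfg), cfg) for cfg in population),
                        key=lambda pair: pair[0])
        if best is None or scored[0][0] < best[0]:
            best = scored[0]
        # 4. Select the top performers as parents
        parents = [cfg for _, cfg in scored[:top_k]]
        # 5. Create the next generation by mutating the parents
        population = parents + [
            mutate_fn(random.choice(parents))
            for _ in range(population_size - top_k)
        ]
    # 6. Return the best (score, configuration) pair found
    return best

In nas.py, the evaluation step corresponds to LoRA fine-tuning the base model and measuring perplexity on the chosen dataset.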

Key Features

  • Evolutionary Search: Uses selection and mutation to explore the architecture space
  • Parameter-Efficient Fine-Tuning: Leverages LoRA (Low-Rank Adaptation) techniques (an example configuration is sketched after this list)
  • Automated Evaluation: Models are automatically trained and evaluated
  • Progress Tracking: Saves results after each generation
  • Self-Improvement: The system continuously improves its own architecture
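
As a rough illustration of what one LoRA candidate looks like, here is how it might be expressed with the Hugging Face peft library. The use of peft and all concrete values are assumptions for illustration; nas.py may construct its adapters differently, and use_rslora requires a reasonably recent peft release:

from peft import LoraConfig

# One hypothetical candidate, expressed as a peft LoRA configuration.
candidate = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # LoRA alpha
    target_modules=["q_proj", "v_proj"],  # modules to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_rslora=True,                      # rank-stabilized LoRA
)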

Search Space

The NAS explores the following architectural parameters (a toy sampler over this space is sketched after the list):

  • LoRA rank (r)
  • LoRA alpha values
  • Target modules for fine-tuning
  • Learning rates
  • Batch sizes
  • Gradient accumulation steps
  • Use of rank-stabilized LoRA
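
A toy sampler and mutator over such a space might look like the sketch below; the value ranges are illustrative assumptions, not the ones defined in nas.py:

import random

# Illustrative value ranges; nas.py defines its own search space.
SEARCH_SPACE = {
    "lora_r": [8, 16, 32, 64],
    "lora_alpha": [16, 32, 64],
    "target_modules": [
        ["q_proj", "v_proj"],
        ["q_proj", "k_proj", "v_proj", "o_proj"],
    ],
    "learning_rate": [1e-5, 5e-5, 1e-4, 2e-4],
    "batch_size": [1, 2, 4, 8],
    "gradient_accumulation_steps": [1, 2, 4, 8],
    "use_rslora": [True, False],
}

def sample_config():
    """Draw one random configuration from the search space."""
    return {name: random.choice(values) for name, values in SEARCH_SPACE.items()}

def mutate_config(config, mutation_rate=0.3):
    """Re-sample a random subset of an existing configuration's parameters."""
    mutated = dict(config)
    for name, values in SEARCH_SPACE.items():
        if random.random() < mutation_rate:
            mutated[name] = random.choice(values)
    return mutated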

Usage

Run the NAS system with:

python nas.py

You can modify the parameters in the __main__ section of nas.py to customize (an illustrative, hypothetical set of values follows the list):

  • Base model
  • Dataset for fine-tuning
  • Population size
  • Number of generations
  • Sequence length
  • Output directory
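
For orientation, customization boils down to editing a handful of assignments. The names and values below are hypothetical; check the __main__ block of nas.py for the real ones:

# Hypothetical values; the actual variable names live in nas.py's __main__ block.
base_model = "meta-llama/Llama-3.2-1B"   # base model to fine-tune
dataset_path = "data/finetune.jsonl"     # dataset for fine-tuning
population_size = 8                      # candidates per generation
num_generations = 5                      # evolutionary iterations
max_seq_length = 2048                    # sequence length
output_dir = "nas_results"               # where checkpoints and history go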

How It Works

  1. Initialization: Creates a population of random architecture configurations
  2. Training: Each architecture is trained on the dataset
  3. Evaluation: Models are evaluated based on perplexity (see the note after this list)
  4. Selection: Top performers are selected for the next generation
  5. Mutation: New architectures are created by randomly modifying parameters
  6. Iteration: The process repeats for multiple generations
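
For reference, the perplexity used in step 3 is the exponential of the mean per-token cross-entropy loss, so lower values indicate a better fit. A minimal helper is shown below; nas.py may compute it differently, for example directly from a trainer's reported eval loss:

import math

def perplexity(mean_cross_entropy_loss: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy loss)."""
    return math.exp(mean_cross_entropy_loss)

# Example: an eval loss of 2.0 nats/token gives a perplexity of about 7.39.
print(perplexity(2.0))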

The best model is saved after each improvement, and a complete history of results is maintained in JSON format.
