A lightweight, high-performance inference engine for local AI. Built for fun and experimentation.
- Model Management - Download and manage AI models from model providers like Hugging Face
- System Detection - Automatic GPU detection and system information reporting
- Local Caching - Efficient model storage with custom cache directories
- Multiple Providers - Support for Hugging Face with ModelScope coming soon
```
make build
```

The binary will be available as `./puma`.
```
# From Hugging Face (default)
puma pull InftyAI/tiny-random-gpt2
```

```
puma ls
```

```
puma info
```

Example output:
```
System Information:
Operating System: Darwin
Architecture: arm64
CPU Cores: 14
Total Memory: 36.00 GiB
GPU: Apple M4 Max (Metal) - 32 GPU cores

PUMA Information:
PUMA Version: 0.0.1
Cache Directory: ~/.puma/cache
Cache Size: 799.88 MiB
Models: 1
Running Models: 0
```
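The human-readable sizes in the output above (e.g. 36.00 GiB, 799.88 MiB) can be produced by a small byte formatter. A minimal Rust sketch, assuming a hypothetical `format_size` helper that is not part of PUMA's actual API:

```rust
/// Format a byte count in the GiB/MiB style shown in the
/// `puma info` example output. Hypothetical helper, for illustration.
fn format_size(bytes: u64) -> String {
    const GIB: f64 = 1024.0 * 1024.0 * 1024.0;
    const MIB: f64 = 1024.0 * 1024.0;
    let b = bytes as f64;
    if b >= GIB {
        format!("{:.2} GiB", b / GIB)
    } else {
        format!("{:.2} MiB", b / MIB)
    }
}

fn main() {
    // 36 GiB of total memory, as in the example output above
    println!("{}", format_size(36 * 1024 * 1024 * 1024)); // prints "36.00 GiB"
}
```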
| Command | Status | Description | Example |
|---|---|---|---|
| pull | ✅ | Download a model from a provider | puma pull InftyAI/tiny-random-gpt2 |
| ls | ✅ | List local models | puma ls |
| ps | 🚧 | List running models | puma ps |
| run | 🚧 | Create and run a model | puma run InftyAI/tiny-random-gpt2 |
| stop | 🚧 | Stop a running model | puma stop <model-id> |
| rm | ✅ | Remove a model | puma rm InftyAI/tiny-random-gpt2 |
| info | ✅ | Display system-wide information | puma info |
| inspect | ✅ | Return detailed information about a model or service | puma inspect InftyAI/tiny-random-gpt2 |
| version | ✅ | Show PUMA version | puma version |
| help | ✅ | Show help information | puma help |
PUMA stores models in ~/.puma/cache by default. This location is used for all downloaded models and metadata.
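By way of illustration, resolving a path like `~/.puma/cache` under the user's home directory might look like the sketch below. The function name `default_cache_dir` and the reliance on the `HOME` environment variable are assumptions for this example, not PUMA's actual implementation:

```rust
use std::env;
use std::path::PathBuf;

/// Resolve a default cache directory (~/.puma/cache) from the HOME
/// environment variable; returns None if HOME is unset.
/// Hypothetical helper -- PUMA's real resolution logic may differ.
fn default_cache_dir() -> Option<PathBuf> {
    env::var_os("HOME").map(|home| PathBuf::from(home).join(".puma").join("cache"))
}

fn main() {
    if let Some(dir) = default_cache_dir() {
        println!("cache dir: {}", dir.display());
    }
}
```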
- Hugging Face - Full support with custom cache directories
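Hugging Face serves individual repository files through its `/resolve/{revision}/` endpoint, which a downloader can construct directly. A sketch under stated assumptions: the helper name is hypothetical, and `main` is assumed as the default revision:

```rust
/// Build the download URL for a file in a Hugging Face model repository
/// using the standard /resolve/{revision}/ endpoint.
/// Illustrative only; not PUMA's actual downloader code.
fn hf_file_url(repo: &str, revision: &str, filename: &str) -> String {
    format!("https://huggingface.co/{}/resolve/{}/{}", repo, revision, filename)
}

fn main() {
    println!("{}", hf_file_url("InftyAI/tiny-random-gpt2", "main", "config.json"));
    // prints "https://huggingface.co/InftyAI/tiny-random-gpt2/resolve/main/config.json"
}
```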
```
make build
make test
```

```
puma/
├── src/
│   ├── cli/        # Command-line interface
│   ├── downloader/ # Model download logic
│   ├── registry/   # Model registry management
│   ├── system/     # System detection (CPU, GPU, memory)
│   └── utils/      # Utility functions
├── Cargo.toml      # Rust dependencies
└── Makefile        # Build commands
```
Apache-2.0