auto_tune

Auto-Tuning System - Self-Optimizing Tensor Library

Inspired by HuggingChat + ggml + Candle research

Features:

GPU Auto-Detection (detects VRAM, load, capabilities)
Adaptive Quantization (int8 for inference, fp32 for training)
Zero-Copy between Gleam and Rust NIFs
Auto Batch Size Optimizer (learns the best batch for the hardware)

Target: RTX 4090 24GB VRAM + 32GB RAM

Types

AutoTuner

Auto-tuner state

pub type AutoTuner {
  AutoTuner(
    hardware: HardwareProfile,
    quant_mode: QuantMode,
    history: List(BatchResult),
    current_batch_size: Int,
  )
}

Constructors

AutoTuner(
  hardware: HardwareProfile,
  quant_mode: QuantMode,
  history: List(BatchResult),
  current_batch_size: Int,
)

BatchResult

Result of a batch execution

pub type BatchResult {
  BatchResult(
    batch_size: Int,
    duration_ms: Float,
    throughput: Float,
  )
}

Constructors

BatchResult(
  batch_size: Int,
  duration_ms: Float,
  throughput: Float,
)

Device

Detected device type

pub type Device {
  Cuda(gpu_id: Int, vram_gb: Float)
  Metal(device_id: Int)
  Cpu(cores: Int)
}

Constructors

```
Cuda(gpu_id: Int, vram_gb: Float)
```
```
Metal(device_id: Int)
```
```
Cpu(cores: Int)
```

HardwareProfile

Complete hardware profile

pub type HardwareProfile {
  HardwareProfile(
    device: Device,
    total_vram_gb: Float,
    available_vram_gb: Float,
    total_ram_gb: Float,
    gpu_load_pct: Float,
    optimal_batch_size: Int,
  )
}

Constructors

HardwareProfile(
  device: Device,
  total_vram_gb: Float,
  available_vram_gb: Float,
  total_ram_gb: Float,
  gpu_load_pct: Float,
  optimal_batch_size: Int,
)

MemoryPressure

pub type MemoryPressure {
  Low
  Medium
  High
  Critical
}

Constructors

```
Low
```
```
Medium
```
```
High
```
```
Critical
```

MemoryStrategy

pub type MemoryStrategy {
  MemoryStrategy(
    batch_size_mult: Float,
    quant_mode: QuantMode,
    gc_aggressive: Bool,
  )
}

Constructors

MemoryStrategy(
  batch_size_mult: Float,
  quant_mode: QuantMode,
  gc_aggressive: Bool,
)

QuantContext

Quantization context

pub type QuantContext {
  QuantContext(mode: QuantMode, scales: List(Float))
}

Constructors

QuantContext(mode: QuantMode, scales: List(Float))

QuantMode

Quantization mode

pub type QuantMode {
  Inference
  Training
  Adaptive
}

Constructors

```
Inference
```
```
Training
```
```
Adaptive
```

Values

check_memory_pressure

pub fn check_memory_pressure(
  hw: HardwareProfile,
) -> MemoryPressure

Classifies the current memory pressure from the given hardware profile.
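
The documentation does not state how the pressure level is derived. As a minimal sketch, assuming it depends on the fraction of VRAM in use, the classification could look like the following. The function name `classify_pressure` and the 50% / 75% / 90% cut-offs are hypothetical and are not taken from the library; `MemoryPressure` is redeclared only to keep the example self-contained.

```gleam
// Local copy of the documented type so the sketch compiles on its own.
pub type MemoryPressure {
  Low
  Medium
  High
  Critical
}

/// Hypothetical helper: classifies pressure from the used share of VRAM.
/// The thresholds below are illustrative assumptions.
pub fn classify_pressure(
  available_gb: Float,
  total_gb: Float,
) -> MemoryPressure {
  case total_gb >. 0.0 {
    // Assumption: no reported VRAM is treated as the worst case.
    False -> Critical
    True -> {
      let used = 1.0 -. available_gb /. total_gb
      case used {
        u if u >=. 0.9 -> Critical
        u if u >=. 0.75 -> High
        u if u >=. 0.5 -> Medium
        _ -> Low
      }
    }
  }
}
```

With these assumed thresholds, 20 GB free out of 24 GB would be classified as `Low`, and 1 GB free out of 24 GB as `Critical`.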

compute_scale

pub fn compute_scale(tensor: tensor.Tensor) -> Float

Computes scale for absmax quantization (int8)
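
For reference, absmax quantization maps the largest absolute value in the data onto the int8 limit, so the scale is `max(|x|) / 127`. The sketch below illustrates that formula on a plain `List(Float)`, since the internal layout of `tensor.Tensor` is not shown here. The names `absmax_scale` and `quantize`, and the fallback scale of 1.0 for all-zero input, are assumptions made for illustration.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Hypothetical helper: absmax scale for symmetric int8 quantization.
pub fn absmax_scale(values: List(Float)) -> Float {
  // Largest magnitude in the data (0.0 for an empty list).
  let max_abs =
    list.fold(values, 0.0, fn(acc, x) {
      float.max(acc, float.absolute_value(x))
    })
  case max_abs >. 0.0 {
    // Map the largest magnitude onto 127.
    True -> max_abs /. 127.0
    // Assumption: all-zero or empty input falls back to a scale of 1.0.
    False -> 1.0
  }
}

/// Hypothetical helper: quantizes one value, clamped to [-127, 127].
pub fn quantize(x: Float, scale: Float) -> Int {
  float.round(x /. scale)
  |> int.clamp(min: -127, max: 127)
}
```

Dequantization is the inverse step: multiply the stored integer by the same scale.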

detect_cpu_only

pub fn detect_cpu_only() -> HardwareProfile

Creates a hardware profile for CPU-only systems

detect_hardware

pub fn detect_hardware() -> HardwareProfile

Detects available hardware

get_memory_strategy

pub fn get_memory_strategy(
  pressure: MemoryPressure,
) -> MemoryStrategy

Returns the memory strategy for the given pressure level
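
The concrete values chosen per pressure level are not documented. The sketch below shows one plausible shape of such a mapping: shrink the batch multiplier, switch to int8 (`Inference`), and enable aggressive GC as pressure rises. Every number and mode choice here is an assumption, the function name `strategy_for` is hypothetical, and the types are redeclared only so the example compiles on its own.

```gleam
// Local copies of the documented types.
pub type MemoryPressure {
  Low
  Medium
  High
  Critical
}

pub type QuantMode {
  Inference
  Training
  Adaptive
}

pub type MemoryStrategy {
  MemoryStrategy(
    batch_size_mult: Float,
    quant_mode: QuantMode,
    gc_aggressive: Bool,
  )
}

/// Hypothetical mapping; all values are illustrative assumptions.
pub fn strategy_for(pressure: MemoryPressure) -> MemoryStrategy {
  case pressure {
    // Plenty of headroom: keep the batch size and full precision.
    Low ->
      MemoryStrategy(
        batch_size_mult: 1.0,
        quant_mode: Training,
        gc_aggressive: False,
      )
    Medium ->
      MemoryStrategy(
        batch_size_mult: 0.75,
        quant_mode: Adaptive,
        gc_aggressive: False,
      )
    High ->
      MemoryStrategy(
        batch_size_mult: 0.5,
        quant_mode: Inference,
        gc_aggressive: True,
      )
    // Nearly out of memory: quarter the batch and force int8.
    Critical ->
      MemoryStrategy(
        batch_size_mult: 0.25,
        quant_mode: Inference,
        gc_aggressive: True,
      )
  }
}
```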

main

pub fn main() -> Nil

new

pub fn new() -> AutoTuner

Creates a new auto-tuner

new_quant_context

pub fn new_quant_context(mode: QuantMode) -> QuantContext

Creates a quantization context for the given mode

profile

pub fn profile(
  tuner: AutoTuner,
  batch_size: Int,
  duration_ms: Float,
) -> AutoTuner

Records the result of a batch execution and re-optimizes the batch size
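
How the tuner optimizes is not specified here. A minimal sketch of one simple approach follows: derive throughput from the batch size and duration, then pick the batch size with the highest throughput observed so far. The helpers `record` and `best_batch_size` are hypothetical, the samples-per-second unit is an assumption, and `BatchResult` is redeclared for self-containment.

```gleam
import gleam/int
import gleam/list

// Local copy of the documented type.
pub type BatchResult {
  BatchResult(batch_size: Int, duration_ms: Float, throughput: Float)
}

/// Hypothetical helper: builds a BatchResult with throughput in
/// samples per second (assumed unit).
pub fn record(batch_size: Int, duration_ms: Float) -> BatchResult {
  let throughput = case duration_ms >. 0.0 {
    True -> int.to_float(batch_size) *. 1000.0 /. duration_ms
    // Guard against a zero or negative duration.
    False -> 0.0
  }
  BatchResult(
    batch_size: batch_size,
    duration_ms: duration_ms,
    throughput: throughput,
  )
}

/// Hypothetical helper: returns the batch size with the highest recorded
/// throughput, or `default` when the history is empty.
pub fn best_batch_size(history: List(BatchResult), default: Int) -> Int {
  case history {
    [] -> default
    [first, ..rest] -> {
      let best =
        list.fold(rest, first, fn(best: BatchResult, r: BatchResult) {
          case r.throughput >. best.throughput {
            True -> r
            False -> best
          }
        })
      best.batch_size
    }
  }
}
```

For example, a batch of 32 taking 100 ms yields 320 samples per second, while a batch of 64 taking 160 ms yields 400, so 64 would be preferred under this rule.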

run_hardware_profile

pub fn run_hardware_profile() -> Nil

Runs a complete hardware profile

should_quantize

pub fn should_quantize(
  ctx: QuantContext,
  is_inference: Bool,
) -> Bool

Decides whether to quantize, based on the quantization context and whether the operation is inference
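
The decision rule itself is not documented. One reading consistent with the feature list (int8 for inference, fp32 for training) is sketched below: `Inference` always quantizes, `Training` never does, and `Adaptive` follows the `is_inference` flag. This is an assumption, the function name `wants_int8` is hypothetical, and `QuantMode` is redeclared so the example stands alone.

```gleam
// Local copy of the documented type.
pub type QuantMode {
  Inference
  Training
  Adaptive
}

/// Hypothetical decision rule; the library's actual logic may differ.
pub fn wants_int8(mode: QuantMode, is_inference: Bool) -> Bool {
  case mode {
    // int8 for inference
    Inference -> True
    // fp32 for training
    Training -> False
    // Adaptive defers to the kind of operation being run.
    Adaptive -> is_inference
  }
}
```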