Set up a local offline AI server
· 2 min read
- check how much video RAM you have
- first, try a model about a quarter the size of that memory
- set up Open WebUI
1. check the amount of video RAM
How to check varies depending on your GPU vendor and whether it's integrated or dedicated.
# manual approach; works on most systems
lspci | grep VGA
# replace 00:02.0 with your GPU's PCI address from the command above
lspci -v -s 00:02.0 | grep "Memory at"
# e.g.
# Memory at 601c000000 (64-bit, non-prefetchable) [size=16M]
# Memory at 4000000000 (64-bit, prefetchable) [size=256M]
#
# These two "Memory at" lines are PCI memory regions (BARs) of the same
# device, not the actual VRAM size. If `lspci | grep VGA` lists two devices,
# one is probably the motherboard GPU and the other the CPU's integrated GPU
# (both are weak).
# For an iGPU, like mine, memory is shared with system RAM:
# the "16GB" figure some tools report is the maximum shared memory.
# Linux drivers typically let the iGPU borrow up to roughly half of total system RAM.
# With 32GB of RAM, the GPU can take up to ~16GB if it needs it.
# If you only have 16GB of RAM, a reported 16GB is a theoretical limit;
# in practice it can usually only pull ~8GB.
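As a rough sanity check of that half-of-RAM rule (an assumption on my part; the exact cap is driver-specific), you can compute it from total system RAM:
# estimate the iGPU's likely shared-memory ceiling (~half of system RAM)
free -g | awk '/^Mem:/ {printf "RAM: %dGB, iGPU can likely borrow ~%dGB\n", $2, $2/2}'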
# NVIDIA
nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
# AMD / Intel
lspci -v -s $(lspci | grep VGA | cut -d" " -f 1)
# cross-vendor (didn't work for me)
# glxinfo | grep -E 'Video memory|Dedicated video memory'
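On AMD cards with the amdgpu driver there is also a sysfs file that reports dedicated VRAM directly (assuming card0 is your GPU; check /sys/class/drm/ for other cards):
# AMD (amdgpu driver): dedicated VRAM in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total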
2. what model to try
First, I'll try a small model like llama3.2:3b, which fits comfortably in about 4GB of memory.
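The quarter-of-memory rule of thumb from the intro, as a quick calculation (VRAM_GB is a placeholder; substitute your number from step 1):
VRAM_GB=16   # placeholder: replace with your value
echo "try quantized models up to ~$((VRAM_GB / 4))GB"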
# 1. install the ollama CLI and background service
curl -fsSL https://ollama.com/install.sh | sh
# for the current shell only; to make it persistent, set it in
# /etc/systemd/system/ollama.service (see below)
# "try to put every single layer on the GPU; if they don't all fit, offload as many as fit"
export OLLAMA_NUM_GPU=999
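A minimal sketch of the persistent variant, reusing the OLLAMA_NUM_GPU variable from above and assuming the installer created ollama.service (systemctl edit writes a drop-in, so the original unit stays untouched):
sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_GPU=999"
sudo systemctl restart ollama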
# pull the model if needed and start an interactive chat (run again any time to jump back in)
ollama run llama3.2:3b
# you can now chat with it in the terminal
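To confirm the layers actually landed on the GPU, ollama ps reports where the loaded model is running (run it in a second terminal while the chat is open):
# check GPU offload while the model is loaded
ollama ps
# the PROCESSOR column should read something like "100% GPU"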
3. Open WebUI
# install Docker on Fedora
sudo dnf config-manager addrepo --from-repofile=https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl enable --now docker
sudo docker run hello-world
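Optionally, add your user to the docker group so the commands below don't need sudo (note this grants that user root-equivalent access; takes effect after you log out and back in):
sudo usermod -aG docker $USER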
# pull (download) the Open WebUI image
docker pull ghcr.io/open-webui/open-webui:main
# -v open-webui:/app/backend/data
#   volume mapping: persists your data across container restarts
# -p 3000:8080
#   port mapping: exposes the WebUI (port 8080 inside the container) on port 3000 of your machine
# --add-host=host.docker.internal:host-gateway
#   lets the container reach the ollama server running on the host
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
# open http://localhost:3000 in your browser
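A few sanity checks, assuming the container name and port from the command above:
docker ps --filter name=open-webui        # container is up
docker logs --tail 20 open-webui          # startup logs
curl -sI http://localhost:3000 | head -1  # should return HTTP 200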