Ollama is a lightweight tool that allows you to run large language models (LLMs) locally on your own server. It makes it easy to download, manage, and interact with AI models without relying on external cloud services. All data stays on your VPS.
This guide shows how to install and run Ollama on Debian/Ubuntu and RHEL (AlmaLinux, Rocky Linux) systems.
0. Prerequisites
OS requirements:
Ubuntu 22.04 or newer;
Debian 11 or newer;
AlmaLinux 8/9;
Rocky Linux 8/9.
Other requirements:
At least 8 GB RAM (16 GB or more is recommended);
At least 20 GB storage free; 50–100 GB recommended since model files consume most of the space;
SSH access to the server;
Root or sudo privileges;
Internet access to download models.
Please note: Ollama runs on CPU by default. Performance depends heavily on available RAM and CPU.
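Before installing, you can check how much memory and disk space the server actually has. These are standard Linux commands available on all of the distributions listed above:
free -h
df -h
free -h shows total and available RAM, and df -h shows free space on each mounted filesystem.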
1. Install Ollama
Unlike many applications, Ollama does not require adding external APT repositories. Installation is done using the official install script.
1.1 Update your system
On Debian/Ubuntu run the following command to update the system:
apt update && apt upgrade -y
On RHEL (AlmaLinux, Rocky Linux) run this command:
dnf update -y
1.2 Download and run the Ollama install script
Run the following command:
curl -fsSL https://ollama.com/install.sh | sh
This script will:
Download the Ollama binary
Install it system-wide
Create a systemd service
Start the Ollama service automatically
The installation usually completes in a few seconds.
On some distributions, the install script may fail with an error because the zstd compression tool is not installed.
If you see this error, install zstd with the command for your OS:
Debian/Ubuntu:
apt-get install zstd
RHEL:
dnf install zstd -y
Then repeat the installation command:
curl -fsSL https://ollama.com/install.sh | sh
After the installation completes, you may see the following warning:
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
This is expected on most VPS plans, which do not provide a GPU. Ollama automatically falls back to the CPU and will still work correctly.
1.3 Verify the installation
Check that Ollama is installed correctly:
ollama --version
If a version number is displayed, Ollama is installed successfully.
2. Manage the Ollama service
Ollama runs as a background service managed by systemd.
2.1 Check service status
systemctl status ollama
You should see that the service is active and running.
2.2 Start and enable Ollama (if needed)
If the service is not running, you can start it with this command:
systemctl start ollama
To make it start automatically at boot, enable the service:
systemctl enable ollama
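If the service does not start or stops unexpectedly, check its logs in the systemd journal. For example, to follow the Ollama logs in real time:
journalctl -u ollama -f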
3. Manage models
3.1 Run your first model
For example, run a popular general-purpose model:
ollama run llama3
On first run:
The model will be downloaded automatically
The download may take several minutes
Model files may take several gigabytes of disk space; llama3 is about 4.7 GB.
After the download completes, you can interact with the model directly in the terminal.
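If you prefer to download a model without opening an interactive session right away, you can pull it first and run it later:
ollama pull llama3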
3.2 How to use a model
After the model loads, you will see a prompt like:
>>>
This means the model is ready.
Type your question in plain English and press Enter. For example:
Explain what a Linux service is in simple terms.
The model will generate an answer directly in the terminal.
The model remembers the context of the conversation while it is running.
When you are done, press the following key combination:
Ctrl + D
This closes the session, but Ollama itself keeps running in the background.
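You can also pass a single prompt directly on the command line, which is useful in scripts: the answer is printed to the terminal and the command exits. The model name and question below are only examples:
ollama run llama3 "Explain what a Linux service is in simple terms."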
3.3 Run other models
Examples:
ollama run mistral
ollama run gemma
ollama run codellama
ollama run phi3
Smaller models are recommended for VPS servers with limited resources.
Mistral is a solid all-around model that works well on VPS servers without a GPU. It is fast enough for everyday use and gives good quality answers for explanations, summaries, and basic coding. If you want one model that can handle many tasks without being too heavy, Mistral is a safe choice.
Mistral typically requires about 6–8 GB of RAM.
Gemma is lighter and quicker, designed for systems with limited resources. It responds fast and uses less memory, but the answers are simpler and shorter. It works best for basic questions, small automation tasks, and situations where speed matters more than depth.
Gemma works well with around 3–4 GB of RAM.
Code Llama is focused on programming. It is useful for writing code, explaining scripts, and fixing simple bugs, but it is not meant for general conversation. This model makes sense if your main goal is coding help on a server.
Code Llama requires about 6–8 GB of RAM.
Phi 3 Mini is very small and surprisingly capable for its size. It runs quickly even on weak VPS servers and is good at clear explanations, simple reasoning, and light coding tasks. If resources are tight, this is often the best model to start with.
Phi 3 Mini can run on as little as 2–3 GB of RAM.
You can find all available Ollama models in the official Ollama model library on their website.
3.4 List installed models
ollama list
This shows all models currently downloaded on the server.
3.5 Remove a model
If you need to free disk space, you can remove a model by running this command:
ollama rm llama3
4. Ollama API access
Ollama automatically exposes a local API endpoint:
http://localhost:11434
This API can be used to:
See available models
Send prompts programmatically
Integrate Ollama with your applications
Run this command to check that the Ollama API is responding and to see which models are currently installed:
curl http://localhost:11434/api/tags
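You can also send a prompt to a model through the API. The example below is a minimal sketch using the generate endpoint; it assumes the llama3 model from step 3.1 is already downloaded, and "stream": false makes Ollama return the whole answer as a single JSON object instead of streaming it token by token:
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Explain what a Linux service is in simple terms.", "stream": false}'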
Important:
By default, the API listens only on localhost. It is not accessible from outside the server, which is the safest setup.
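You can confirm this with the ss utility (part of iproute2, preinstalled on most modern distributions):
ss -tlnp | grep 11434
The output should show Ollama listening on 127.0.0.1:11434.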
5. Firewall considerations
If you plan to use Ollama only locally on the VPS, no firewall changes are required.
If you intend to expose the API externally:
Protect it with authentication
Restrict access by IP
Do not expose it directly to the public internet
Opening the API without protection is not recommended.
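If you do decide to expose the API, keep in mind that Ollama must first be configured to listen on an address other than localhost (the OLLAMA_HOST environment variable in the systemd service controls the bind address), and the port should then be limited to trusted IP addresses. Below is a minimal sketch that allows a single client IP; 203.0.113.10 is a placeholder, replace it with your own address.
Debian/Ubuntu with ufw:
ufw allow from 203.0.113.10 to any port 11434 proto tcp
RHEL with firewalld:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.10" port port="11434" protocol="tcp" accept'
firewall-cmd --reload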
6. Common issues
6.1 Not enough RAM
Symptoms:
Model fails to load
Ollama process is killed
Server becomes unresponsive
Solutions:
Use a smaller model such as phi3 or gemma
Add swap space (see the sketch below)
Upgrade the VPS to a plan with more RAM
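Adding swap will not make models faster, but it can prevent a slightly-too-large model from being killed by the kernel. A minimal sketch that creates a 4 GB swap file; adjust the size to your free disk space and run the commands as root:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
To keep the swap file after a reboot, add an entry for it to /etc/fstab.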
6.2 Slow responses
This is expected on CPU-only systems. Ollama on a VPS is best suited for:
Testing
Learning
Internal tools
Low-volume automation