Ollama is a lightweight tool that allows you to run large language models (LLMs) locally on your own server. It makes it easy to download, manage, and interact with AI models without relying on external cloud services. All data stays on your VPS.
This guide shows how to install and run Ollama on Debian/Ubuntu and RHEL (AlmaLinux, Rocky Linux) systems.
0. Prerequisites
OS requirements:
Ubuntu 22.04 or newer;
Debian 11 or newer;
AlmaLinux 8/9;
Rocky Linux 8/9.
Other requirements:
At least 8 GB RAM (16 GB or more is recommended);
At least 20 GB storage free; 50–100 GB recommended since model files consume most of the space;
SSH access to the server;
Root or sudo privileges;
Internet access to download models.
Please note: Ollama runs on CPU by default. Performance depends heavily on available RAM and CPU.
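Before installing, you can check how much memory and disk space the server actually has. These are standard Linux commands available on all of the distributions listed above:
free -h
df -h
free -h shows total and available RAM, and df -h shows free space on each mounted filesystem.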
1. Install Ollama
Unlike many applications, Ollama does not require adding external APT repositories. Installation is done using the official install script.
1.1 Update your system
On Debian/Ubuntu run the following command to update the system:
apt update && apt upgrade -y
On RHEL (AlmaLinux, Rocky Linux) run this command:
dnf update -y
1.2 Download and run the Ollama install script
Run the following command:
curl -fsSL https://ollama.com/install.sh | sh
This script will:
Download the Ollama binary
Install it system-wide
Create a systemd service
Start the Ollama service automatically
The installation usually completes in a few seconds.
On some distributions, the install script may fail with an error because the zstd compression tool is not installed.
If you see this error, install zstd with the command for your OS:
Debian/Ubuntu:
apt-get install zstd
RHEL:
dnf install zstd -y
Then repeat the installation command:
curl -fsSL https://ollama.com/install.sh | sh
After the installation completes, you may see the following warning:
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
This is expected on most VPS plans, which do not provide a GPU. Ollama automatically falls back to the CPU and will still work correctly.
1.3 Verify the installation
Check that Ollama is installed correctly:
ollama --version
If a version number is displayed, Ollama is installed successfully.
2. Manage the Ollama service
Ollama runs as a background service managed by systemd.
2.1 Check service status
systemctl status ollama
You should see that the service is active and running.
2.2 Start and enable Ollama (if needed)
If the service is not running, you can start it with this command:
systemctl start ollama
To make it start automatically at boot, enable the service:
systemctl enable ollama
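If the service does not start or stops unexpectedly, check its logs in the systemd journal. For example, to follow the Ollama logs in real time:
journalctl -u ollama -f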
3. Manage models
3.1 Run your first model
For example, run a popular general-purpose model:
ollama run llama3
On first run:
The model will be downloaded automatically
The download may take several minutes
Model files may take several gigabytes of disk space; llama3 is about 4.7 GB.
After the download completes, you can interact with the model directly in the terminal.
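If you prefer to download a model without opening an interactive session right away, you can pull it first and run it later:
ollama pull llama3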
3.2 How to use a model
After the model loads, you will see a prompt like:
>>>
This means the model is ready.
Type your question in plain English and press Enter. For example:
Explain what a Linux service is in simple terms.
The model will generate an answer directly in the terminal.
The model remembers the context of the conversation while it is running.
When you are done, press the following key combination:
Ctrl + D
This closes the session, but Ollama itself keeps running in the background.
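You can also pass a single prompt directly on the command line, which is useful in scripts: the answer is printed to the terminal and the command exits. The model name and question below are only examples:
ollama run llama3 "Explain what a Linux service is in simple terms."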
3.3 Run other models
Examples:
ollama run mistral
ollama run gemma
ollama run codellama
ollama run phi3
Smaller models are recommended for VPS servers with limited resources.
Mistral is a solid all-around model that works well on VPS servers without a GPU. It is fast enough for everyday use and gives good quality answers for explanations, summaries, and basic coding. If you want one model that can handle many tasks without being too heavy, Mistral is a safe choice.
Mistral typically requires about 6–8 GB of RAM.
Gemma is lighter and quicker, designed for systems with limited resources. It responds fast and uses less memory, but the answers are simpler and shorter. It works best for basic questions, small automation tasks, and situations where speed matters more than depth.
Gemma works well with around 3–4 GB of RAM.
Code Llama is focused on programming. It is useful for writing code, explaining scripts, and fixing simple bugs, but it is not meant for general conversation. This model makes sense if your main goal is coding help on a server.
Code Llama requires about 6–8 GB of RAM.
Phi 3 Mini is very small and surprisingly capable for its size. It runs quickly even on weak VPS servers and is good at clear explanations, simple reasoning, and light coding tasks. If resources are tight, this is often the best model to start with.
Phi 3 Mini can run on as little as 2–3 GB of RAM.
You can find all available Ollama models in the official Ollama model library on their website.
3.4 List installed models
ollama list
This shows all models currently downloaded on the server.
3.5 Remove a model
If you need to free disk space, you can remove a model by running this command:
ollama rm llama3
4. Ollama API access
Ollama automatically exposes a local API endpoint:
http://localhost:11434
This API can be used to:
See available models
Send prompts programmatically
Integrate Ollama with your applications
Run this command to check that the Ollama API is responding and to see which models are currently installed:
curl http://localhost:11434/api/tags
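You can also send a prompt to a model through the API. The example below is a minimal sketch using the generate endpoint; it assumes the llama3 model from step 3.1 is already downloaded, and "stream": false makes Ollama return the whole answer as a single JSON object instead of streaming it token by token:
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Explain what a Linux service is in simple terms.", "stream": false}'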
Important:
By default, the API listens only on localhost. It is not accessible from outside the server, which is the safest setup.
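You can confirm this with the ss utility (part of iproute2, preinstalled on most modern distributions):
ss -tlnp | grep 11434
The output should show Ollama listening on 127.0.0.1:11434.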
5. Firewall considerations
If you plan to use Ollama only locally on the VPS, no firewall changes are required.
If you intend to expose the API externally:
Protect it with authentication
Restrict access by IP
Do not expose it directly to the public internet
Opening the API without protection is not recommended.
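If you do decide to expose the API, keep in mind that Ollama must first be configured to listen on an address other than localhost (the OLLAMA_HOST environment variable in the systemd service controls the bind address), and the port should then be limited to trusted IP addresses. Below is a minimal sketch that allows a single client IP; 203.0.113.10 is a placeholder, replace it with your own address.
Debian/Ubuntu with ufw:
ufw allow from 203.0.113.10 to any port 11434 proto tcp
RHEL with firewalld:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.10" port port="11434" protocol="tcp" accept'
firewall-cmd --reload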
6. Common issues
6.1 Not enough RAM
Symptoms:
Model fails to load
Ollama process is killed
Server becomes unresponsive
Solutions:
Use a smaller model such as phi3 or gemma
Add swap space (see the sketch below)
Upgrade the VPS to a plan with more RAM
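Adding swap will not make models faster, but it can prevent a slightly-too-large model from being killed by the kernel. A minimal sketch that creates a 4 GB swap file; adjust the size to your free disk space and run the commands as root:
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
To keep the swap file after a reboot, add an entry for it to /etc/fstab.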
6.2 Slow responses
This is expected on CPU-only systems. Ollama on a VPS is best suited for:
Testing
Learning
Internal tools
Low-volume automation