Quickstart

Installed Ollama through Homebrew. To run the Ollama service: brew services start ollama
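For reference, the full Homebrew flow in one place (standard brew subcommands; ollama is the formula name):

# Install Ollama and start it as a background service
brew install ollama
brew services start ollama

# Check that the service is running
brew services list | grep ollama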

The first model you ran: Llama 2 7B (https://huggingface.co/meta-llama/Llama-2-7b), via ollama run llama2. Even more specifically: https://ollama.com/library/llama2:latest
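A few related subcommands worth keeping nearby (all standard Ollama CLI):

# Download the model without starting a chat
ollama pull llama2

# Chat interactively (pulls the model automatically on first use)
ollama run llama2

# List models downloaded locally
ollama list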

More info about your specific model:

>>> /show info
  Model
    architecture        llama    
    parameters          6.7B     
    context length      4096     
    embedding length    4096     
    quantization        Q4_0     
 
  Capabilities
    completion    
 
  Parameters
    stop    "[INST]"      
    stop    "[/INST]"     
    stop    "<<SYS>>"     
    stop    "<</SYS>>"    
 
  License
    LLAMA 2 COMMUNITY LICENSE AGREEMENT            
    Llama 2 Version Release Date: July 18, 2023
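The same details are available without entering the REPL, and the background service also answers over HTTP (ollama show is a standard subcommand; 11434 is Ollama's default port):

# Model info straight from the shell
ollama show llama2

# One-off generation against the local REST API
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'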

What is llama.cpp?

llama.cpp is a C++ implementation of the LLaMA model inference engine that runs entirely on CPU, with optional GPU acceleration (like Metal on macOS). It is the go-to tool for running models on a laptop.

Wiki Link: https://en.wikipedia.org/wiki/Llama.cpp
GitHub Link: https://github.com/ggml-org/llama.cpp
Random Blog Guide: https://blog.steelph0enix.dev/posts/llama-cpp-guide/

# Clone it
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
 
# Build it with CMake (the old Makefile build is deprecated; Metal
# GPU acceleration is enabled by default on Apple Silicon)
cmake -B build
cmake --build build --config Release
 
# Download a GGUF model (e.g. LLaMA 2 7B Q4_0 from Hugging Face)
# Example: llama-2-7b.Q4_0.gguf
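# One way to fetch it, assuming the community TheBloke/Llama-2-7B-GGUF repo on
# Hugging Face (which hosts this exact quantization) and the huggingface-cli tool:
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir .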
 
# Run it
./build/bin/llama-cli -m ./llama-2-7b.Q4_0.gguf -p "What is the meaning of life?"
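The same build also produces llama-server, an HTTP server with an OpenAI-compatible API (binary name and flags from current llama.cpp; the path assumes the CMake build above):

# Serve the model over HTTP
./build/bin/llama-server -m ./llama-2-7b.Q4_0.gguf --port 8080

# Query it with the OpenAI-style chat endpoint
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What is the meaning of life?"}]}'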

MPS: Metal Performance Shaders

Apple’s framework for GPU-accelerated computations, like matrix multiplications and convolutions: basically, all the heavy lifting involved in running neural networks. Ollama supports MPS on macOS by default if your system has a compatible GPU, which yours does (an Apple Silicon M3).
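A quick sanity check from the terminal (system_profiler is a stock macOS tool; on recent macOS its Graphics/Displays section includes a Metal Support line):

# Confirm the GPU reports Metal support
system_profiler SPDisplaysDataType | grep Metal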