01 — Three doors
API vs Web Interface vs CLI
There are three ways to interact with AI. Most people only know one. Builders use all three.
🌐 Web Interface
ChatGPT.com, Claude.ai, gemini.google.com. Lowest friction. Best for quick questions, exploration, learning. No automation, no integration, and little control over settings such as temperature or the system prompt. You're a user on someone else's platform.
🔌 API
Send requests directly to the model via code. Full control: model, temperature, system prompt, tools, output format. Can be automated, integrated, scaled. Pay per token. You're a builder using raw materials.
⌨️ CLI
Terminal-based tools that wrap an API. Best for developers who live in the terminal. Often more powerful than web UI (scripting, piping). Examples: Claude Code, Hermes Agent, custom scripts.
When to use what:
| Situation | Use |
| --- | --- |
| Quick one-off question | Web UI |
| Learning, exploring | Web UI |
| Building an app | API |
| Automating workflows | API or CLI |
| Daily driving AI for work | CLI with tools |
| Complex multi-step projects | CLI with agent capabilities |
The progression most builders follow: start with web UI → discover API → build tools → adopt CLI agents. Each step gives you more power and more responsibility.
02 — Under the hood
Inference — what happens when you hit "send"
Inference is the technical term for running the model: taking your input and generating an output. Understanding it helps explain why AI is sometimes fast and sometimes slow, sometimes smart and sometimes dumb.
When you send a prompt, a server somewhere:
1. Loads the model: billions of parameters are loaded into GPU memory. In production they usually stay loaded between requests rather than being reloaded each time.
2. Processes input: your input tokens pass through the model's layers.
3. Generates output: output tokens are produced one at a time.
4. Returns the result: the response comes back to you.
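Step 3 is the part that shapes perceived speed the most. The toy sketch below imitates it with a hand-written lookup table in place of a neural network: each step picks the next token based on the tokens so far and appends it. The table `NEXT_TOKEN`, the `<end>` marker, and the function `generate` are all made up for illustration; a real model computes a probability distribution over its whole vocabulary at every step.

```python
# Toy stand-in for a model: maps the most recent token to the next one.
NEXT_TOKEN = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<end>",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Produce output tokens one at a time until <end> or the limit is hit."""
    tokens = list(prompt_tokens)
    output = []
    if not tokens:
        return output
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)   # each new token becomes input for the next step
        output.append(next_token)
    return output

print(generate(["the"]))  # ['cat', 'sat', 'down']
```

Because every output token needs its own pass through the loop, long answers take longer than short ones even on identical hardware.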
Why inference speed varies:
- Model size: Larger models (400B+ params) are slower than smaller ones (7B). More parameters = more computation per token.
- Hardware: NVIDIA H100 GPUs are faster than A100s, which are faster than consumer GPUs.
- Quantization: Compressed models run faster but with lower quality (more below).
- Load: When millions use ChatGPT at once, everyone gets slower responses.
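To put a rough number on the model-size point, a common rule of thumb for dense transformer models is about 2 floating-point operations per parameter for each generated token. The sketch below applies that rule to the two sizes mentioned above. Treat it as an order-of-magnitude estimate only: it ignores attention cost, mixture-of-experts models that activate a subset of parameters, and the memory-bandwidth and batching effects that often dominate real latency. The function name is illustrative.

```python
def flops_per_token(num_params):
    """Rough forward-pass cost for a dense transformer: about 2 FLOPs per parameter."""
    return 2 * num_params

small = flops_per_token(7e9)     # 7B-parameter model
large = flops_per_token(400e9)   # 400B-parameter model

print(f"7B:    {small:.1e} FLOPs per token")
print(f"400B:  {large:.1e} FLOPs per token")
print(f"ratio: {large / small:.0f}x")
```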
A caveat with cloud AI: you may not always be getting the same model quality. Providers rarely publish how they serve requests, so in principle a request could be handled by a quantized version during peak hours, routed to a less powerful fallback, or run with fewer active parameters, and you would have little way to tell from the response alone. Same API, same price, possibly different quality.
03 — The map
Provider landscape
Closed-source providers (API access only):
| Provider | Models | Strengths | Weaknesses |
| --- | --- | --- | --- |
| OpenAI | GPT-4o, o1, o3 | Largest ecosystem, strong general | Expensive at frontier, closed |
| Anthropic | Claude (Opus, Sonnet, Haiku) | Safety, long context, careful reasoning | More cautious, smaller ecosystem |
| Google | Gemini | Multimodal, huge context, Google integration | Inconsistent, sometimes generic |
Open-source providers (you can self-host):
| Provider | Notable Models | Notes |
| --- | --- | --- |
| Meta | Llama 4 | Best open-source foundation models |
| Mistral | Mistral, Mixtral | Strong European alternative |
| DeepSeek | DeepSeek V3 | Chinese, competitive quality, very cheap |
| Qwen | Qwen 3 | Alibaba, strong multilingual |
Inference providers (host models for you):
| Provider | What they do | Why use them |
| --- | --- | --- |
| OpenRouter | Route to many models | One API, many models, price comparison |
| Together AI | Fast open-source inference | Cheap, fast, good selection |
| Fireworks AI | Fast inference | Speed-optimized |
| Groq | Ultra-fast inference (custom chip) | Fastest available, limited models |
The economics: frontier model pricing (per 1M tokens, June 2026).
- Claude Opus: ~$15 input / $75 output
- GPT-4o: ~$2.50 input / $10 output
- Claude Sonnet: ~$3 input / $15 output
- DeepSeek V3: ~$0.27 input / $1.10 output
- Llama 4 (via Together): ~$0.90 input / $0.90 output
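Output tokens usually cost more than input tokens: roughly 4 to 5 times more for most models listed above, while Llama 4 via Together is priced the same in both directions. Asking for concise output literally saves money.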
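The arithmetic is simple enough to script. The sketch below copies the per-1M-token prices from the list above and compares a verbose answer (1,000 output tokens) with a concise one (250 output tokens) for the same 2,000-token prompt. The dictionary keys are informal labels chosen for this example, not real API model IDs, and the token counts are invented.

```python
# Prices in USD per 1M tokens, copied from the list above.
PRICES = {
    "claude-opus":   {"input": 15.00, "output": 75.00},
    "gpt-4o":        {"input": 2.50,  "output": 10.00},
    "claude-sonnet": {"input": 3.00,  "output": 15.00},
    "deepseek-v3":   {"input": 0.27,  "output": 1.10},
    "llama-4":       {"input": 0.90,  "output": 0.90},
}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one request in USD."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

for model in PRICES:
    verbose = request_cost(model, 2_000, 1_000)
    concise = request_cost(model, 2_000, 250)
    print(f"{model:14s} verbose ${verbose:.5f}   concise ${concise:.5f}")
```

A fraction of a cent per request looks negligible until it is multiplied by thousands of requests a day.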
⚡ Try this now
Open openrouter.ai and look at the leaderboard. What's the cheapest model right now? The most expensive? The most popular? Five minutes here and you'll understand the landscape better than most people who use AI daily.
04 — Compression
Quantization — compressed models
Quantization reduces the precision of a model's parameters to make it smaller, faster, and cheaper to run, at the cost of some quality.
The analogy is a photograph. Original: 4000×3000 pixels, 12MB, full quality. Compressed: 1000×750 pixels, 2MB, smaller and faster to load but you lose fine detail. Quantization does the same thing to model weights.
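In code, "reducing precision" means storing each weight as a small integer plus one shared scale factor. The sketch below uses simple symmetric rounding on a handful of invented weights and shows how the round-trip error grows as the bit width shrinks. It is a minimal illustration, not how production quantizers work: real schemes quantize per channel or per block and use calibration data. The helper names are assumptions, and the code expects at least one non-zero weight.

```python
def quantize(weights, bits):
    """Symmetric quantization: map floats to integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax   # one scale shared by all weights
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]

weights = [0.82, -0.41, 0.07, -0.95, 0.33]   # made-up example values

for bits in (8, 4, 2):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}  worst error={worst:.3f}")
```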
Common quantization levels:
| Level | Precision | Size Reduction | Quality Impact |
| --- | --- | --- | --- |
| FP16 (original) | 16-bit | 1x (baseline) | Full quality |
| FP8 | 8-bit | ~2x smaller | Minimal loss |
| INT8 | 8-bit | ~2x smaller | Small loss |
| INT4 | 4-bit | ~4x smaller | Noticeable loss |
| INT2 | 2-bit | ~8x smaller | Significant loss |
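The size column follows directly from bits per parameter. As a worked example, the sketch below computes the weight size of a 7B-parameter model at each level in the table. It counts weights only and assumes 1 GB = 1e9 bytes; real deployments also need memory for activations and the attention cache, and quantized formats carry a small overhead for scale factors.

```python
BITS = {"FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4, "INT2": 2}

def weight_size_gb(num_params, bits):
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

for level, bits in BITS.items():
    print(f"{level}: {weight_size_gb(7e9, bits):.2f} GB")
```

At FP16 the 7B model needs 14 GB for weights, which is more than many consumer GPUs offer; at INT4 it needs 3.5 GB, which is why 4-bit builds are popular on laptops.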
✓ When quantization is fine
- Simple tasks (summarization, formatting, classification)
- Bulk processing where speed > peak quality
- Running on limited hardware (consumer GPUs, laptops)
✗ When to avoid it
- Complex reasoning tasks
- Nuanced creative work
- Tasks where accuracy is critical
05 — The fork
Open-source vs closed-source
One of the most important decisions in AI: do you use a proprietary model via API, or an open-source model you can host yourself?
Closed-source
Pros: Best quality, zero maintenance, always up to date.
Cons: Data goes to third party, can't customize, vendor lock-in, costs can scale unpredictably.
Open-source
Pros: Full control, data stays private, can fine-tune, no vendor lock-in, cheaper at scale.
Cons: Need infrastructure, quality gap with frontier models, maintenance burden.
The trend: the gap is closing fast. DeepSeek V3 and Qwen 3 are competitive with GPT-4o on many tasks. In 2026, the choice isn't "open-source is worse." It's "open-source requires more work but gives you more control."
06 — Build it
Build your first AI tool
You don't need to be a senior developer to build something useful with AI. You need to understand the loop.
Input → Format prompt → Send to API → Get response → Use output
Every AI tool, from a simple chatbot to a complex agent, follows this loop. The complexity comes from what you add around it.
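Here is the loop as a minimal Python sketch. To keep it runnable offline, the API call is replaced by a stand-in function, `fake_send`, that returns a response shaped like an OpenAI-style chat completion. All function names and the system prompt are assumptions made for this example; in a real tool you would pass your own `send` function that performs the HTTP request.

```python
def format_prompt(user_input):
    """Wrap the raw input in the message list the API expects."""
    return [
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": user_input},
    ]

def fake_send(messages):
    """Stand-in for the API call so the sketch runs without a key or network."""
    question = messages[-1]["content"]
    reply = {"role": "assistant", "content": f"(reply to: {question})"}
    return {"choices": [{"message": reply}]}

def run_once(user_input, send=fake_send):
    messages = format_prompt(user_input)                   # format prompt
    response = send(messages)                              # send to API
    text = response["choices"][0]["message"]["content"]    # get response
    return text.strip()                                    # use output

print(run_once("What is inference?"))
```

Passing `send` as a parameter keeps the loop testable: the same `run_once` works with the fake during development and with a real client in production.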
1. Simple wrapper: takes input, sends it to the API with a system prompt, and displays the response. Example: a customer support chatbot.
2. With context: add a search of your database for relevant info, then combine input and context in the prompt. Example: an AI that answers questions about your docs.
3. With tools: the model requests tool calls (search, calculate), your system executes them and feeds the results back. Example: an AI assistant that can actually DO things.
4. With memory: add persistent storage for important facts and load the relevant memory each session. Example: a personal AI that knows your preferences and history.
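The sketch below stacks levels 2 to 4 on top of the basic loop so the moving parts are visible in one place. Everything in it is a simplified stand-in chosen for illustration: `DOCS` and the keyword lookup replace a real search, `fake_model` replaces the API and follows an invented reply format, `TOOLS` holds a single toy tool, and memory is an in-process list where a real tool would use a file or database.

```python
DOCS = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}

TOOLS = {"add": lambda a, b: a + b}   # level 3: tools the system is willing to run

def retrieve(question):
    """Level 2: naive keyword lookup standing in for a real search."""
    return [text for topic, text in DOCS.items() if topic in question.lower()]

def build_prompt(question, memory):
    return (
        "Known history: " + "; ".join(memory) + "\n"
        "Context: " + " ".join(retrieve(question)) + "\n"
        "Question: " + question
    )

def fake_model(prompt, tool_result=None):
    """Pretend model: requests a tool once, then answers using the result."""
    question_line = prompt.splitlines()[-1]
    if tool_result is not None:
        return {"answer": f"The result is {tool_result}."}
    if "plus" in question_line:
        return {"tool": "add", "args": [2, 3]}
    return {"answer": "Answered from context."}

def run(question, memory):
    prompt = build_prompt(question, memory)
    reply = fake_model(prompt)
    if "tool" in reply:                                    # the model requested a tool call
        result = TOOLS[reply["tool"]](*reply["args"])      # the system executes it
        reply = fake_model(prompt, tool_result=result)     # and feeds the result back
    memory.append(f"asked: {question}")                    # level 4: remember for next time
    return reply["answer"]

memory = []
print(run("What is 2 plus 3?", memory))
print(run("How do refunds work?", memory))
```

Note that the model never runs a tool itself. It only asks, and your code decides whether and how to execute the request.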
You don't need to build the next ChatGPT. You need to build the tool that makes YOUR specific workflow 10x better.
⚡ Try this now
If you have an OpenAI API key, export it as YOUR_KEY and run this in your terminal (Anthropic, OpenRouter, and other providers use their own endpoint URLs, headers, and model names):
curl https://api.openai.com/v1/chat/completions \
-H "Authorization: Bearer $YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello"}]}'
You just called an AI model from the command line. That curl is every API tool you'll ever build, stripped to its core.
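For comparison, here is the same request built with Python's standard library. The sketch only constructs the request and prints its method and URL; the lines that would send it are left as comments because they need a valid key and network access. The helper name `build_chat_request` is an assumption, and the key is read from the same `YOUR_KEY` variable the curl command uses.

```python
import json
import os
import urllib.request

def build_chat_request(prompt, api_key, model="gpt-3.5-turbo"):
    """Build the same POST request as the curl command."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_chat_request("Hello", os.environ.get("YOUR_KEY", "missing-key"))
print(request.get_method(), request.full_url)

# To actually send it (needs a valid key and network access):
# with urllib.request.urlopen(request, timeout=30) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```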
07 — Watch out
What can go wrong at this level
Once you start building, the mistakes shift from "wrong answer" to "broken system." Here are the four that bite builders most often.
1. Picking a provider on hype
You hear "Claude is better than GPT" and switch immediately. It's not that simple. Every provider trades something: speed against quality, price against context window, privacy against features.
How to avoid: Test it yourself. Send the same task to three providers and compare. What's best for someone else may be wrong for your workload.
2. Not tracking costs
You route everything through GPT-4o, including summarization tasks a tiny model could handle. The bill runs 10x higher than it needs to. Worse, you don't even know how many tokens each request burns.
How to avoid: Monitor usage. Use cheap models for simple work (translate, summarize, format) and reserve expensive models for hard reasoning (code review, analysis).
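A minimal sketch of both habits, routing by task type and logging tokens per request, might look like this. The model names are placeholders rather than real model IDs, the task labels are assumptions, and the token counts are invented; in practice most APIs report the actual token usage in each response.

```python
SIMPLE_TASKS = {"translate", "summarize", "format"}

CHEAP_MODEL = "small-cheap-model"          # placeholder, not a real model ID
EXPENSIVE_MODEL = "large-frontier-model"   # placeholder, not a real model ID

usage_log = []

def pick_model(task_type):
    """Send simple work to the cheap model, everything else to the expensive one."""
    return CHEAP_MODEL if task_type in SIMPLE_TASKS else EXPENSIVE_MODEL

def record_usage(task_type, input_tokens, output_tokens):
    """Log token counts per request so the bill is never a surprise."""
    model = pick_model(task_type)
    usage_log.append(
        {"task": task_type, "model": model, "input": input_tokens, "output": output_tokens}
    )
    return model

def tokens_by_model():
    """Total tokens consumed by each model so far."""
    totals = {}
    for entry in usage_log:
        used = entry["input"] + entry["output"]
        totals[entry["model"]] = totals.get(entry["model"], 0) + used
    return totals

record_usage("summarize", 1200, 150)
record_usage("code_review", 4000, 900)
record_usage("translate", 300, 320)
print(tokens_by_model())
```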
3. Not handling errors
You build an app that calls an AI API. You don't handle rate limits, timeouts, malformed responses, or network failures. The first hiccup and your app crashes for the user.
How to avoid: Always handle the failure paths: try/catch, retry with exponential backoff, timeouts, and a fallback response. AI APIs are unreliable by nature. Plan for it.
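Retry with exponential backoff and a fallback can be sketched in a few lines. In this example `TransientError` is a hypothetical exception standing in for whatever your HTTP client raises on rate limits, timeouts, or network failures, and `flaky_call` simulates an API that fails twice before succeeding. The request timeout itself belongs on the HTTP call and is not shown here.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for rate limits, timeouts, and network failures."""

def call_with_retry(call, max_attempts=4, base_delay=1.0,
                    fallback="Sorry, something went wrong. Please try again.",
                    sleep=time.sleep):
    """Run call(); on TransientError wait about 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt < max_attempts - 1:
                # Double the wait each time; a little jitter avoids synchronized retries.
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    return fallback   # every attempt failed: degrade gracefully instead of crashing

attempts = {"count": 0}

def flaky_call():
    """Fails twice, then succeeds, to imitate a rate-limited API."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "model reply"

# sleep is swapped for a no-op so the demo finishes instantly.
print(call_with_retry(flaky_call, sleep=lambda seconds: None))
```

Only errors that might succeed on retry are caught. A bad API key or a malformed request should fail immediately, since retrying will not fix it.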
4. Hardcoding API keys
You paste an API key straight into code and push it to GitHub. Within seconds, automated bots scan it. Within minutes, someone is burning your key to generate content. Bills can climb into the thousands before you notice.
How to avoid: Always load keys from environment variables. Add .env to your .gitignore. If a key ever leaks, revoke it immediately from the provider dashboard.
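Loading the key from the environment takes only a few lines. In the sketch below the variable name `OPENAI_API_KEY` is a common convention used here as an assumption, and `load_api_key` is an illustrative helper. Failing early with a clear message beats sending a request with an empty key and debugging the resulting authentication error.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the key from the environment; never store it in source code."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key

try:
    api_key = load_api_key()
    print("Key loaded, ending in", api_key[-4:])   # never print the whole key
except RuntimeError as error:
    print(error)
```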
08 — What you now know
What you should know after Level 3
You now understand the builder's perspective.
You have the knowledge to start building. The tools are accessible. The APIs are well-documented. The models are capable. The only thing left is to actually build something.