Updated Guide (Sept 2025)
Core Principles
- LLMs are predictors, not logic engines. They generate plausible answers, not guaranteed truths.
- They still behave like a brilliant but distractible assistant: fast and knowledgeable, but they can lose track, misremember, or follow your lead even when you’re wrong.
Tokens & Context
- Token windows are expanding rapidly:
- GPT‑5: ~256k tokens (standard), higher in some API tiers.
- GPT‑4.1 / GPT‑4o-mini: up to 1M tokens in enterprise contexts.
- Claude 4 (Opus/Sonnet): ~200k–1M tokens depending on variant and plan.
- Gemini 2.5 Flash/Pro: ~1M tokens input, ~65k output (Flash documented).
- Qwen 2.5: ~32k tokens for 72B Instruct.
- Perplexity Sonar: ~128k tokens context reported.
- Context drops still occur when limits are exceeded; older messages get truncated.
- Best practice: Don’t rely on infinite context. Chunk long work, summarize checkpoints, and use external docs (see the sketch below).
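A minimal sketch of the chunk-and-checkpoint pattern, assuming the tiktoken tokenizer library; `summarize_chunk` is a hypothetical stand-in for whatever model call you use:

```python
# Chunk-and-checkpoint sketch. Assumes the tiktoken library;
# summarize_chunk is a hypothetical stand-in for your model call.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many OpenAI models

def chunk_by_tokens(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Carry a running summary forward as a checkpoint instead of
# resending the full history with every request:
#   summary = ""
#   for chunk in chunk_by_tokens(long_document):
#       summary = summarize_chunk(summary, chunk)
```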
Model Behavior
What’s Stable
- Sessions need focus.
- Breaking tasks into small steps works best.
- Reinforcing key info improves consistency.
What’s Changed
- Memory features are rolling out broadly. Many models now:
- Remember preferences and style between sessions.
- Can be put into “private” or “incognito” mode.
- Safety tuning is stronger: models hedge more, avoid risky content, and may refuse edge cases that older versions answered.
Confidence vs Accuracy
- Models remain agreeable; they may mirror your errors.
- They are less likely to argue, more likely to hedge.
- Rule of thumb:
- Trust for brainstorming, summarization, code scaffolding.
- Verify for math, logic, critical facts.
Choosing the Right Model (Sept 2025 Landscape)
| Model | Context Window (tokens) | Output Limit (tokens) | Memory | Tools | Notes / Caveats |
|---|---|---|---|---|---|
| GPT‑3.5 | ~16k | ~4k–8k typical | No | Limited | Fast, shallow; context limited vs newer models |
| GPT‑4‑turbo | ~128k | ~8k–16k typical | Yes | Yes | Stable for deep context; widely used baseline |
| GPT‑4.1 family | Up to ~1M (enterprise/API) | ~32k output | Yes | Yes | Highest OpenAI context, but availability varies by plan |
| GPT‑5 | ~256k (standard) | Unclear; likely ~32k | Yes | Yes | Reported new limit; higher tiers may extend |
| GPT‑4o / 4o‑mini | ~128k | Lower output in UI | Yes | Yes | Fast, multimodal; many free plans still capped lower |
| Claude 4 (Opus/Sonnet) | ~200k–1M depending on tier | ~64k output | Yes | Varies | Some docs cite 200k, Anthropic announced 1M for Sonnet 4; plan dependent |
| Gemini 2.5 Flash | 1,048,576 input | ~65k output | Yes | Yes | Google‑documented; API only; Flash = speed optimized |
| Gemini 2.5 Pro | ~1M (reported) | Not clearly specified | Yes | Yes | Output caps unclear; varies by tier and preview vs GA |
| Qwen 2.5 (72B Instruct) | ~32k | Not published | Varies | Varies | Numbers from model cards; enterprise variants may differ |
| Perplexity Sonar | ~128k | Not published | Varies | Yes | Context limit reported; output cap undocumented |
Important caveats:
- Many limits apply only to paid/enterprise tiers.
- Output often capped lower than input windows.
- UI versions may expose far smaller windows than API.
- “Million token” contexts are real but not always reliable in practice; performance can degrade.
File & Data Handling
- Uploading many files increases confusion and token load.
- Best practice:
- Work with 2–3 files at a time.
- Use phased comparisons (A vs B, then compare with C); see the sketch below.
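One way to run a phased comparison is from a script, so each prompt carries exactly two documents. This sketch assumes the OpenAI Python SDK; the model name is illustrative:

```python
# Phased comparison sketch: A vs B first, then the result vs C.
# Assumes the OpenAI Python SDK; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def compare(doc_a: str, doc_b: str, question: str) -> str:
    """Ask the model about exactly two documents at a time."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"{question}\n\n--- Document A ---\n{doc_a}"
            f"\n\n--- Document B ---\n{doc_b}")}],
    )
    return resp.choices[0].message.content

# Phase 1: compare A and B; Phase 2: bring C in against that result.
# ab = compare(a_text, b_text, "Summarize the key differences.")
# final = compare(ab, c_text, "How does Document B differ from this comparison?")
```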
- Modalities now matter:
- Images: OCR quality impacts parsing.
- Audio/video: transcription may drop detail.
Math & Logic (Still Weak Spots)
- Good at pattern math, bad at symbolic manipulation.
- Multi-step logic is still fragile without explicit scaffolding.
- Better option (see the sketch after this list):
- Ask for code to compute results.
- Use external math engines for reliability.
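A minimal sketch of the “ask for code, then run it” approach, using SymPy for exact symbolic results:

```python
# Run the math instead of trusting the model's arithmetic.
# SymPy gives exact symbolic results.
from sympy import symbols, solve, simplify

x = symbols("x")

# Solve a quadratic exactly rather than accepting a stated answer.
print(solve(x**2 - 5*x + 6, x))  # [2, 3]

# Check an algebraic identity the model claimed:
# (x^2 - 1) / (x - 1) should equal x + 1.
print(simplify((x**2 - 1) / (x - 1) - (x + 1)) == 0)  # True
```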
Practical Tips
- Be explicit about output form (“full response, no summary”).
- Use structured prompts with headings/checkpoints (template below).
- Offload long-running work to shared docs (Google Docs, Notion, etc.).
- Reset threads if drift/contradictions appear.
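A sketch of one such structure; the heading names and checkpoint steps are a convention, not anything the model requires:

```python
# Structured prompt template with headings and checkpoints.
# Section names are a convention, not a model requirement.
PROMPT = """\
## Task
Refactor the attached module for readability.

## Constraints
- Keep the public API unchanged.
- Full response, no summary.

## Checkpoints
1. Before editing, list the functions you will touch.
2. After each function, state what changed and why.

## Input
{code}
"""

# Usage: PROMPT.format(code=open("module.py").read())
```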
Memory & Personalization
- Memory is now active by default in many systems.
- Check settings: you can view, edit, or wipe memory.
- Use memory deliberately (preferences, style notes, recurring project context).
- Prune old/irrelevant data regularly.
Considerations
- Costs: Longer contexts = higher cost. Be conscious of diminishing returns.
- Privacy: Many providers now use chats for training unless you opt out. Review policies.
- Safety: Guardrails can change without notice—what worked last month may be blocked today.
Appendix: Tokenization Nuances
- Tokenization still varies by model and platform.
- Special characters, hyphens, and punctuation can inflate counts (measured in the sketch at the end of this appendix).
- Symptoms of hitting limits:
- Truncated or partial code.
- Lost references to earlier variables.
- Mitigation:
- Use explicit requests for “complete code”.
- Provide earlier snippets manually.
- Break large codebases into modules.
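A quick way to see the count inflation for yourself, again assuming tiktoken; exact counts vary by tokenizer, so treat the output as illustrative:

```python
# Token-count check. Assumes tiktoken; counts differ by tokenizer,
# so treat the output as illustrative rather than universal.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for sample in ["state of the art", "state-of-the-art", "état de l'art"]:
    print(f"{sample!r}: {len(enc.encode(sample))} tokens")
# Hyphens, accents, and punctuation often add tokens
# relative to plain-word equivalents.
```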
Bottom Line
- Keep chats focused, chunk work, and verify important details.
- Know your model’s current limits, tools, and memory features.
- Treat AI as an assistant—not a calculator, not a database, not an oracle.