Updated Guide (Sept 2025)
Core Principles
- LLMs are predictors, not logic engines. They generate plausible answers, not guaranteed truths.
- They still behave like a brilliant but distractible assistant: fast and knowledgeable, but they can lose track, misremember, or follow your lead even when you’re wrong.
Tokens & Context
- Token windows are expanding rapidly:
- GPT‑5: ~256k tokens (standard), higher in some API tiers.
- GPT‑4.1 / GPT‑4o-mini: up to 1M tokens in enterprise contexts.
- Claude 4 (Opus/Sonnet): ~200k–1M tokens depending on variant and plan.
- Gemini 2.5 Flash/Pro: ~1M tokens input, ~65k output (Flash documented).
- Qwen 2.5: ~32k tokens for 72B Instruct.
- Perplexity Sonar: ~128k tokens context reported.
- Context drops still occur when limits are exceeded; older messages get truncated.
- Best practice: Don’t rely on infinite context. Chunk long work, summarize checkpoints, and use external docs (see the sketch below).
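A minimal sketch of the chunk-and-checkpoint pattern, assuming the tiktoken tokenizer library; `summarize_chunk` is a hypothetical stand-in for whatever model call you use:

```python
# Chunk-and-checkpoint sketch. Assumes the tiktoken library;
# summarize_chunk is a hypothetical stand-in for your model call.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many OpenAI models

def chunk_by_tokens(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Carry a running summary forward as a checkpoint instead of
# resending the full history with every request:
#   summary = ""
#   for chunk in chunk_by_tokens(long_document):
#       summary = summarize_chunk(summary, chunk)
```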
Model Behavior
What’s Stable
- Sessions need focus.
- Breaking tasks into small steps works best.
- Reinforcing key info improves consistency.
What’s Changed
- Memory features are rolling out broadly. Many models now:
- Remember preferences and style between sessions.
- Can be put into “private” or “incognito” mode.
- Safety tuning is stronger: models hedge more, avoid risky content, and may refuse edge cases that older versions answered.
Confidence vs Accuracy
- Models remain agreeable; they may mirror your errors.
- They are less likely to argue, more likely to hedge.
- Rule of thumb:
- Trust for brainstorming, summarization, code scaffolding.
- Verify for math, logic, critical facts.
Choosing the Right Model (Sept 2025 Landscape)
| Model | Context Window (tokens) | Output Limit (tokens) | Memory | Tools | Notes / Caveats |
|---|---|---|---|---|---|
| GPT‑3.5 | ~16k | ~4k–8k typical | No | Limited | Fast, shallow; context limited vs newer models |
| GPT‑4‑turbo | ~128k | ~8k–16k typical | Yes | Yes | Stable for deep context; widely used baseline |
| GPT‑4.1 family | Up to ~1M (enterprise/API) | ~32k output | Yes | Yes | Highest OpenAI context, but availability varies by plan |
| GPT‑5 | ~256k (standard) | Unclear; likely ~32k | Yes | Yes | Reported new limit; higher tiers may extend |
| GPT‑4o / 4o‑mini | ~128k | Lower output in UI | Yes | Yes | Fast, multimodal; many free plans still capped lower |
| Claude 4 (Opus/Sonnet) | ~200k–1M depending on tier | ~64k output | Yes | Varies | Some docs cite 200k, Anthropic announced 1M for Sonnet 4; plan dependent |
| Gemini 2.5 Flash | 1,048,576 input | ~65k output | Yes | Yes | Google‑documented; API only; Flash = speed optimized |
| Gemini 2.5 Pro | ~1M (reported) | Not clearly specified | Yes | Yes | Output caps unclear; varies by tier and preview vs GA |
| Qwen 2.5 (72B Instruct) | ~32k | Not published | Varies | Varies | Numbers from model cards; enterprise variants may differ |
| Perplexity Sonar | ~128k | Not published | Varies | Yes | Context limit reported; output cap undocumented |
Important caveats:
- Many limits apply only to paid/enterprise tiers.
- Output often capped lower than input windows.
- UI versions may expose far smaller windows than API.
- “Million token” contexts are real but not always reliable in practice; performance can degrade.
File & Data Handling
- Uploading many files increases confusion and token load.
- Best practice:
- Work with 2–3 files at a time.
- Use phased comparisons (A vs B, then compare with C); see the sketch below.
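One way to run a phased comparison is from a script, so each prompt carries exactly two documents. This sketch assumes the OpenAI Python SDK; the model name is illustrative:

```python
# Phased comparison sketch: A vs B first, then the result vs C.
# Assumes the OpenAI Python SDK; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def compare(doc_a: str, doc_b: str, question: str) -> str:
    """Ask the model about exactly two documents at a time."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"{question}\n\n--- Document A ---\n{doc_a}"
            f"\n\n--- Document B ---\n{doc_b}")}],
    )
    return resp.choices[0].message.content

# Phase 1: compare A and B; Phase 2: bring C in against that result.
# ab = compare(a_text, b_text, "Summarize the key differences.")
# final = compare(ab, c_text, "How does Document B differ from this comparison?")
```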
- Modalities now matter:
- Images: OCR quality impacts parsing.
- Audio/video: transcription may drop detail.
Math & Logic (Still Weak Spots)
- Good at pattern math, bad at symbolic manipulation.
- Multi-step logic is still fragile without explicit scaffolding.
- Better option (see the sketch after this list):
- Ask for code to compute results.
- Use external math engines for reliability.
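A minimal sketch of the “ask for code, then run it” approach, using SymPy for exact symbolic results:

```python
# Run the math instead of trusting the model's arithmetic.
# SymPy gives exact symbolic results.
from sympy import symbols, solve, simplify

x = symbols("x")

# Solve a quadratic exactly rather than accepting a stated answer.
print(solve(x**2 - 5*x + 6, x))  # [2, 3]

# Check an algebraic identity the model claimed:
# (x^2 - 1) / (x - 1) should equal x + 1.
print(simplify((x**2 - 1) / (x - 1) - (x + 1)) == 0)  # True
```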
Practical Tips
- Be explicit about output form (“full response, no summary”).
- Use structured prompts with headings/checkpoints (template below).
- Offload long-running work to shared docs (Google Docs, Notion, etc.).
- Reset threads if drift/contradictions appear.
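A sketch of one such structure; the heading names and checkpoint steps are a convention, not anything the model requires:

```python
# Structured prompt template with headings and checkpoints.
# Section names are a convention, not a model requirement.
PROMPT = """\
## Task
Refactor the attached module for readability.

## Constraints
- Keep the public API unchanged.
- Full response, no summary.

## Checkpoints
1. Before editing, list the functions you will touch.
2. After each function, state what changed and why.

## Input
{code}
"""

# Usage: PROMPT.format(code=open("module.py").read())
```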
Memory & Personalization
- Memory is now active by default in many systems.
- Check settings: you can view, edit, or wipe memory.
- Use memory deliberately (preferences, style notes, recurring project context).
- Prune old/irrelevant data regularly.
Considerations
- Costs: Longer contexts = higher cost. Be conscious of diminishing returns.
- Privacy: Many providers now use chats for training unless you opt out. Review policies.
- Safety: Guardrails can change without notice—what worked last month may be blocked today.
Appendix: Tokenization Nuances
- Tokenization still varies by model and platform.
- Special characters, hyphens, and punctuation can inflate counts (measured in the sketch at the end of this appendix).
- Symptoms of hitting limits:
- Truncated or partial code.
- Lost references to earlier variables.
- Mitigation:
- Use explicit requests for “complete code”.
- Provide earlier snippets manually.
- Break large codebases into modules.
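A quick way to see the count inflation for yourself, again assuming tiktoken; exact counts vary by tokenizer, so treat the output as illustrative:

```python
# Token-count check. Assumes tiktoken; counts differ by tokenizer,
# so treat the output as illustrative rather than universal.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for sample in ["state of the art", "state-of-the-art", "état de l'art"]:
    print(f"{sample!r}: {len(enc.encode(sample))} tokens")
# Hyphens, accents, and punctuation often add tokens
# relative to plain-word equivalents.
```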
Bottom Line
- Keep chats focused, chunk work, and verify important details.
- Know your model’s current limits, tools, and memory features.
- Treat AI as an assistant—not a calculator, not a database, not an oracle.