Ever wonder how ChatGPT streams responses token-by-token? Or how to push real-time data from a server to a browser? The secret is Server-Sent Events (SSE).
I built an SSE server from scratch (now streaming LLM responses!) to understand how it works. Turns out, it's ridiculously simple.
## What's SSE?
SSE is a way for servers to push data to browsers. Unlike WebSockets, it's:
- One-way: Server → Client only
- Just HTTP: No special protocol, no handshake dance
- Auto-reconnect: Browser handles reconnection automatically
- Simple: You can implement it in 20 lines of code
Use cases:
- LLM streaming (ChatGPT-style responses)
- Live notifications
- Stock tickers
- Social media feeds
- Log streaming
- Any "subscribe to updates" scenario
## The Entire Protocol in 30 Seconds
Here's how SSE works:
1. Client makes a regular HTTP request:
```javascript
const eventSource = new EventSource('/events');
```

2. Server responds with special headers:
```http
HTTP/1.1 200 OK
Content-Type: text/event-stream   ← This is the magic
Cache-Control: no-cache
Connection: keep-alive
```

3. Server keeps sending data:
```text
data: Hello!\n
\n
data: Another message\n
\n
data: Keep going...\n
\n
```

That's it. The connection stays open, and the server pushes messages whenever it wants.
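That "20 lines of code" claim from earlier holds up. Here's a minimal sketch using nothing but Python's standard library (the port and messages are placeholders, not from the demo server):

```python
import socket
import time

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 8080))
server.listen()

while True:
    client, _ = server.accept()
    client.recv(4096)  # read and discard the HTTP request
    client.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/event-stream\r\n"
        b"Cache-Control: no-cache\r\n"
        b"\r\n"
    )
    for i in range(5):  # push five messages, one per second
        client.sendall(f"data: tick {i}\n\n".encode())
        time.sleep(1)
    client.close()
```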
## The Code
Here's the core of an SSE server (streaming LLM responses):
```python
from openai import OpenAI

def stream_events(client, prompt):
    # SSE headers - this is what makes it work
    response = "HTTP/1.1 200 OK\r\n"
    response += "Content-Type: text/event-stream\r\n"  # Magic!
    response += "Cache-Control: no-cache\r\n"
    response += "\r\n"
    client.sendall(response.encode())

    # Connect to LM Studio (or any OpenAI-compatible API)
    llm_client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    # Stream the LLM response token-by-token
    stream = llm_client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            # Each token becomes its own SSE message
            client.sendall(f"data: {token}\n\n".encode())
```

On the client side:
```javascript
// encodeURIComponent keeps special characters in the prompt from breaking the URL
const events = new EventSource('/events?prompt=' + encodeURIComponent(userInput));
events.onmessage = function(e) {
    // Each token arrives here as it's generated
    outputDiv.innerHTML += e.data; // Accumulate the response
};
```

The browser does all the heavy lifting. It keeps the connection open and fires `onmessage` for each token as it streams from the LLM.
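The snippet above also assumes some glue around it: something has to accept the connection and pull `prompt` out of the query string before calling `stream_events`. Reusing the accept loop from the minimal server earlier, that glue might look like this (`handle` is a hypothetical name, and a real server would parse the request more carefully):

```python
from urllib.parse import parse_qs, urlparse

def handle(client):
    # Hypothetical glue; the request line looks like: GET /events?prompt=hello HTTP/1.1
    request = client.recv(4096).decode()
    path = request.split(" ", 2)[1]
    # parse_qs also percent-decodes, matching encodeURIComponent on the client
    prompt = parse_qs(urlparse(path).query).get("prompt", [""])[0]
    stream_events(client, prompt)
    client.close()
```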
## SSE Message Format
Messages are dead simple:
```text
data: your message here\n
\n
```

- Start with `data: `
- End with `\n\n` (two newlines)
For multi-line messages:
```text
data: line one\n
data: line two\n
\n
```

You can also add event types and IDs:
```text
event: notification\n
id: 42\n
data: You have a new message!\n
\n
```

The `id` also powers reconnection: when the browser reconnects, it sends the last ID it saw in a `Last-Event-ID` header so the server can resume where it left off. The client can listen for specific events:
```javascript
events.addEventListener('notification', function(e) {
    alert(e.data);
});
```
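Generating these messages is just string formatting. A small helper (my naming, not part of any standard library) that follows the rules above:

```python
def format_sse(data, event=None, event_id=None):
    """Build one SSE message; multi-line data becomes multiple data: fields."""
    msg = ""
    if event is not None:
        msg += f"event: {event}\n"
    if event_id is not None:
        msg += f"id: {event_id}\n"
    for line in data.splitlines() or [""]:
        msg += f"data: {line}\n"
    return msg + "\n"  # the blank line terminates the message

# format_sse("You have a new message!", event="notification", event_id=42)
# -> "event: notification\nid: 42\ndata: You have a new message!\n\n"
```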
## SSE vs WebSockets

| Feature | SSE | WebSockets |
|---|---|---|
| Direction | Server → Client | Bidirectional |
| Protocol | HTTP | Custom (ws://) |
| Complexity | Simple | More complex |
| Reconnection | Automatic | Manual |
| Binary data | No (text only) | Yes |
| Browser support | All modern | All modern |
Use SSE when:
- You only need server-to-client
- You want simplicity
- You're already using HTTP
Use WebSockets when:
- You need client-to-server too (chat, games)
- You need binary data
- You need absolute minimum latency
## Why SSE Is Underrated
- It's just HTTP — works through proxies, load balancers, firewalls
- No library needed — `EventSource` is built into browsers
- Auto-reconnect — if the connection drops, the browser retries automatically
- Simple to debug — it's plain text; use curl or browser devtools (see the example after this list)
- No handshake — starts immediately, no back-and-forth
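For example, assuming the demo server from this post is running on port 8080, you can watch the raw stream in a terminal (`-N` disables curl's output buffering, so messages print as they arrive):

```bash
curl -N 'http://localhost:8080/events?prompt=hello'
```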
## Why SSE Is Perfect for LLM Streaming
This is why every major LLM provider (OpenAI, Anthropic, Google) uses SSE for streaming:
- Token-by-token delivery — Each token is a separate SSE event, displayed immediately
- Built-in buffering — Browser handles all the complexity of assembling partial responses
- Automatic recovery — If connection drops mid-generation, browser can reconnect
- Unidirectional — LLMs only need server → client (no need for WebSocket bidirectional complexity)
- HTTP-based — Works everywhere, through all proxies and firewalls
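On the wire, those provider streams are ordinary SSE messages carrying JSON. A simplified excerpt of OpenAI-style output (most fields omitted for brevity):

```text
data: {"choices":[{"delta":{"content":"Hel"}}]}

data: {"choices":[{"delta":{"content":"lo!"}}]}

data: [DONE]
```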
The alternative would be:
- Long polling → Horrible latency, terrible UX
- WebSockets → Overkill for one-way streaming
- Wait for full response → No streaming feel, slow perceived performance
SSE is the sweet spot. That's why ChatGPT, Claude, and every other AI chat interface uses it under the hood.
## Try It
First, set up LM Studio:
- Download from https://lmstudio.ai/
- Load any model (Llama, Mistral, Phi, etc.)
- Start the local server (defaults to port 1234)
Then:
```bash
pip install openai
python server.py
```

Open http://localhost:8080, type a prompt, and click "Generate". Watch the LLM response stream in token-by-token, just like ChatGPT.
Open DevTools → Network tab. You'll see a single /events request that stays open. The "Size" column keeps growing as tokens stream in.
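You don't even need a browser to consume the stream. A quick sketch with the `requests` library (assuming the same server and endpoint):

```python
import requests

# Open the SSE stream and print tokens as they arrive
with requests.get(
    "http://localhost:8080/events",
    params={"prompt": "hello"},
    stream=True,  # don't buffer the whole response
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
```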
That's SSE powering real-time LLM responses.
## The Bottom Line
SSE is HTTP's built-in streaming solution. If you need real-time server-to-client updates and don't want the complexity of WebSockets, SSE is your answer.
It's been around since 2006, it works in all browsers, and you can implement it from scratch in an afternoon.
Sometimes the simple solution is the right one.