How to Build a Streaming Chat App With GeneralCompute + Node.js

A practical guide to building a streaming chat application using the OpenAI SDK for Node.js pointed at GeneralCompute's API, including SSE architecture, context management, and a full working example.
Author: General Compute
Published: 2026-06-18
Tags: nodejs, streaming, tutorial, openai sdk, sse, chat


Streaming chat responses feel dramatically better than waiting for the full reply to land at once. Users see text appearing as it generates, which makes even a 2-second response feel responsive. This guide walks through building a streaming chat app backed by GeneralCompute, using the `openai` Node.js package and Server-Sent Events (SSE) to push tokens from server to browser in real time.

The OpenAI SDK for Node.js works with GeneralCompute out of the box. Because GeneralCompute exposes an OpenAI-compatible API, you change the base URL and API key and your existing code keeps working -- with meaningfully faster token generation.

## What We're Building

A Node.js/Express server that:

1. Accepts a POST request with a conversation history
2. Calls GeneralCompute's streaming completions endpoint using the `openai` SDK
3. Pipes the token stream back to the client over SSE

A minimal browser frontend that:

1. Sends the user's message to the server
2. Opens an SSE connection and renders tokens as they arrive
3. Maintains conversation history for multi-turn context

By the end you'll have a working chat app you can extend with any model GeneralCompute supports.

## Prerequisites

- Node.js 18 or later (native `fetch` and `ReadableStream` support)
- A GeneralCompute API key (get one at [generalcompute.com](https://generalcompute.com))
- Basic familiarity with Express

## Project Setup

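```bash
mkdir gc-streaming-chat && cd gc-streaming-chat
mkdir public
npm init -y
# server.js uses ES module `import` syntax, so mark the package as a module
npm pkg set type=module
npm install express openai dotenv
```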

Create a `.env` file:

```
GENERALCOMPUTE_API_KEY=your_api_key_here
PORT=3000
```

Your directory structure will look like this:

```
gc-streaming-chat/
  server.js
  public/
    index.html
  .env
```

## The Server

### Initializing the OpenAI Client

The `openai` package accepts a `baseURL` option. Point it at GeneralCompute and it handles the rest:

```javascript
// server.js
import OpenAI from "openai";
import express from "express";
import "dotenv/config";

const client = new OpenAI({
  apiKey: process.env.GENERALCOMPUTE_API_KEY,
  baseURL: "https://api.generalcompute.com/v1",
});

const app = express();
app.use(express.json());
app.use(express.static("public"));
```
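
It can help to fail fast at startup when the API key is missing, rather than discovering it on the first request. The sketch below is one way to do that. `loadConfig` is a hypothetical helper, not part of the `openai` or `dotenv` packages, and the fallback port of 3000 is an assumption chosen to match the `.env` example.

```javascript
// Hypothetical helper: validate environment variables before the server starts.
function loadConfig(env = process.env) {
  const apiKey = env.GENERALCOMPUTE_API_KEY;
  if (!apiKey) {
    throw new Error("GENERALCOMPUTE_API_KEY is not set; add it to .env");
  }

  // Fall back to 3000 (assumed default) when PORT is absent or not a number.
  const port = Number.parseInt(env.PORT ?? "", 10);
  return { apiKey, port: Number.isNaN(port) ? 3000 : port };
}
```

You could then pass `loadConfig().apiKey` to the `OpenAI` constructor and `loadConfig().port` to `app.listen`.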

### The Streaming Chat Endpoint

```javascript
app.post("/chat", async (req, res) => {
  const { messages } = req.body;

  if (!messages || !Array.isArray(messages)) {
    return res.status(400).json({ error: "messages array required" });
  }

  // SSE headers
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  try {
    const stream = await client.chat.completions.create({
      model: "llama-4-maverick",
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content;
      if (delta) {
        // SSE format: "data: <payload>\n\n"
        res.write(`data: ${JSON.stringify({ token: delta })}\n\n`);
      }
    }

    // Signal completion
    res.write("data: [DONE]\n\n");
    res.end();
  } catch (err) {
    console.error("Stream error:", err);
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});

app.listen(process.env.PORT, () => {
  console.log(`Server running on http://localhost:${process.env.PORT}`);
});
```

A few things worth noting:

- `res.flushHeaders()` sends the response headers immediately so the browser knows a stream is coming before any tokens arrive.
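- The `for await...of` loop over the stream object is the cleanest way to consume it. The SDK parses the upstream SSE frames and handles buffering internally, so each `chunk` is already a parsed object. It does not resume a stream that breaks partway through.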
- We serialize each token as a small JSON object (`{ token: delta }`) so you can extend it later with metadata (e.g., model name, token counts) without breaking the client parser.
- The `[DONE]` sentinel follows the same convention as the OpenAI API, so client code that already watches for it can detect the end of the stream the same way.
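
To make the wire format in these notes concrete, here is a small sketch of a frame formatter. `sseFrame` is a hypothetical helper name; it only builds the same `data: ...\n\n` strings the endpoint writes.

```javascript
// Hypothetical helper: build one SSE frame from a JSON-serializable payload.
// JSON.stringify escapes newlines inside strings, so a token containing "\n"
// still fits on a single "data:" line.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// What the client receives for a short reply, frame by frame.
const frames = [
  sseFrame({ token: "Hello" }),
  sseFrame({ token: ",\nworld" }),
  "data: [DONE]\n\n", // the sentinel is sent raw, not JSON-encoded
];

console.log(frames.join(""));
```

Keeping every frame on one line is what lets the client split the stream on newlines without a full SSE parser.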

## The Frontend

The browser side has three responsibilities: send messages, display the stream, and maintain history.

```html
<!-- public/index.html -->
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>GC Streaming Chat</title>
    <style>
      body { font-family: system-ui, sans-serif; max-width: 700px; margin: 40px auto; padding: 0 16px; }
      #messages { border: 1px solid #ddd; border-radius: 8px; padding: 16px; min-height: 300px; margin-bottom: 16px; }
      .message { margin-bottom: 12px; line-height: 1.5; }
      .user { font-weight: 600; }
      .assistant { color: #333; }
      #input-row { display: flex; gap: 8px; }
      #user-input { flex: 1; padding: 8px 12px; font-size: 1rem; border: 1px solid #ddd; border-radius: 6px; }
      button { padding: 8px 16px; background: #0070f3; color: white; border: none; border-radius: 6px; cursor: pointer; }
      button:disabled { opacity: 0.5; cursor: default; }
    </style>
  </head>
  <body>
    <h2>Streaming Chat -- GeneralCompute</h2>
    <div id="messages"></div>
    <div id="input-row">
      <input id="user-input" type="text" placeholder="Ask something..." />
      <button id="send-btn" onclick="sendMessage()">Send</button>
    </div>

    <script>
      const messagesEl = document.getElementById("messages");
      const inputEl = document.getElementById("user-input");
      const sendBtn = document.getElementById("send-btn");

      // Conversation history kept in memory
      const history = [
        {
          role: "system",
          content: "You are a helpful assistant. Be concise.",
        },
      ];

      function appendMessage(role, content) {
        const div = document.createElement("div");
        div.className = `message ${role}`;
        div.dataset.role = role;
        div.textContent = role === "user" ? `You: ${content}` : `Assistant: ${content}`;
        messagesEl.appendChild(div);
        messagesEl.scrollTop = messagesEl.scrollHeight;
        return div;
      }

      async function sendMessage() {
        const text = inputEl.value.trim();
        if (!text) return;

        inputEl.value = "";
        sendBtn.disabled = true;

        // Add user message to history and UI
        history.push({ role: "user", content: text });
        appendMessage("user", text);

        // Create a placeholder for the assistant reply
        const assistantDiv = appendMessage("assistant", "");
        let assistantText = "";

        try {
          const response = await fetch("/chat", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ messages: history }),
          });

          if (!response.ok) throw new Error(`HTTP ${response.status}`);

          const reader = response.body.getReader();
          const decoder = new TextDecoder();
          let buffer = ""; // holds a partial SSE line split across chunks
          let finished = false;

          while (!finished) {
            const { done, value } = await reader.read();
            if (done) break;

            buffer += decoder.decode(value, { stream: true });
            // A chunk may contain several SSE lines, or end mid-line
            const lines = buffer.split("\n");
            buffer = lines.pop(); // keep the incomplete tail for the next read

            for (const line of lines) {
              if (!line.startsWith("data: ")) continue;
              const payload = line.slice(6).trim();
              if (payload === "[DONE]") {
                finished = true;
                break;
              }

              const { token, error } = JSON.parse(payload);
              // Thrown errors reach the outer catch and are shown to the user
              if (error) throw new Error(error);
              assistantText += token;
              assistantDiv.textContent = `Assistant: ${assistantText}`;
              messagesEl.scrollTop = messagesEl.scrollHeight;
            }
          }

          // Save completed assistant reply to history
          history.push({ role: "assistant", content: assistantText });
        } catch (err) {
          assistantDiv.textContent = `Assistant: [Error: ${err.message}]`;
        } finally {
          sendBtn.disabled = false;
          inputEl.focus();
        }
      }

      inputEl.addEventListener("keydown", (e) => {
        if (e.key === "Enter" && !sendBtn.disabled) sendMessage();
      });
    </script>
  </body>
</html>
```

The key pattern here is using `response.body.getReader()` rather than the `EventSource` API. `EventSource` is simpler but only supports GET requests, which means you can't pass the message history in a request body. The `ReadableStream` approach gives you full control over the request while still processing the SSE format.
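
Because network chunks do not align with SSE line boundaries, a reader built on `ReadableStream` generally needs to buffer partial lines between reads. A minimal sketch of that idea follows. `createSSEParser` is a hypothetical helper, and it handles only the single-line `data:` frames this server emits, not the full SSE specification.

```javascript
// Hypothetical helper: feed it decoded text chunks, get back complete payloads.
function createSSEParser() {
  let buffer = ""; // text received so far that does not yet end in a newline

  return function feed(chunkText) {
    buffer += chunkText;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // last element is an incomplete line (or "")

    const payloads = [];
    for (const line of lines) {
      if (line.startsWith("data: ")) payloads.push(line.slice(6).trim());
    }
    return payloads;
  };
}

// A frame split across two chunks is only emitted once it is complete.
const feed = createSSEParser();
const first = feed('data: {"tok');
const second = feed('en":"Hi"}\n\ndata: [DONE]\n\n');
console.log(first, second);
```

The first call returns nothing because the line is still incomplete; the second returns both the token payload and the `[DONE]` sentinel.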

## Context Management

Multi-turn chat requires sending the full conversation history on every request. The history array above grows linearly with the conversation length, which matters for two reasons:

1. **Prompt tokens cost money.** Every message in history is re-sent and re-processed each turn.
2. **Context windows have limits.** Many models cap out somewhere between 32K and 128K tokens, so a long conversation will eventually overflow.

### Simple Sliding Window

For most apps, a sliding window works well. Keep the system message and drop the oldest user/assistant pairs once the history exceeds a fixed number of turns, which serves as a simple proxy for a token budget:

```javascript
function trimHistory(messages, maxTurns = 20) {
  const system = messages.filter((m) => m.role === "system");
  const conversation = messages.filter((m) => m.role !== "system");

  // Keep the most recent maxTurns * 2 messages (user + assistant pairs)
  const trimmed = conversation.slice(-maxTurns * 2);
  return [...system, ...trimmed];
}
```

Call this before sending to the server:

```javascript
const response = await fetch("/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: trimHistory(history) }),
});
```
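
Counting turns is only a proxy for size. If you would rather trim against an approximate token budget, one option is sketched below. The 4-characters-per-token ratio is a rough assumption, not a property of any GeneralCompute model, and `estimateTokens` and `trimToTokenBudget` are hypothetical names. Use a real tokenizer if you need accurate counts.

```javascript
// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(message) {
  return Math.ceil(message.content.length / 4);
}

// Keep all system messages, then as many of the newest messages as fit.
function trimToTokenBudget(messages, maxTokens = 4000) {
  const system = messages.filter((m) => m.role === "system");
  const conversation = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + estimateTokens(m), 0);
  const kept = [];

  // Walk backwards from the newest message until the budget is exhausted.
  for (let i = conversation.length - 1; i >= 0; i--) {
    const cost = estimateTokens(conversation[i]);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(conversation[i]);
  }
  return [...system, ...kept];
}
```

Unlike the turn-based version, this can keep many short messages or only a few long ones, depending on what fits.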

### Summarization for Longer Sessions

For sessions that need to persist longer context, you can periodically summarize old turns. Because this function calls the model, it runs on the server, where `client` is defined:

```javascript
async function summarizeOldTurns(messages) {
  const toSummarize = messages.slice(1, -10); // Keep system + last 10
  const recent = messages.slice(-10);

  if (toSummarize.length < 6) return messages; // Not worth summarizing yet

  const summaryResponse = await client.chat.completions.create({
    model: "llama-4-scout", // Cheaper model for summarization
    messages: [
      {
        role: "user",
        content: `Summarize this conversation in 3-4 sentences:\n\n${toSummarize
          .map((m) => `${m.role}: ${m.content}`)
          .join("\n")}`,
      },
    ],
  });

  const summary = summaryResponse.choices[0].message.content;

  return [
    messages[0], // system
    { role: "system", content: `Earlier conversation summary: ${summary}` },
    ...recent,
  ];
}
```

This approach costs one extra API call every N turns but keeps context manageable for hour-long sessions.
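
One way to decide when that extra call is worth making is a simple message-count threshold. The sketch below is an assumption about how you might schedule it; `shouldSummarize` and its default threshold of 30 are hypothetical.

```javascript
// Hypothetical helper: summarize once the non-system history passes a threshold.
function shouldSummarize(messages, threshold = 30) {
  const conversationLength = messages.filter((m) => m.role !== "system").length;
  return conversationLength > threshold;
}
```

On the server you might call `summarizeOldTurns(messages)` only when `shouldSummarize(messages)` returns true, so short conversations never pay for the extra request.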

## Error Handling and Reconnection

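Production apps need to handle failures gracefully. Network hiccups can make the initial request fail or cut the connection mid-stream. A simple retry wrapper on the client covers failures that happen before the stream starts; if the stream breaks partway through, the client has to send the request again: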

```javascript
async function fetchWithRetry(url, options, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return response;
    } catch (err) {
      if (attempt === maxRetries) throw err;
      await new Promise((r) => setTimeout(r, 500 * (attempt + 1)));
    }
  }
}
```
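
The wrapper above retries every failure, including client errors such as a 400 that will fail the same way each time. A variant that retries only network errors, 429s, and 5xx responses might look like the following sketch. `fetchWithSelectiveRetry` and the injectable `fetchImpl` option are hypothetical, added so the function can be exercised without a network.

```javascript
// Hypothetical variant: retry only failures that have a chance of succeeding later.
async function fetchWithSelectiveRetry(
  url,
  options,
  { maxRetries = 2, baseDelayMs = 500, fetchImpl = fetch } = {}
) {
  const wait = (attempt) =>
    new Promise((r) => setTimeout(r, baseDelayMs * (attempt + 1)));

  for (let attempt = 0; ; attempt++) {
    let response;
    try {
      response = await fetchImpl(url, options);
    } catch (err) {
      // Network-level failure: retry unless attempts are exhausted.
      if (attempt >= maxRetries) throw err;
      await wait(attempt);
      continue;
    }

    if (response.ok) return response;

    // Most 4xx responses will not improve on retry, so surface them immediately.
    const retryable = response.status >= 500 || response.status === 429;
    if (!retryable || attempt >= maxRetries) {
      throw new Error(`HTTP ${response.status}`);
    }
    await wait(attempt);
  }
}
```

Neither wrapper can resume a reply that was cut off mid-stream, since each retry starts a fresh request.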

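On the server side, stop the upstream request when the client disconnects so the model does not keep generating tokens nobody will read. Register the listener inside the `/chat` handler, right after the stream is created. Listen on `res` rather than `req`: in Node.js 16 and later, the request's `close` event signals that the request has completed, not that the socket has closed.

```javascript
res.on("close", () => {
  // Fires when the response ends or the client disconnects.
  // Aborting a stream that has already finished has no effect.
  stream.controller.abort();
});
```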

## Running the App

```bash
node server.js
```

Open `http://localhost:3000` in a browser and start chatting. Tokens stream in as they generate.

## Choosing a Model

GeneralCompute supports a range of models at different speed/capability trade-offs. For a chat app, a few good options:

| Model | Best for |
|---|---|
| `llama-4-maverick` | General chat, good balance of quality and speed |
| `llama-4-scout` | Faster responses, lower cost, great for lighter tasks |
| `qwen3-coder` | Technical or coding-heavy conversations |

Swap the model string in the server to try different ones without touching anything else.
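
If you want the client to choose a model per request, it is safer to resolve the requested name against an allowlist on the server than to pass it through unchecked. A sketch, using only the model names from the table above; `resolveModel` is a hypothetical helper, and reading the name from a `model` field in the request body is an assumption.

```javascript
// Model names taken from the table above.
const ALLOWED_MODELS = new Set(["llama-4-maverick", "llama-4-scout", "qwen3-coder"]);
const DEFAULT_MODEL = "llama-4-maverick";

// Hypothetical helper: fall back to the default for unknown or missing names.
function resolveModel(requested) {
  return ALLOWED_MODELS.has(requested) ? requested : DEFAULT_MODEL;
}
```

In the `/chat` handler this could replace the hard-coded string, for example `model: resolveModel(req.body.model)`.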

## What to Build Next

This example covers the core loop. A few natural extensions:

- **Persistence**: Store history in Redis or a database so conversations survive page reloads
- **Streaming to multiple clients**: Use a message bus to fan a single inference stream to multiple WebSocket connections
- **Tool calling**: Pass function definitions in the request's `tools` parameter and handle `tool_calls` deltas in the stream loop
- **Rate limiting**: Add per-user request throttling before deploying publicly
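
For the tool-calling extension, the main change in the stream loop is that tool call arguments arrive as string fragments that must be concatenated per call. The sketch below assumes GeneralCompute mirrors the OpenAI streaming shape, where each `delta.tool_calls` entry carries an `index` plus partial `function.name` and `function.arguments` fields. `accumulateToolCalls` is a hypothetical helper, and the sample deltas are made up for illustration.

```javascript
// Hypothetical helper: merge one chunk's tool_calls deltas into a running list.
function accumulateToolCalls(calls, deltaToolCalls = []) {
  for (const part of deltaToolCalls) {
    // Create the slot for this call the first time its index appears.
    const call = (calls[part.index] ??= { id: "", name: "", arguments: "" });
    if (part.id) call.id = part.id;
    if (part.function?.name) call.name += part.function.name;
    if (part.function?.arguments) call.arguments += part.function.arguments;
  }
  return calls;
}

// Illustrative deltas (made up), in the order a stream might deliver them.
const calls = [];
accumulateToolCalls(calls, [
  { index: 0, id: "call_1", function: { name: "get_weather", arguments: "" } },
]);
accumulateToolCalls(calls, [{ index: 0, function: { arguments: '{"city":' } }]);
accumulateToolCalls(calls, [{ index: 0, function: { arguments: '"Oslo"}' } }]);

// The arguments string is only valid JSON once the stream has finished.
console.log(calls[0].name, JSON.parse(calls[0].arguments));
```

In the server loop you would call this with `chunk.choices[0]?.delta?.tool_calls` on every chunk and parse the arguments only after the loop ends.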

If you're building something production-ready, [GeneralCompute's API docs](https://generalcompute.com/docs) cover the full completions API, available models, and rate limits. The GeneralCompute API key used in this example is the same one you'd use in any other OpenAI SDK integration pointed at GeneralCompute.