How to Build a Streaming Chat App With GeneralCompute + Node.js

Streaming chat responses feel dramatically better than waiting for the full reply to land at once. Users see text appearing as it generates, which makes even a 2-second response feel responsive. This guide walks through building a streaming chat app backed by GeneralCompute, using the openai Node.js package and Server-Sent Events (SSE) to push tokens from server to browser in real time.

The OpenAI SDK for Node.js works with GeneralCompute out of the box. Because GeneralCompute exposes an OpenAI-compatible API, you swap one base URL and your existing code keeps working -- with meaningfully faster token generation.

What We're Building

A Node.js/Express server that:

Accepts a POST request with a conversation history
Calls GeneralCompute's streaming completions endpoint using the openai SDK
Pipes the token stream back to the client over SSE

A minimal browser frontend that:

Sends the user's message to the server
Opens an SSE connection and renders tokens as they arrive
Maintains conversation history for multi-turn context

By the end you'll have a working chat app you can extend with any model GeneralCompute supports.

Prerequisites

Node.js 18 or later (native fetch and ReadableStream support)
A GeneralCompute API key (get one at generalcompute.com)
Basic familiarity with Express

Project Setup

mkdir gc-streaming-chat && cd gc-streaming-chat
npm init -y
# server.js uses ES module import syntax, so mark the package as a module
npm pkg set type=module
npm install express openai dotenv

Create a .env file:

GENERALCOMPUTE_API_KEY=your_api_key_here
PORT=3000

Your directory structure will look like this:

gc-streaming-chat/
  server.js
  public/
    index.html
  .env

The Server

Initializing the OpenAI Client

The openai package accepts a baseURL option. Point it at GeneralCompute and it handles the rest:

// server.js
import OpenAI from "openai";
import express from "express";
import "dotenv/config";

const client = new OpenAI({
  apiKey: process.env.GENERALCOMPUTE_API_KEY,
  baseURL: "https://api.generalcompute.com/v1",
});

const app = express();
app.use(express.json());
app.use(express.static("public"));
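
A startup check makes a missing key fail immediately, with a message that names the right variable. The following is a minimal sketch; requireEnv is a hypothetical helper, not part of dotenv or the openai package.

```javascript
// Hypothetical helper: read a required setting or fail fast with a clear message.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage in server.js (assumes "dotenv/config" has already been imported):
// const client = new OpenAI({
//   apiKey: requireEnv("GENERALCOMPUTE_API_KEY"),
//   baseURL: "https://api.generalcompute.com/v1",
// });
```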

The Streaming Chat Endpoint

app.post("/chat", async (req, res) => {
  const { messages } = req.body;

  if (!messages || !Array.isArray(messages)) {
    return res.status(400).json({ error: "messages array required" });
  }

  // SSE headers
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  try {
    const stream = await client.chat.completions.create({
      model: "llama-4-maverick",
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content;
      if (delta) {
        // SSE format: "data: <payload>\n\n"
        res.write(`data: ${JSON.stringify({ token: delta })}\n\n`);
      }
    }

    // Signal completion
    res.write("data: [DONE]\n\n");
    res.end();
  } catch (err) {
    console.error("Stream error:", err);
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});

// Fall back to 3000 if PORT is not set in .env
const port = process.env.PORT || 3000;
app.listen(port, () => {
  console.log(`Server running on http://localhost:${port}`);
});

A few things worth noting:

res.flushHeaders() sends the response headers immediately so the browser knows a stream is coming before any tokens arrive.
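The for await...of loop over the stream object is the cleanest way to consume it. The SDK parses the upstream response and yields one chunk object per event. It does not resume a stream that fails partway through, so a mid-stream failure lands in the catch block.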
We serialize each token as a small JSON object ({ token: delta }) so you can extend it later with metadata (e.g., model name, token counts) without breaking the client parser.
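The [DONE] sentinel follows the same convention as the OpenAI API, so the end-of-stream check will look familiar to anyone who has written a client against OpenAI's raw SSE stream. The { token } payload itself is specific to this app.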
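
The framing described above can be isolated into a pair of small helpers. This is a sketch only; sseData and sseDone are hypothetical names, not part of Express or the openai package.

```javascript
// Hypothetical helpers for the SSE framing used by the /chat endpoint.

// Serialize one JSON payload as a single SSE "data:" event.
// JSON.stringify escapes newlines inside strings, so the payload always
// fits on one line and the trailing blank line cleanly ends the event.
function sseData(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// End-of-stream sentinel, matching the [DONE] convention used above.
function sseDone() {
  return "data: [DONE]\n\n";
}

// Example: a token event carrying extra metadata alongside the text.
const frame = sseData({ token: "Hello\nworld", model: "llama-4-maverick" });
console.log(JSON.stringify(frame));
```

In the handler, res.write(sseData({ token: delta })) would replace the inline template string.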

The Frontend

The browser side has three responsibilities: send messages, display the stream, and maintain history.

<!-- public/index.html -->
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>GC Streaming Chat</title>
    <style>
      body { font-family: system-ui, sans-serif; max-width: 700px; margin: 40px auto; padding: 0 16px; }
      #messages { border: 1px solid #ddd; border-radius: 8px; padding: 16px; min-height: 300px; margin-bottom: 16px; }
      .message { margin-bottom: 12px; line-height: 1.5; }
      .user { font-weight: 600; }
      .assistant { color: #333; }
      #input-row { display: flex; gap: 8px; }
      #user-input { flex: 1; padding: 8px 12px; font-size: 1rem; border: 1px solid #ddd; border-radius: 6px; }
      button { padding: 8px 16px; background: #0070f3; color: white; border: none; border-radius: 6px; cursor: pointer; }
      button:disabled { opacity: 0.5; cursor: default; }
    </style>
  </head>
  <body>
    <h2>Streaming Chat -- GeneralCompute</h2>
    <div id="messages"></div>
    <div id="input-row">
      <input id="user-input" type="text" placeholder="Ask something..." />
      <button id="send-btn" onclick="sendMessage()">Send</button>
    </div>

    <script>
      const messagesEl = document.getElementById("messages");
      const inputEl = document.getElementById("user-input");
      const sendBtn = document.getElementById("send-btn");

      // Conversation history kept in memory
      const history = [
        {
          role: "system",
          content: "You are a helpful assistant. Be concise.",
        },
      ];

      function appendMessage(role, content) {
        const div = document.createElement("div");
        div.className = `message ${role}`;
        div.dataset.role = role;
        div.textContent = role === "user" ? `You: ${content}` : `Assistant: ${content}`;
        messagesEl.appendChild(div);
        messagesEl.scrollTop = messagesEl.scrollHeight;
        return div;
      }

      async function sendMessage() {
        const text = inputEl.value.trim();
        if (!text) return;

        inputEl.value = "";
        sendBtn.disabled = true;

        // Add user message to history and UI
        history.push({ role: "user", content: text });
        appendMessage("user", text);

        // Create a placeholder for the assistant reply
        const assistantDiv = appendMessage("assistant", "");
        let assistantText = "";

        try {
          const response = await fetch("/chat", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ messages: history }),
          });

          if (!response.ok) throw new Error(`HTTP ${response.status}`);

          const reader = response.body.getReader();
          const decoder = new TextDecoder();
          let buffer = "";
          let finished = false;

          while (!finished) {
            const { done, value } = await reader.read();
            if (done) break;

            // Buffer the decoded text: an SSE line can be split across chunks
            buffer += decoder.decode(value, { stream: true });
            const lines = buffer.split("\n");
            buffer = lines.pop(); // keep the trailing partial line for the next read

            for (const line of lines) {
              if (!line.startsWith("data: ")) continue;
              const payload = line.slice(6).trim();
              if (payload === "[DONE]") {
                finished = true; // also stops the outer read loop
                break;
              }

              const { token, error } = JSON.parse(payload);
              if (error) throw new Error(error); // handled by the outer catch
              assistantText += token;
              assistantDiv.textContent = `Assistant: ${assistantText}`;
              messagesEl.scrollTop = messagesEl.scrollHeight;
            }
          }

          // Save completed assistant reply to history
          history.push({ role: "assistant", content: assistantText });
        } catch (err) {
          assistantDiv.textContent = `Assistant: [Error: ${err.message}]`;
        } finally {
          sendBtn.disabled = false;
          inputEl.focus();
        }
      }

      inputEl.addEventListener("keydown", (e) => {
        if (e.key === "Enter" && !sendBtn.disabled) sendMessage();
      });
    </script>
  </body>
</html>

The key pattern here is using response.body.getReader() rather than the EventSource API. EventSource is simpler but only supports GET requests, which means you can't pass the message history in a request body. The ReadableStream approach gives you full control over the request while still processing the SSE format.
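
Network chunks do not have to line up with SSE events: one read can carry several events, or end in the middle of a line. Below is a minimal sketch of a parser that buffers across reads, assuming the data: <json> framing used by the server above. createSseParser is a hypothetical name.

```javascript
// Hypothetical helper: incremental parser for "data: ..." SSE lines.
// feed() accepts decoded text and returns the payloads of every complete
// line seen so far; an unfinished trailing line is kept for the next call.
function createSseParser() {
  let buffer = "";
  return {
    feed(text) {
      buffer += text;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // last element is "" or a partial line
      const payloads = [];
      for (const line of lines) {
        if (line.startsWith("data: ")) payloads.push(line.slice(6).trim());
      }
      return payloads;
    },
  };
}

// An event split across two reads is only emitted once it is complete.
const parser = createSseParser();
const first = parser.feed('data: {"tok');
const second = parser.feed('en":"Hi"}\n\ndata: [DONE]\n\n');
console.log(first, second);
```

In the read loop, the result of decoder.decode(value, { stream: true }) would be passed to feed(), and each returned payload handled in turn.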

Context Management

Multi-turn chat requires sending the full conversation history on every request. The history array above grows linearly with the conversation length, which matters for two reasons:

Prompt tokens cost money. Every message in history is re-sent and re-processed each turn.
Context windows have limits. Many models cap out between 32K and 128K tokens, though the exact limit varies by model. A long conversation will eventually overflow.

Simple Sliding Window

For most apps, a sliding window works well. Keep the system message and drop the oldest user/assistant pairs once the conversation exceeds a fixed number of turns, which serves as a rough proxy for a token budget:

function trimHistory(messages, maxTurns = 20) {
  const system = messages.filter((m) => m.role === "system");
  const conversation = messages.filter((m) => m.role !== "system");

  // Keep the most recent maxTurns * 2 messages (user + assistant pairs)
  const trimmed = conversation.slice(-maxTurns * 2);
  return [...system, ...trimmed];
}

Call this before sending to the server:

const response = await fetch("/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: trimHistory(history) }),
});
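
Turn counts are only a proxy, since one long message can outweigh many short ones. A rough sketch of budget-based trimming follows. The four-characters-per-token estimate is an assumption, not the model's tokenizer, and estimateTokens and trimToBudget are hypothetical names.

```javascript
// Rough heuristic (assumption): about 4 characters per token for English text.
// Real counts depend on the model's tokenizer. Assumes content is a string.
function estimateTokens(message) {
  return Math.ceil(message.content.length / 4);
}

// Keep all system messages, then add conversation messages from newest to
// oldest until the estimated budget is used up.
function trimToBudget(messages, maxTokens = 8000) {
  const system = messages.filter((m) => m.role === "system");
  const conversation = messages.filter((m) => m.role !== "system");

  let used = system.reduce((sum, m) => sum + estimateTokens(m), 0);
  const kept = [];
  for (let i = conversation.length - 1; i >= 0; i--) {
    const cost = estimateTokens(conversation[i]);
    if (used + cost > maxTokens) break; // older messages are dropped
    used += cost;
    kept.unshift(conversation[i]);
  }
  return [...system, ...kept];
}
```

Because the estimate is crude, it is safer to set maxTokens well below the model's actual context limit.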

Summarization for Longer Sessions

For sessions that need to persist longer context, you can periodically summarize old turns. This function runs on the server, where client is defined:

async function summarizeOldTurns(messages) {
  const toSummarize = messages.slice(1, -10); // Keep system + last 10
  const recent = messages.slice(-10);

  if (toSummarize.length < 6) return messages; // Not worth summarizing yet

  const summaryResponse = await client.chat.completions.create({
    model: "llama-4-scout", // Cheaper model for summarization
    messages: [
      {
        role: "user",
        content: `Summarize this conversation in 3-4 sentences:\n\n${toSummarize
          .map((m) => `${m.role}: ${m.content}`)
          .join("\n")}`,
      },
    ],
  });

  const summary = summaryResponse.choices[0].message.content;

  return [
    messages[0], // system
    { role: "system", content: `Earlier conversation summary: ${summary}` },
    ...recent,
  ];
}

This approach costs one extra API call every N turns but keeps context manageable for hour-long sessions.

Error Handling and Reconnection

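Production apps need to handle failures gracefully. A request can fail before the stream starts, for example on a network hiccup or a 5xx response, and a simple retry wrapper on the client covers that case. It does not resume a stream that breaks partway through; recovering from that means re-sending the request.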

async function fetchWithRetry(url, options, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return response;
    } catch (err) {
      if (attempt === maxRetries) throw err;
      await new Promise((r) => setTimeout(r, 500 * (attempt + 1)));
    }
  }
}
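
The wrapper above retries every failure, including responses such as 400 that will fail the same way on the next attempt. Below is a sketch of a variant that retries only on network errors, 429, and 5xx responses. Which statuses are worth retrying is an assumption here, so check the provider's API docs. fetchWithSelectiveRetry, fetchImpl, and baseDelayMs are illustrative names.

```javascript
// Hypothetical variant of fetchWithRetry that skips retries for client errors.
function isRetryableStatus(status) {
  return status === 429 || status >= 500;
}

async function fetchWithSelectiveRetry(url, options, config = {}) {
  const { maxRetries = 2, baseDelayMs = 500, fetchImpl = fetch } = config;

  for (let attempt = 0; ; attempt++) {
    let response;
    try {
      response = await fetchImpl(url, options);
    } catch (err) {
      // Network-level failure: retry unless attempts are exhausted.
      if (attempt >= maxRetries) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      continue;
    }
    if (response.ok) return response;
    if (!isRetryableStatus(response.status) || attempt >= maxRetries) {
      throw new Error(`HTTP ${response.status}`);
    }
    // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
}
```

The fetchImpl parameter exists only so the function can be exercised with a stub; in the browser the default global fetch is used.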

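On the server side, the try/catch in the /chat handler already reports stream errors and ends the SSE response. The remaining case is a client that disconnects mid-stream: abort the upstream request so tokens stop being generated for a connection that is gone. Register the handler inside /chat, right after the stream is created. Listening on res rather than req is the safer choice, because on recent Node.js versions the request's close event can fire as soon as the request body has been read:

res.on("close", () => {
  // Fires when the connection closes, including after a normal res.end();
  // aborting a stream that has already finished is harmless
  stream.controller.abort();
});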
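
To show where the disconnect handler sits relative to the token loop, here is a sketch of the forwarding step as a standalone function. forwardTokens is a hypothetical name. It assumes stream is an async iterable of chat completion chunks and that abort cancels the upstream request (with the SDK stream used above, that would be () => stream.controller.abort()). It listens on res, whose close event fires when the connection ends.

```javascript
// Hypothetical helper: forward tokens to an SSE response, stopping early
// if the client disconnects.
async function forwardTokens(stream, res, abort) {
  let clientGone = false;
  res.on("close", () => {
    // Also fires after a normal res.end(); by then the loop has finished.
    clientGone = true;
    abort(); // cancel upstream generation
  });

  for await (const chunk of stream) {
    if (clientGone) break; // nothing left to write to
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) res.write(`data: ${JSON.stringify({ token: delta })}\n\n`);
  }

  if (!clientGone) {
    res.write("data: [DONE]\n\n");
    res.end();
  }
}
```

Inside the /chat handler this would be called within the existing try block, so upstream errors still reach the catch.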

Running the App

node server.js

Open http://localhost:3000 in a browser and start chatting. Tokens stream in as they generate.

Choosing a Model

GeneralCompute supports a range of models at different speed/capability trade-offs. For a chat app, a few good options:

| Model | Best for |
|---|---|
| llama-4-maverick | General chat, good balance of quality and speed |
| llama-4-scout | Faster responses, lower cost, great for lighter tasks |
| qwen3-coder | Technical or coding-heavy conversations |

Swap the model string in the server to try different ones without touching anything else.

What to Build Next

This example covers the core loop. A few natural extensions:

Persistence: Store history in Redis or a database so conversations survive page reloads
Streaming to multiple clients: Use a message bus to fan a single inference stream to multiple WebSocket connections
Tool calling: Pass function definitions in the request's tools parameter and handle tool_calls deltas in the stream loop
Rate limiting: Add per-user request throttling before deploying publicly

If you're building something production-ready, GeneralCompute's API docs cover the full completions API, available models, and rate limits. The API key that works in this example is the same one you'd use in any OpenAI SDK integration.