How to handle streaming in OpenAI GPT chat completions

Learn about SSE and OpenAI completion APIs

Introduction

Chat completions powered by OpenAI's GPT models can offer a truly magical experience to users. However, when consuming this service through the API, waiting for the whole response to be generated can be frustrating for the user. Streaming responses can help mitigate this problem: when enabled, the OpenAI server sends tokens as data-only Server-Sent Events (SSE) as they become available, creating a chat-like experience. In this blog, we will explore how to configure and implement streaming in OpenAI's chat completions API. We will also look at how to consume these streams using Node.js, highlighting the differences between OpenAI's streaming API and standard SSE.

Why should we enable streaming?

Streaming is a technique that allows data to be sent and received incrementally, without waiting for the entire payload to be ready. This can improve both performance and user experience, especially for large or dynamically generated data.

For example, imagine you are watching a video on YouTube. You don't have to wait for the entire video to be downloaded before you can start watching it. Instead, YouTube streams the video in small chunks as you watch. This way, you can start enjoying the video right away instead of waiting for the full download.

Similarly, streaming can be useful for chat completions powered by OpenAI's GPT models. These models are very powerful and can generate long and complex texts based on user input. However, they also take some time to process each request and return a response.

If you use the standard chat completions endpoint, you have to wait until the model finishes generating the entire text before you can receive it as a JSON object. This can create a laggy and unnatural experience for your users.

However, if you use the streaming chat completions endpoint, you don't have to wait for the whole text to be ready. Instead, you can receive tokens as data-only SSE messages while the model is still generating them. This way, you can display partial text to your users as it is being written, creating a more interactive and engaging experience.

How is this implemented?

Now that we have established the need for streaming, let's understand how this is configured and implemented.

The chat completions API is available at the following endpoint:

POST https://api.openai.com/v1/chat/completions

You can invoke this endpoint as follows:

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "stream": true
   }'

P.S. Make sure to set your OpenAI API key in the OPENAI_API_KEY environment variable.

The key parameter is stream: true in the payload; it is set to false by default. Once it is set to true, the API responds with Content-Type: text/event-stream and Transfer-Encoding: chunked instead of the regular Content-Type: application/json, and the response body is formatted as Server-Sent Events.

Server-Sent Events (SSE) is an HTTP-based specification that provides a way to establish a long-running, one-way (server-to-client) connection. A good explanation of Server-Sent Events can be found here. Instead of repeating it, I will focus on what is specific to the OpenAI response. A typical streamed OpenAI response looks like the following:

data: {"id":"xxx","object":"chat.completion.chunk","created":1679168243,"model":"mmm","choices":[{"delta":{"content":"!"},"index":0,"finish_reason":null}]}

data: {"id":"yyy","object":"chat.completion.chunk","created":1679168243,"model":"mmm","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: [DONE]

As you can see, the response is a series of data: messages, each carrying a JSON-encoded chunk, followed by a final [DONE] message.

Let's look into the payload in more detail. Every message chunk has the following structure:

{
  "id": "chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve",
  "object": "chat.completion.chunk",
  "created": 1679168243,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "delta": {
        "content": " test"
      },
      "index": 0,
      "finish_reason": null
    }
  ]
}

Every chunk includes a finish_reason field, which stays null until generation ends. The possible values for finish_reason are:

  • stop: API returned complete model output

  • length: Incomplete model output due to max_tokens parameter or token limit

  • content_filter: Omitted content due to a flag from our content filters

  • null: API response still in progress or incomplete

So, the last message chunk will look like the following:

{
  "id": "chatcmpl-6vWehRmDlfzWxyA0qBAvVd2PtORsJ",
  "object": "chat.completion.chunk",
  "created": 1679168245,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "delta": {},
      "index": 0,
      "finish_reason": "stop"
    }
  ]
}
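
On the client side, the full completion text can be reconstructed by concatenating the delta.content values from each chunk and watching finish_reason. Here is a minimal sketch of that bookkeeping; the names StreamAccumulator and push are illustrative, not part of the OpenAI API:

// Sketch: accumulate streamed chunks into the final completion text.
class StreamAccumulator {
  constructor() {
    this.text = "";
    this.finishReason = null;
  }

  // Call once for each parsed chunk object.
  push(chunk) {
    const choice = chunk.choices[0];
    this.text += choice.delta?.content ?? "";     // most chunks carry one token
    if (choice.finish_reason !== null) {
      this.finishReason = choice.finish_reason;   // "stop", "length" or "content_filter"
    }
    return this.finishReason !== null;            // true once the stream has ended
  }
}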

How is OpenAI's streaming API different from standard SSE?

Before we look at how to consume OpenAI's streaming API, let's look at how it differs from standard SSE. These differences are the reason we cannot use the standard EventSource Web API to consume OpenAI streams. Here are the main ones:

  1. Standard SSE expects GET resources, whereas, as we saw in the curl example, OpenAI expects a POST request with a custom JSON payload.

  2. The same endpoint responds with Content-Type: application/json if there is an error in the request itself (for example, an invalid API key or a malformed payload), so the client has to handle both types of responses; a small sketch of this check follows below.
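
One simple way to handle the second difference is to check the Content-Type header of the response before reading the stream. This is only a sketch; the helper name assertEventStream is illustrative, not part of any API:

// Sketch: decide how to read the response based on its content type.
// `response` is assumed to come from a fetch() call with "stream": true in the body.
async function assertEventStream(response) {
  const contentType = response.headers.get("content-type") ?? "";
  if (contentType.includes("application/json")) {
    // The request itself failed (invalid key, malformed payload, ...),
    // so the body is a normal JSON error object rather than an event stream.
    const body = await response.json();
    throw new Error(body.error?.message ?? "OpenAI request failed");
  }
  return response; // text/event-stream: safe to read incrementally
}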

Consuming OpenAI SSE events using a Node.js backend

Now that we have understood the API specification in some detail, let's look at how to consume this API in Node.js.
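
Here is a minimal sketch of such a consumer. It assumes Node 18 or later (which provides the global fetch and TextDecoderStream APIs) and the callback-style createParser() API of eventsource-parser v1.x; the function name streamChatCompletion is just illustrative:

// Run with Node 18+ as an ES module (e.g., save as stream.mjs).
import { createParser } from "eventsource-parser";

async function streamChatCompletion(prompt) {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
  });

  // If the request itself is invalid, the API responds with a regular
  // application/json error body instead of an event stream.
  if (!response.ok) {
    throw new Error(`OpenAI API error: ${await response.text()}`);
  }

  // The callback fires once for every parsed "data:" message.
  const parser = createParser((event) => {
    if (event.type !== "event") return;
    if (event.data === "[DONE]") {
      console.log("\n--- stream finished ---");
      return;
    }
    const chunk = JSON.parse(event.data);
    const token = chunk.choices[0].delta?.content;
    if (token) process.stdout.write(token); // show partial text as it arrives
  });

  // Decode the binary response stream into strings and feed it to the parser.
  const textStream = response.body.pipeThrough(new TextDecoderStream());
  for await (const part of textStream) {
    parser.feed(part);
  }
}

streamChatCompletion("Say this is a test!").catch(console.error);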

Let's go through this code in more detail. This example uses the eventsource-parser npm package, which handles the parsing of the event stream chunks.

We can use the regular fetch API to construct the request. As we know, the response body is a UTF-8 encoded byte stream. In the code above, we pipe this stream through a TextDecoderStream() to convert it from binary chunks to strings.

We create a parser instance using the createParser() method from eventsource-parser. A callback function is passed to it, and that callback is invoked every time a complete message has been parsed.

We use an async iterator (for await ... of) to read from the response body. Each chunk we receive from the fetch response stream is passed to the parser.feed() method for parsing.

As we noted earlier, OpenAI sends a [DONE] message once the stream is complete, so the callback first checks for that value. For any other message, the data payload can be parsed as a JSON object.

Another option is to use the sse.js npm package, which allows us to construct the request with custom headers and use the POST method. I showed the eventsource-parser example first to demonstrate that nothing special is happening beyond parsing the data: payloads we saw earlier.
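
For completeness, here is a rough sketch of what the sse.js variant could look like, based on its documented options. Note that sse.js is built on XMLHttpRequest, so it is primarily aimed at browser code, and the key handling below is purely illustrative:

// sse.js can be imported from npm or loaded via a <script> tag (which exposes a global SSE).
import { SSE } from "sse.js";

const apiKey = "YOUR_OPENAI_API_KEY"; // illustrative only; never ship a real key in browser code

const source = new SSE("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`,
  },
  payload: JSON.stringify({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: "Say this is a test!" }],
    stream: true,
  }),
});

source.addEventListener("message", (e) => {
  if (e.data === "[DONE]") {
    source.close();
    return;
  }
  const chunk = JSON.parse(e.data);
  const token = chunk.choices[0].delta?.content;
  if (token) console.log(token);
});

source.stream(); // start the request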

Conclusion

In conclusion, we have seen how to enable streaming responses in OpenAI's GPT chat completions API. By allowing the server to send data-only server-sent events as they become available, we can create a more responsive and chat-like experience for the user. By using the techniques discussed in this blog, you can create more seamless and responsive chat completions powered by OpenAI's GPT. Whether you're building a chatbot, virtual assistant, or any other conversational AI application, streaming can help take your user experience to the next level.