Handling Long Responses with VertexAI API

When using generative AI to process large contexts, such as PDFs, the output may be cut off before it is complete. This is especially common when the generated text is long enough to hit the model's output token limit. In this post, I will explain how to handle this problem with VertexAI, using recursive API calls to obtain the full response.

Using “Continue” in Chat UIs

In a chat UI with VertexAI, if the output stops prematurely, you can type “continue” and the AI will pick up where it left off. However, when using API calls, you don’t have this interactive capability, so a different approach is needed to handle this issue.

Handling the Issue with API Calls

To address this problem in API calls, you need to monitor the finishReason in the response. It tells you why generation stopped, so you can detect a truncated output and reissue the generation request with the accumulated context. Below is a step-by-step guide on how to handle it.

1. Checking the Response

Within the API response, there is a finishReason that indicates why the generation stopped. Check this value to determine whether another generation request is needed; a minimal check is sketched after the excerpt below.

Here’s a reference from Google Cloud’s documentation on VertexAI:

  • FINISH_REASON_UNSPECIFIED (0): The finish reason is unspecified.
  • STOP (1): Natural stop point of the model or provided stop sequence.
  • MAX_TOKENS (2): The maximum number of tokens as specified in the request was reached.
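
For example, a minimal check might look like the following. This is a sketch assuming the @google-cloud/vertexai Node.js SDK; the project ID, model name, and prompt are placeholders:

// A minimal sketch of checking finishReason (inside an async function).
import { VertexAI } from "@google-cloud/vertexai";

const vertexAI = new VertexAI({ project: "your-project-id", location: "us-central1" });
const model = vertexAI.getGenerativeModel({ model: "gemini-1.5-pro" });

const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: "Summarize this PDF: ..." }] }],
});

const finishReason = result.response.candidates?.at(0)?.finishReason;
if (finishReason === "MAX_TOKENS") {
  // The output was truncated at the token limit;
  // reissue the request with the accumulated context (see below).
}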

2. Retaining Context

When reissuing a generation request, it is important to include the context of the conversation: the original input and all outputs generated so far. If this context is not included, the AI may generate text that is unrelated to the previous output.
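
Concretely, the conversation history is an array of Content objects with alternating user and model roles. Here is a sketch of what the follow-up request looks like, reusing the model instance from the previous sketch (the texts are placeholders):

// Sketch: the conversation context carried into the follow-up request.
const contents = [
  { role: "user", parts: [{ text: "Summarize this PDF: ..." }] },
  { role: "model", parts: [{ text: "...the truncated first half of the output..." }] },
  { role: "user", parts: [{ text: "Please continue" }] },
];

// Reissue the request with the full history so the model resumes where it stopped.
const continued = await model.generateContent({ contents });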

3. Managing Token Count

One key point to watch out for is that recursive generation requests increase the number of input tokens, since each request carries the entire conversation history. If the token count becomes too high, it can slow down processing or even hit the model's input token limit. Therefore, it's important to manage the context carefully, trimming unnecessary parts when possible.
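
One way to keep the context in check is to count tokens before each request and trim the oldest turns once a budget is exceeded. A rough sketch, assuming the SDK's countTokens method; maxInputTokens is a hypothetical budget you choose for your model:

// Sketch: trim the oldest turns until the context fits a token budget.
const maxInputTokens = 100_000; // hypothetical threshold

let { totalTokens } = await model.countTokens({ contents });
while (totalTokens > maxInputTokens && contents.length > 2) {
  contents.splice(0, 2); // drop the oldest user/model pair
  ({ totalTokens } = await model.countTokens({ contents }));
}

In practice you may want to pin the original instruction and trim turns from the middle instead, so the model does not lose the task description.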

Code Example

Here is a concrete example from my PocketMD repository on GitHub that implements a mechanism to recursively retrieve responses when the output is cut off.
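
The excerpt assumes two pieces of surrounding setup: a client (a GenerativeModel instance) and a contents array seeded with the initial prompt. Roughly, as a sketch rather than the repository's exact setup code:

// Assumed setup for the excerpt below (a sketch, not the repo's exact code).
import { VertexAI, Content } from "@google-cloud/vertexai";

const vertexAI = new VertexAI({ project: "your-project-id", location: "us-central1" });
const client = vertexAI.getGenerativeModel({ model: "gemini-1.5-pro" });

// The conversation context, seeded with the initial instruction.
const contents: Content[] = [
  { role: "user", parts: [{ text: "Describe this PDF in Markdown: ..." }] },
];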

// https://github.com/ainoya/PocketMD/blob/bd46bf40d83c77bba24f5c46aa64d97e07057f82/src/describe_pdf.ts#L66C15-L98C4

// Define a constant for the word to request continuation of content generation.
// This word is taken from the environment variable `CONTINUE_WORD`, and defaults to "Please continue" if not set.
const continueWord = process.env.CONTINUE_WORD || "Please continue";

// Set the maximum number of loops to prevent infinite looping during content generation.
const maxLoopCount = 10;
let loopCount = 0;

// Infinite loop to keep generating content until the stop conditions are met.
while (true) {
  // If the loop count exceeds the maximum allowed iterations, break the loop to avoid excessive processing.
  if (loopCount >= maxLoopCount) {
    console.log("Max loop count reached");
    break;
  }

  // Increment the loop counter on each iteration.
  loopCount++;

  // Log to indicate the content generation process has started.
  console.log("Generating content...");

  // Send a request to the API client to generate content based on the existing conversation context.
  const generated = await client.generateContent({
    contents: contents,
  });

  // Extract the content from the response, assuming it's in the first candidate.
  const content = generated.response.candidates?.at(0)?.content;

  // If content is successfully generated, add it to the existing conversation context.
  if (content) {
    contents.push(content); // Add the generated content to the conversation context.

    // Add a user input asking the model to continue generating content, using the defined continue word.
    contents.push({ role: "user", parts: [{ text: continueWord }] });
  } else {
    // If no content was generated, log a message and break the loop.
    console.log("No content generated");
    break;
  }

  // Check the reason why the generation stopped (e.g., reaching token limit, completion).
  const finishReason = generated.response.candidates?.at(0)?.finishReason;

  // If the reason for stopping is not due to reaching the maximum token limit, break the loop.
  // This means the content is considered "finished" or some other stop condition has occurred.
  if (finishReason !== "MAX_TOKENS") {
    console.log("Finish reason:", generated.response.candidates?.at(0)?.finishReason);
    break;
  }
}
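
After the loop, contents holds the model's partial outputs interleaved with the "Please continue" prompts, so the full text can be reassembled by concatenating the model turns. A sketch:

// Reassemble the full response from the accumulated model turns.
const fullText = contents
  .filter((c) => c.role === "model")
  .map((c) => c.parts.map((p) => p.text ?? "").join(""))
  .join("");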

Key Considerations

  • Context Retention: When reissuing a generation request, always include the previous conversation history. If you fail to do this, the AI may not generate the continuation you expect.
  • Token Management: Be mindful of the token count when recursively generating responses. To avoid hitting limits, trim unnecessary parts of the conversation history when possible.

By carefully monitoring the finishReason and managing the context and token count, you can effectively handle the issue of truncated responses in API-based content generation.