How to handle event retries?

About the Problem

When an event's handlers take too long to run (for example, 800 ms), the event is retried after a certain amount of time, and it may be retried again if execution is slow again. I believe this happens because the service that triggers the event has a timeout and keeps retrying until the handlers respond within that timeout.

How to Reproduce

  1. Clone the events-example repository.
  2. Make the event handler take longer than 800 ms (for example, with a setTimeout).
  3. Check whether retries are occurring by watching the terminal.
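
Step 2 could look roughly like the sketch below. It is not the code from the events-example repository: the name `slowHandler`, the `(ctx, next)` handler shape, the `sleep` helper, and the 1000 ms delay are all assumptions, chosen only so the handler exceeds the 800 ms mentioned above.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Assumed middleware-style handler shape; not taken from the repository.
type Next = () => Promise<void>;

async function slowHandler(ctx: { body?: unknown }, next: Next): Promise<void> {
  const startedAt = Date.now();
  console.log('event received', ctx.body);

  // Artificial delay so the handler runs longer than 800 ms.
  await sleep(1000);

  console.log(`handler finished after ${Date.now() - startedAt} ms`);
  await next();
}
```

If the same "event received" line shows up more than once for a single emitted event, that would suggest a retry.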

What Is My Scenario?

I am emitting N events to process products in batches (50 products per batch). The events are as follows:

Event 1: Retrieve the product IDs from a specific category, then emit and send the data to Event 2.

Event 2: Create specifications and specification values, then emit and send the data to Event 3.

Event 3: Update the product specifications, then emit and send the data to Event 4.

Event 4: Save some information to Master Data, then emit and send the data to Event 1.

The data passed between Events 1, 2, and 3 consists of the product IDs for the current iteration. In Event 4, the next page is sent to continue the event chain.

So the event chain looks like this:

Event 1 => Event 2 => Event 3 => Event 4 => Event 1 => …

The chain finishes when the page number exceeds the last page (total products / 50).
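
The stop condition for the chain can be sketched as below. This is an illustration under assumptions, not the actual implementation: the function names are hypothetical, pages are assumed to be numbered from 1, and the last page is assumed to be rounded up so that a partial final batch is still processed.

```typescript
const BATCH_SIZE = 50; // from the description above: 50 products per batch

// Last page number, assuming 1-based pages and a partial final batch.
function lastPage(totalProducts: number, batchSize: number = BATCH_SIZE): number {
  return Math.ceil(totalProducts / batchSize);
}

// Page that Event 4 would send to Event 1, or null when the chain should stop.
function nextPage(currentPage: number, totalProducts: number): number | null {
  const candidate = currentPage + 1;
  return candidate > lastPage(totalProducts) ? null : candidate;
}
```

With 120 products, for example, `lastPage(120)` is 3, so Event 4 would emit pages 2 and 3 and then stop.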

Goals

  1. Get more information on how events work in the VTEX IO backend.
  2. Get more information on what factors can trigger retries: timeouts, errors, HTTP error codes, etc.
  3. Get recommendations on how to handle events that may take longer than expected or that involve a complex event chain, and on how to explicitly avoid retries (middleware?).
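
To make the third goal more concrete, this is the kind of guard I have in mind for making a retried event harmless. It is only a sketch under assumptions, not a documented VTEX IO feature: `eventKey` and `runOnce` are hypothetical names, the key is built from the event name and the page number, and the in-memory `Set` only works for a single instance (several replicas would need shared storage).

```typescript
// In-memory record of keys already seen. Assumption: a single instance.
const processed = new Set<string>();

// Hypothetical key: event name plus the page being processed.
function eventKey(eventName: string, page: number): string {
  return `${eventName}:${page}`;
}

// Runs `work` only the first time a key is seen; returns false for duplicates.
async function runOnce(key: string, work: () => Promise<void>): Promise<boolean> {
  if (processed.has(key)) {
    return false; // duplicate delivery (for example, a retry): skip the work
  }
  // Mark the key before the work starts, so a retry that arrives while the
  // first attempt is still running is skipped as well.
  processed.add(key);
  try {
    await work();
    return true;
  } catch (err) {
    processed.delete(key); // let a later retry redo work that failed
    throw err;
  }
}
```

A handler would wrap its body in something like `runOnce(eventKey('event1', page), ...)`, so a retry of the same page would not emit the next event in the chain a second time.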

Suggestion

It would be great if you could create some tutorials or advanced guides on the events feature.

I changed the timeout property in service.json from 2 seconds to 54 seconds, which improved things.

I recommend not increasing this value above 54 seconds, as the VTEX API Gateway throws an error after 55 seconds.
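
A small check along these lines could guard against accidentally raising the value too far. It is a sketch under assumptions: the `ServiceConfig` interface models only the `timeout` property (in seconds) and is not the full service.json schema, `isTimeoutSafe` is a hypothetical name, and the 55-second limit is the one I observed, not a documented constant.

```typescript
// Observed limit: the API Gateway returned an error after 55 seconds.
const GATEWAY_LIMIT_SECONDS = 55;

// Hypothetical shape: only the `timeout` property (in seconds) is modeled.
interface ServiceConfig {
  timeout: number;
}

// True when the timeout is positive and stays below the observed gateway limit.
function isTimeoutSafe(config: ServiceConfig): boolean {
  return config.timeout > 0 && config.timeout < GATEWAY_LIMIT_SECONDS;
}
```

With this check, a timeout of 54 passes and a timeout of 55 or more does not.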

I still consider the points listed in the Goals section to be valid.