Connection timed out when performing PUT

Good morning,

In our environment, we have an ERP integration with VTEX. Since Friday (02/16), around 7 a.m., our integration has occasionally been returning the message ‘Connection timed out’ when attempting to update VTEX inventory via PUT. The timeout is intermittent: it doesn’t always happen for the same SKU, but in every execution at least 1 SKU comes back with a timeout.

I’d like to know if there was any update on VTEX’s end on 02/16 that could be affecting response times, or if there is any intermittent instability in their services.

Good morning, Victor!

We have no record of any updates on that day. It’s possible we’re dealing with a throttling case, though. I didn’t find anything specific for this API, but I can check with the team to find out what the limits are.

Can you tell me roughly how many SKUs are updated at the same time, or at what interval this occurs?

Karina Mota
Field Software Engineer | VTEX

Good morning Karina,

We have an average of 440 SKUs being updated per run.
Our system is configured to run the integration (order and inventory integration) every 20 minutes, and each run takes about 7 to 10 minutes to complete.

Hey @victormatos, how’s it going?

VTEX’s documentation doesn’t disclose the limits for its APIs, but they do mention that these limits can vary depending on the day.

Because of that, the best approach is to always check the response headers of each request to see if there’s any information about a rate limit being hit, before sending the next request.
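As a rough illustration of that header check, here is a minimal Python sketch. It assumes the response may carry the standard `Retry-After` header or a conventional `X-RateLimit-Remaining` header. Whether VTEX actually sends either one is not confirmed in this thread. The function name and the 1-second fallback pause are hypothetical, chosen only for the example.

```python
from typing import Mapping, Optional


def suggested_wait_seconds(status_code: int, headers: Mapping[str, str]) -> Optional[float]:
    """Return how long to pause before the next request, or None if no hint was found."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}

    # Retry-After is a standard HTTP header; only its delay-in-seconds form is handled here.
    retry_after = normalized.get("retry-after")
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)

    # Assumption: a conventional (non-standard) "remaining requests" header might be present.
    remaining = normalized.get("x-ratelimit-remaining")
    if remaining is not None and remaining.isdigit() and int(remaining) == 0:
        return 1.0  # arbitrary illustrative pause

    # A 429 without any hint: still back off briefly.
    if status_code == 429:
        return 1.0
    return None


# Usage: decide whether to sleep before sending the next PUT.
print(suggested_wait_seconds(429, {"Retry-After": "5"}))
print(suggested_wait_seconds(200, {}))
```

The integration would call this after every response and sleep for the returned number of seconds before moving on to the next SKU.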

What’s the error code returned for these requests that end up failing? Would it be a 522?

Cheers!

Hi Andre, how are you doing?

Thanks for the comment.

Regarding the returned code: it doesn’t return a code, only the message “Connection Timed Out”.
To illustrate, here are three examples of logs we received:
ID 9 : Connection timed out.
ID 40 : 403 Forbidden
ID 193 : 500 InternalServerError

Our logging is set up so that each entry shows the ID of the product that threw the error, followed by the returned error code. As you can see, the “Connection timed out” I’m referring to doesn’t return a code, only the error description.
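To make the difference between the entries easier to see, the log format can be split into product ID, optional status code, and description. This is a minimal Python sketch based only on the three sample lines above. The `LogEntry` type and `parse_log_line` function are hypothetical names, not part of our integration.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    product_id: int
    status_code: Optional[int]  # None when the error carries no HTTP status code
    description: str


# "ID <number> : [<3-digit code> ]<description>"
LINE_PATTERN = re.compile(r"^ID\s+(\d+)\s*:\s*(?:(\d{3})\s+)?(.+?)\s*$")


def parse_log_line(line: str) -> LogEntry:
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Unrecognized log line: {line!r}")
    product_id, code, description = match.groups()
    return LogEntry(int(product_id), int(code) if code else None, description)


for sample in (
    "ID 9 : Connection timed out.",
    "ID 40 : 403 Forbidden",
    "ID 193 : 500 InternalServerError",
):
    print(parse_log_line(sample))
```

Entries whose `status_code` is `None` are the ones with no HTTP status at all, which is the case for the timeout.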

Hi Victor, good morning!

I think the ideal approach in this scenario is to try to understand each error separately.

  • Connection timed out is an error that can occur due to throttling, although throttling normally comes with a 429 status code. I checked with our team, and since we don’t actually have a reference value for the limit, the recommendation in this case is to space out your calls more.

  • 403 Forbidden is an error related to permissions, which seems strange when it only occurs with a single SKU.

  • 500 InternalServerError is an error that typically occurs when the request is incorrectly structured — a wrong header, an incorrect value, etc.

The odd thing is that you mentioned these errors happen with different SKUs every time you perform a stock update, right?

Good morning Karina!

Exactly, that’s what we’re finding strange too — it doesn’t always happen with the same SKU. Sometimes it happens with just 1 SKU, sometimes with 4 different SKUs, sometimes with 2… we haven’t been able to identify a pattern.

The only pattern we actually identified is that after 02/16, every execution has been coming back with at least 1 SKU with this timeout error. Before that date, the 403 error occurred more often, but it was sporadic. That’s why we suspect some change/incident on VTEX’s end, or some bottleneck in the services on this side.

Spacing out the calls could be an option. We’ll need to analyze and test that source code change internally.

Good morning everyone,

@KarinaMota, just following up with updates on the reported case.

We tried adding a delay between products, going up to 1.5 seconds between them, but the Connection Timed Out error persisted for some SKUs (a different SKU failed on each execution, just as described in my previous report). If we increased the delay any further, our stock update program would take more than 20 minutes to finish, so we believe that approach wouldn’t be efficient for our use case.
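The arithmetic behind that limit can be sketched as follows, using the figures already given in this thread (about 440 SKUs, runs of 7 to 10 minutes, a 20-minute interval). It assumes one pause between each pair of consecutive SKUs and that the pauses simply add to the current run time, which is a simplification.

```python
SKUS_PER_RUN = 440                         # average reported earlier in the thread
DELAY_SECONDS = 1.5                        # largest delay we tested
RUN_MINUTES_MIN, RUN_MINUTES_MAX = 7, 10   # run duration without any delay
INTERVAL_MINUTES = 20                      # how often the integration starts


def added_delay_minutes(sku_count: int, delay_seconds: float) -> float:
    # One pause between consecutive SKUs means sku_count - 1 pauses in total.
    return max(sku_count - 1, 0) * delay_seconds / 60


added = added_delay_minutes(SKUS_PER_RUN, DELAY_SECONDS)
total_min = RUN_MINUTES_MIN + added
total_max = RUN_MINUTES_MAX + added

print(f"Added delay: {added:.1f} min")
print(f"Estimated run: {total_min:.1f} to {total_max:.1f} min")
print("Fits in the interval even in the slow case:", total_max <= INTERVAL_MINUTES)
```

With a 1.5-second pause the delay alone adds roughly 11 minutes, so the slower runs already reach the 20-minute interval.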

The solution we found, which has apparently proven viable based on our tests, was similar to @andremiani’s recommendation:
We were already analyzing the error responses, but we added a condition in our code specifically for this error. After each update attempt (PUT), if the response comes back with the description “Connection Timed Out.”, the program sends another PUT request. Only if the error persists after 3 attempts is it written to our error log.

Note: During our tests, whenever we received the Connection Timed Out error, 100% of the time the very next attempt already returned a positive response, without needing all 3 retries.
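For anyone who wants to try the same idea, here is a minimal Python sketch of the retry condition described above. It is not our production code. `send_put` is a hypothetical callable that performs one PUT and returns an empty string on success or the error description on failure. The sketch also treats 3 as the total number of attempts, which is an assumption.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 3
TIMEOUT_DESCRIPTION = "connection timed out"


def update_with_retry(
    send_put: Callable[[], str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, List[str]]:
    """Call send_put until it succeeds, hits a non-timeout error, or runs out of attempts.

    Returns (succeeded, errors seen along the way).
    """
    errors: List[str] = []
    for _ in range(max_attempts):
        error = send_put()
        if not error:
            return True, errors
        errors.append(error)
        # Only the timeout is retried; other errors (403, 500, ...) are reported right away.
        if TIMEOUT_DESCRIPTION not in error.lower():
            break
    return False, errors


# Usage with a simulated endpoint: the first call times out, the second succeeds.
responses = iter(["Connection Timed Out.", ""])
ok, errors = update_with_retry(lambda: next(responses))
print(ok, errors)
```

Only when the function returns `False` would the errors be written to the error log, which matches the behavior of logging the timeout only after the attempts are exhausted.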

Good morning @KarinaMota,

Just a quick addendum.

We applied the solution I mentioned to production.
Also, about the 429 code you mentioned: we received that response today, and it comes with the description “429 TooManyRequests”, as shown in our log:
ID 169 : 429 TooManyRequests

So, the “Connection Timed Out” must be a different type of error and it really doesn’t come with a code, just a description. Based on the description, the 429 is indeed a throttling case, but the reason for the Timed Out is still unclear.