
Troubleshooting Guide

This guide helps you understand and troubleshoot common issues and errors that you may encounter while using the AI App Store.

failed scheduling app

This error may occur when running or resuming an app. It is an internal error, meaning the App Store itself failed to fulfill the request due to circumstances outside its control or the control of the user.

Causes

Typical causes for this error are related to Kubernetes or App Store configuration, such as:

  • The Kubernetes cluster is out of capacity. In that case, additional compute capacity must be added before the app can be scheduled. If autoscaling is enabled, the cluster either hit its configured ceiling or took longer to scale up than the App Store timeout allows.

  • The app refers to nonexistent secrets, which will prevent the app from starting, thus causing the App Store action to time out. App Store has validations that try to prevent this error, but it can still happen if the Kubernetes state is modified from the outside.

  • The App Store runtime version or the server at large may be incorrectly configured, e.g. with an invalid GPU type, and as such the app cannot find a suitable node to run on.

  • The App Store server and the Kubernetes cluster have been incorrectly configured with respect to taints and tolerations, such that the app cannot find a suitable node to run on.

  • The Kubernetes control plane is too slow or overworked and has been unable to schedule the app in the allotted time.

  • The container registry is temporarily unavailable/down, so the container image for the app cannot be pulled, or takes an excessive amount of time to pull.
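For administrators, the causes above often leave recognizable traces in the Kubernetes events of the app's pod. The following is a minimal sketch, not part of the App Store, that maps event text to the causes listed above. The helper name `classify_event` and the substrings it matches are assumptions based on typical Kubernetes event wording, which varies between Kubernetes versions. A slow control plane usually leaves no distinctive message, so it is not covered.

```python
# Hypothetical helper: maps Kubernetes pod event text to the causes listed above.
# The substrings are assumptions based on typical event wording and may differ
# between Kubernetes versions; treat the result as a hint, not a diagnosis.
CAUSE_HINTS = [
    (("insufficient",), "cluster out of capacity"),
    (("untolerated taint",), "taints/tolerations mismatch"),
    (("didn't tolerate",), "taints/tolerations mismatch"),
    (("didn't match pod's node affinity",), "no matching node (e.g. invalid GPU type)"),
    (("secret", "not found"), "nonexistent secret"),
    (("errimagepull",), "container registry unavailable or slow"),
    (("imagepullbackoff",), "container registry unavailable or slow"),
]


def classify_event(message: str) -> str:
    """Return the first cause whose substrings all occur in the event message."""
    text = message.lower()
    for needles, cause in CAUSE_HINTS:
        if all(needle in text for needle in needles):
            return cause
    return "unknown"


if __name__ == "__main__":
    # Example messages, written in the style of real scheduler and kubelet events.
    for msg in (
        "0/3 nodes are available: 3 Insufficient cpu.",
        'MountVolume.SetUp failed for volume "creds" : secret "my-secret" not found',
    ):
        print(classify_event(msg))
```

The event text itself can be obtained by describing the pod or listing events in the app's namespace. An "unknown" result means only that none of the assumed patterns matched.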

Mitigation

Due to the nature of the error and for security reasons, we do not report the details of the error to the end user. To resolve this error, either try again after a short while or ask your administrator to consult the Kubernetes and App Store logs and determine the root cause on your behalf.
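If you start apps from a script, "try again in a little while" can be sketched as a retry loop with exponential backoff. This is an illustration only: `retry_with_backoff` is a hypothetical helper, `action` stands in for however your client starts the app, and `RuntimeError` is a placeholder for however that client surfaces the failed scheduling app error. Retrying only helps with transient causes, such as slow autoscaling or a temporarily unavailable registry. Configuration problems still need an administrator.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError:  # placeholder for the client's scheduling error
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller escalate to an administrator
            sleep(base_delay * 2 ** attempt)  # waits 5 s, 10 s, 20 s, ... by default
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the waiting behavior can be replaced in tests. In normal use the default `time.sleep` applies.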