ollama modelinstance can permafail with "connection refused" #242

Open
lukemarsden opened this issue Mar 28, 2024 · 0 comments

the runner serving ollama_mistral will sometimes start failing sessions with this error and never recover, bringing down our production service:

:x: there was a session error https://app.tryhelix.ai/session/XXX failed to get response from inference API: Post "http://localhost:41533/v1/chat/completions": dial tcp 127.0.0.1:41533: connect: connection refused

i've put in a temporary fix in #241 that just exits and restarts the runner when we detect this condition, but we need to run this down properly and fix the root cause.

presumably the ollama server sometimes exits, and we don't handle that by restarting it or by cleaning up our record of the model instance.

one issue is that the only call to m.Stop() is in pkg/runner/controller.go, and that is also the only place where the modelinstance gets removed from r.activeModelInstances:

	err := m.Stop()
	if err != nil {
		log.Error().Msgf("error stopping model instance %s: %s", m.ID(), err.Error())
	}
	r.activeModelInstances.Delete(m.ID())

but nothing cleans up r.activeModelInstances if the ollama process exits by itself. that might not be the root cause, though, because the OllamaModelInstance itself keeps running even when the ollama process exits:

	if err := cmd.Wait(); err != nil {
		log.Error().Msgf("Ollama model instance exited with error: %s", err.Error())

		errMsg := string(stderrBuf.Bytes())
		if i.currentSession != nil {
			i.errorSession(i.currentSession, fmt.Errorf("%s from cmd - %s", err.Error(), errMsg))
		}

		return
	}

	log.Info().Msgf("🟢 Ollama model instance stopped, exit code=%d", cmd.ProcessState.ExitCode())

nothing here causes the goroutine later in Start:

	go func() {
		for {
			select { // ...

to exit, or close the workCh.
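a minimal sketch of one way to tie the worker goroutine's lifetime to the process, assuming we can hand it a context that gets cancelled when cmd.Wait() returns. everything here is hypothetical and not the runner's actual code: startWorker, the wait and handle callbacks, and the string work items are stand-ins for illustration only.

```go
package main

import "context"

// startWorker launches a worker loop and a watcher goroutine.
// wait is a stand-in for cmd.Wait: it blocks until the process exits.
// handle is a stand-in for whatever processes one item of work.
// The returned stopped channel is closed once the worker loop has exited.
func startWorker(wait func() error, handle func(string)) (chan<- string, <-chan struct{}) {
	ctx, cancel := context.WithCancel(context.Background())
	workCh := make(chan string)
	stopped := make(chan struct{})

	// Watcher: as soon as the process exits, cancel the worker's context.
	go func() {
		_ = wait()
		cancel()
	}()

	// Worker: the extra ctx.Done() case is what lets the loop terminate
	// when the process goes away.
	go func() {
		defer close(stopped)
		for {
			select {
			case <-ctx.Done():
				return
			case w := <-workCh:
				handle(w)
			}
		}
	}()

	return workCh, stopped
}
```

in this sketch the worker does not close workCh itself, since closing a channel from the receiving side can panic a concurrent sender. anything that sends on workCh would need to also select on stopped so it does not block forever after the worker is gone.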

so to summarize, if the ollama process exits or is killed:

  • nothing stops the OllamaModelInstance or closes the channel
  • nothing deletes the model instance from controller's activeModelInstances

except the exit-and-restart workaround i added in #241
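for the second bullet, a rough sketch of the controller dropping its record as soon as the process exits. this assumes the model instance can expose a channel that is closed on exit. the controller and modelInstance types, the exited channel, and the track and isActive helpers are all hypothetical; only the activeModelInstances name and its Delete call come from the real code, and sync.Map is just a guess at a compatible type.

```go
package main

import "sync"

// modelInstance is a hypothetical stand-in for the runner's model instance.
type modelInstance struct {
	id     string
	exited chan struct{} // closed once the underlying process has exited
}

// controller is a hypothetical stand-in for the runner controller.
type controller struct {
	activeModelInstances sync.Map // instance id -> *modelInstance
}

// track registers the instance and removes it again as soon as its process
// exits. The returned channel is closed after the entry has been removed.
func (c *controller) track(m *modelInstance) <-chan struct{} {
	c.activeModelInstances.Store(m.id, m)
	removed := make(chan struct{})
	go func() {
		<-m.exited
		c.activeModelInstances.Delete(m.id)
		close(removed)
	}()
	return removed
}

// isActive reports whether the controller still has a record of the instance.
func (c *controller) isActive(id string) bool {
	_, ok := c.activeModelInstances.Load(id)
	return ok
}
```

the point of the watcher goroutine is that cleanup no longer depends on someone calling m.Stop(): the record goes away whether the process was stopped deliberately or died on its own.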

this issue is to handle this case properly, ideally without surfacing any errors to the user, e.g. by restarting just the ollama process (not the entire runner) if it exits or gives us "connection refused"

rusenask self-assigned this Mar 28, 2024