Daprd deleting actors unexpectedly #7734

Open
grizzlybearg opened this issue May 15, 2024 · 6 comments
Labels
kind/bug Something isn't working

Comments

@grizzlybearg

In what area(s)?

runtime
daprd

What version of Dapr?

edge: daprio/daprd:edge (latest release)

Expected Behavior

My app uses Dapr actors. The actor in our code has a timer that fires at regular intervals. I expect the timer to keep firing without daprd deactivating the actors.
I also expect daprd's /healthz check against the actor app to succeed.

Actual Behavior

The daprd logs show the following log:

2024-05-15 13:00:15 time="2024-05-15T10:00:15.742793315Z" level=error msg="Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:20 time="2024-05-15T10:00:20.743193975Z" level=error msg="Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:20 time="2024-05-15T10:00:20.743313885Z" level=warning msg="Actor health check failed 4 times, marking unhealthy" app_id=envrunneractor instance=bfd5899a36a7 scope=actorshealth type=log ver=edge
2024-05-15 13:00:21 time="2024-05-15T10:00:21.056283208Z" level=debug msg="Disconnecting from placement service by the unhealthy app" app_id=envrunneractor instance=bfd5899a36a7 scope=dapr.runtime.actors.placement type=log ver=edge
2024-05-15 13:00:21 time="2024-05-15T10:00:21.057393103Z" level=debug msg="Halting actor 'envrunneractor||TESTER'" app_id=envrunneractor instance=bfd5899a36a7 scope=dapr.runtime.actor type=log ver=edge

These messages tend to appear when an actor is created for the first time or when a timer callback is invoked. Immediately afterwards, all actors are deleted (deactivated). With thousands of actors, recovering them is compute-intensive because each actor has a lot of associated data, so we would like to stop these seemingly random deactivations. I have not found a way to disable the actor healthz check.

I've confirmed that our internal code works without any issues, even without Dapr (in a single-process runtime), so a fault in our code should not be the reason daprd deletes the actors.

Notes:

I do know that dapr creates the healthz endpoint for the actor component automatically:
image

I've confirmed that the healthz url is working
image

Steps to Reproduce the Problem

Our internal code isn't public, but I can share the Docker Compose file that we use from dev through CI/CD:

name: app
services:
  envrunneractor:
    container_name: main
    image: "local:latest"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/workspace/sdk:/workspaces/sdk:cached

    command: >
      /bin/bash -c "
        uvicorn actor_service:app --port 8884 --host 10.5.0.6"
    ports:
      - "50002:50002"
      - name: main_app
        target: 8884
        host_ip: 0.0.0.0
        published: "8884"
        protocol: tcp
        app_protocol: http

    depends_on:
      - redis
      - placement
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.6

    environment:
      DAPR_API_METHOD_INVOCATION_PROTOCOL: "http"
      DAPR_GRPC_ENDPOINT: "10.5.0.6:50002?tls=true"
      DAPR_HTTP_ENDPOINT: "http://10.5.0.6:3500"
      DAPR_HTTP_PORT: "3500"
      DAPR_GRPC_PORT: "50002"
      APP_ID: "envrunneractor"
      DAPR_HEALTH_TIMEOUT: "3000"

  runner-dapr:
    image: "daprio/daprd:edge"
    container_name: dapr
    environment:
      DAPR_HOST_IP: "10.5.0.6"
      APP_PORT: "8884"

    command: "./daprd \
        --app-id envrunneractor \
        --app-port 8884 \
        --dapr-grpc-port 50002 \
        --dapr-http-port 3500 \
        --resources-path /components \
        --log-level debug \
        --mode standalone \
        --actors-service placement:10.5.0.7:50004,10.5.0.6:50002 \
        --app-protocol http \
        --app-channel-address 10.5.0.6"

    volumes:
      - "./components/:/components"
    depends_on:
      - envrunneractor
    network_mode: "service:envrunneractor"

  ############################
  # Dapr placement service
  ############################
  placement:
    container_name: placement
    image: "daprio/dapr:edge"
    command: "./placement \
        --port 50004 \
        --log-level debug"
    ports:
      - "50004:50004"
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.7

  ############################
  # Redis state store
  ############################
  redis:
    container_name: redis
    image: "redis:6"
    ports:
      - "6379:6379"
    networks:
      dev-cloud:
        ipv4_address: 10.5.0.8

networks:
  dev-cloud:
    external: true

Our internal code is inspired by the dapr example found at: https://github.com/dapr/python-sdk/tree/release-1.0/examples/demo_actor

@ItalyPaleAle
Contributor

2 things:

  1. Timers are only invoked if the actor is already active. If the actor gets deactivated for any reason, including rebalancing (which can happen randomly if you scale Dapr), then the timers won't fire. If you want a "persistent" timer, you should use a reminder
  2. From the logs, it appears that Dapr can't invoke /healthz on your app. Implementing a /healthz endpoint in your app is required for using actors, and it must respond with a 2xx status code. Seems that your app may have temporarily stopped responding?
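One way to check whether the app intermittently stops answering is to poll /healthz repeatedly with a short timeout while the actors are busy, rather than with a single manual request. The following is a minimal sketch using only the Python standard library. The helper names `probe_health` and `longest_failure_streak`, and the default timeout, attempt count, and interval, are illustrative assumptions and do not reflect Dapr's actual health check settings.

```python
import time
import urllib.request


def probe_health(url, timeout=1.0, attempts=5, interval=1.0):
    """Poll url and return one (ok, elapsed_seconds) tuple per attempt."""
    results = []
    for attempt in range(attempts):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                ok = 200 <= response.status < 300
        except OSError:
            # Covers timeouts, refused connections, and urllib's
            # URLError/HTTPError, which are all OSError subclasses.
            ok = False
        results.append((ok, time.monotonic() - start))
        if attempt < attempts - 1:
            time.sleep(interval)
    return results


def longest_failure_streak(results):
    """Return the longest run of consecutive failed probes."""
    longest = current = 0
    for ok, _ in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest
```

For example, `probe_health("http://10.5.0.6:8884/healthz", timeout=1.0, attempts=30)` could be left running while a long actor method executes. A streak of failures that lines up with the daprd errors would suggest the app itself is slow to answer during that window.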

@grizzlybearg
Author

@ItalyPaleAle

  1. I've noticed that if the actor is deactivated, the reminders do not fire either.
  2. The Dapr actor SDK for FastAPI automatically sets up a /healthz endpoint. Are you suggesting I find a way to replace the existing endpoint? Also, the existing endpoint (created automatically by the actor SDK) does work when invoked manually, both from the host and from within the daprd container:
    image. As you can see from the image, it responds with a 2xx status code.

Is there a reason why daprd sends a request to http://10.5.0.6:8884/healthz only when the timers and reminders fire? Is there a way to disable this? The docs suggest that the health check is disabled by default, so I don't understand why daprd still invokes the health endpoint.

@elena-kolevska
Contributor

@grizzlybearg The reason we have to have a healthz check is so that dapr can know if the application can still serve those actor types. Otherwise, we could end up in a scenario where an app comes online, registers actor types A, B and C, and crashes, but the placement service and dapr sidecar still try to invoke actors on it.
When the placement service is aware that a host is down, it will properly rebalance the actors and it will forward the requests to other hosts that host the same actor type.

@grizzlybearg
Author

Hey @elena-kolevska @ItalyPaleAle. After further testing and modification of our internal code, I've come to the conclusion that this error:

{"app_id":"envrunneractor","instance":"5ef30f00f188","level":"error","msg":"Error performing request: Get \"http://10.5.0.6:8884/healthz\": context deadline exceeded","scope":"actorshealth","time":"2024-06-05T07:00:03.742868314Z","type":"log","ver":"edge"}

only shows up when an actor method has been invoked and takes longer than 5 seconds to complete. It is not limited to when timers or reminders fire, as I thought earlier. If the error shows up a number of times, Dapr deletes the actors even though they are healthy and still doing their work. This makes Dapr a poor fit for the virtual actor pattern unless there is a way to stop these checks while actor methods are being invoked, or at least to extend the context deadline so that the methods can finish their work.

Think of it as follows:
Say you have 100 actors of type test. The same method is invoked on 20 of them (the method takes more than 5 seconds to complete), and another 20 actors will have the same method invoked 1 minute after the first round. The error above is raised a few times (4 on average). As the 5th or 10th actor method is being invoked, Dapr suddenly deletes all actors because it has marked the service as unhealthy. This removes all actors along with their reminders and timers, so the second round of invocations and any remaining actors never run because the actors no longer exist.
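The thread does not establish a root cause. One possible explanation, offered here purely as an assumption, is that the actor method does synchronous work inside an `async def` handler, which would keep the app's event loop from answering /healthz until the method returns. The sketch below uses only the standard library (no Dapr or FastAPI) to show the effect: a stand-in health check waits behind a blocking coroutine, but not behind one that offloads the same work with `asyncio.to_thread` (Python 3.9+). All function names are hypothetical.

```python
import asyncio
import time


async def blocking_method(seconds):
    # Synchronous sleep: holds the event loop thread for the whole duration.
    time.sleep(seconds)


async def offloaded_method(seconds):
    # Same work, run in a worker thread so the event loop stays free.
    await asyncio.to_thread(time.sleep, seconds)


async def health_check(start):
    # Stand-in for a /healthz handler: report how long it waited to run.
    return time.monotonic() - start


async def measure(method, seconds):
    """Return how long the health check waited while method was running."""
    start = time.monotonic()
    work = asyncio.create_task(method(seconds))
    health = asyncio.create_task(health_check(start))
    _, latency = await asyncio.gather(work, health)
    return latency


if __name__ == "__main__":
    print("blocking :", asyncio.run(measure(blocking_method, 0.3)))
    print("offloaded:", asyncio.run(measure(offloaded_method, 0.3)))
```

If the blocking variant matches what the actor methods do, moving the long-running work off the event loop might let the health check respond in time. Whether that applies to this app is not confirmed anywhere in the thread.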

The approach described by @elena-kolevska makes sense as a way to keep the placement service aware of which actor services are stable and which are not. Still, I think there should be a way to accommodate scenarios such as ours. Again, we've rigorously tested our code without Dapr with no issues, so there's no reason for Dapr to delete the actors, as they are stable.

Is there a way to modify the context deadline (delay or increase the deadline) of these checks when actor methods are being invoked or at least disable these checks during invocation?
Thanks

@elena-kolevska
Contributor

Thanks for digging into this @grizzlybearg, I'll look into it as soon as possible.

@grizzlybearg
Author

Thanks


3 participants