Unexpected quarantine

Hello,

We’re running Akka 2.5.19 and are not using Cluster (only remote Artery).
We have an odd behavior where we occasionally get a quarantined system when communicating between two of the actor systems, due to a GracefulShutdownQuarantinedEvent.

For simplicity, we have the following setup with 3 different remote actor systems:

Server1
Client1
Client2
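
Each of these systems runs plain Artery remoting, configured roughly like this (a sketch only; hostnames, port and transport are placeholders rather than our exact config — our logs show we also run TLS on top of TCP):

akka {
  actor.provider = remote            # plain remoting, no cluster
  remote.artery {
    enabled = on
    transport = tcp                  # sketch uses plain tcp; we run a TLS/TCP transport
    canonical.hostname = "Server1"   # Client1 / Client2 analogous, each with its own port
    canonical.port = 13601
  }
}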

An ActorRef from Client2 is serialized and sent remotely (as part of a message) to Server1, which then sends that same serialized ActorRef on to Client1.
Client1 is then able to use this ActorRef to send messages directly to Client2. Note that we do not do any actor selection here; we only use the ActorRef received from Server1.
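
To make the pattern concrete, this is roughly what the message flow looks like (class and message names are made up for this example):

import akka.actor.{ Actor, ActorRef }

// Hypothetical message types, only for illustration.
final case class Register(client2Ref: ActorRef)
final case class SomeMessage(payload: String)

// Server1: receives the message carrying Client2's ActorRef and forwards
// that very same ref on to Client1.
class Server1Actor(client1: ActorRef) extends Actor {
  def receive: Receive = {
    case reg: Register => client1 ! reg
  }
}

// Client1: uses the received ref directly to talk to Client2,
// no actorSelection involved.
class Client1Actor extends Actor {
  def receive: Receive = {
    case Register(client2Ref) => client2Ref ! SomeMessage("hello from Client1")
  }
}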

This solution usually works fine, but we see that sometimes (it does not occur consistently), when Client2 is down for e.g. 6-7 minutes, Client1 puts the Client2 actor system into quarantined state. These are the logs we have gathered from Akka and from subscribing to various lifecycle events:


2019-11-27T13:30:00.924+0100 | Association to [akka://AS-A13601@Client2:13601] having UID [6812985527646718599] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
2019-11-27T13:30:00.926+0100 | Received GracefulShutdownQuarantinedEvent for remote ActorSystem: Association to [akka://AS-A13601@Client2:13601] having UID [6812985527646718599] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
2019-11-27T13:30:01.057+0100 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-123-1#1043978606]
2019-11-27T13:30:01.059+0100 | started (akka.stream.impl.io.TLSActor@2b060802)
2019-11-27T13:30:01.059+0100 | now watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$cb#-370902978]
2019-11-27T13:30:01.060+0100 | now supervising Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962]
2019-11-27T13:30:01.060+0100 | started (akka.io.TcpOutgoingConnection@2ca859a6)
2019-11-27T13:30:01.061+0100 | now watched by Actor[akka://AS-A13601/system/IO-TCP/selectors/$a#1664229512]
2019-11-27T13:30:01.061+0100 | Resolving Client2 before connecting
2019-11-27T13:30:01.061+0100 | Resolution request for Client2 from Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962]
2019-11-27T13:30:01.061+0100 | Clear system message delivery of [akka://AS-A13601@Client2:13601#6812985527646718599]
2019-11-27T13:30:01.072+0100 | Attempting connection to [Client2/10.61.92.136:13601]
2019-11-27T13:30:01.075+0100 | Could not establish connection to [Client2:13601] due to java.net.ConnectException: Connection refused
2019-11-27T13:30:01.076+0100 | stopped
2019-11-27T13:30:01.076+0100 | received AutoReceiveMessage Envelope(Terminated(Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962]),Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962])
2019-11-27T13:30:01.080+0100 | no longer watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$cb#-370902978]
2019-11-27T13:30:01.080+0100 | closing output
2019-11-27T13:30:01.080+0100 | stopped
2019-11-27T13:30:01.080+0100 | [outbound connection to [akka://AS-A13601@Client2:13601], control stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(Client2:13601,None,List(),Some(5000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused
2019-11-27T13:30:01.080+0100 | Restarting graph due to failure. stack_trace:  (akka.stream.StreamTcpException: Tcp command [Connect(Client2:13601,None,List(),Some(5000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused)
2019-11-27T13:30:01.081+0100 | Restarting graph in 2020836690 nanoseconds
2019-11-27T13:30:03.954+0100 | [outbound connection to [akka://AS-A13601@Client2:13601], message stream] Upstream failed, cause: Association$OutboundStreamStopQuarantinedSignal$:
2019-11-27T13:30:03.959+0100 | Outbound message stream to [akka://AS-A13601@Client2:13601] was quarantined and stopped. It will be restarted if used again.
2019-11-27T13:30:03.959+0100 | stopped
2019-11-27T13:30:03.962+0100 | [outbound connection to [akka://AS-A13601@Client2:13601], control stream] Upstream failed, cause: Association$OutboundStreamStopQuarantinedSignal$:
2019-11-27T13:30:03.963+0100 | Outbound control stream to [akka://AS-A13601@Client2:13601] was quarantined and stopped. It will be restarted if used again.
2019-11-27T13:30:03.963+0100 | stopped
2019-11-27T13:37:42.531+0100 | Dropping message [SomeMessage] from [Actor[akka://AS-A13601/user/$Ne#161182002]] to [Actor[akka://AS-A13601@Client2:13601/user/$b/$a#-640698163]] due to quarantined system [akka://AS-A13601@Client2:13601]

And the interesting aspect here is that restarting Client2 does not help! The restart produces a new ActorRef from Client2, which is sent to Server1 and on to Client1, but Client1 continues to think that Client2 is in quarantined state. After taking a heap dump I see that the Artery AssociationState’s cachedAssociation field seems to be stuck in the quarantined state even though a new ActorRef is used…

Any ideas on whether this is a bug, and whether there is a workaround through configuration etc.?
Is this really how quarantined actor systems are supposed to work?

Thanks & Best Regards,
Gustav Åkesson

Sounds like it could be a bug. If you could create an isolated reproducer, that would be great.

Thanks!

When my daily workload cools off I will try to reproduce it with a stand-alone application.
FYI - we’re currently able to work around this issue by doing an ActorSelection in Client1, i.e. we receive a Client2 ActorRef from Server1, pick out the ActorPath from that ActorRef, and then do an actor selection every time Client1 sends a message to Client2 (rough sketch further below). This gets rid of the quarantine issue, and we also see this in the logs (where the issue was previously noticed):

2019-11-28T10:44:36.065+0100 | DEBUG | ult-dispatcher-4 | .a.Association(akka://AS-A13601) | j.Slf4jLogger$$anonfun$receive$1 88 | 436 - com.typesafe.akka.slf4j - 2.5.19 | Quarantine piercing attempt with message [SomeMessage] to [Actor[akka://AS-A13601@Client2:13601/]]

It looks like the actor selection then resolves the association’s graceful quarantine…
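
For reference, the workaround on Client1 is essentially this (a sketch; the helper and its name are made up):

import akka.actor.{ ActorRef, ActorSystem }

// Instead of sending through the remote ActorRef itself, keep only its path
// and resolve an ActorSelection for every send.
def sendToClient2(system: ActorSystem, client2Ref: ActorRef, msg: Any): Unit =
  system.actorSelection(client2Ref.path) ! msg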

Thanks & Best Regards,
Gustav Åkesson

Hi, I was wondering what ever became of this discussion.

In a local test setup I have made the following observation.

It is a setup I am currently trying out in preparation for a technical upgrade to Akka 2.6.

As you might know, in Akka 2.6 Cluster Client is deprecated and support for it will be removed.
We are invited to move to gRPC.

That is the use case we are now trying to explore in a playground setup: it simulates the registration of an adapter component with a central component inside an Akka cluster. The idea is that adapters are standalone JVMs, each running an actor system, and each adapter wants to register the remoting address of its entry-point actor with the singleton hub on the cluster system.

In the past we would publish a registration actor with the cluster receptionist, while the adapters would use the Cluster Client to find that registration actor inside the cluster and communicate their ActorRef to the central system. That’s our background and setup.

Now I created this setup, but just like you, I am skipping Cluster and using only Akka Artery remoting.

I am spinning up 4 JVMs: 1. the hub system, 2. the adapter-A system, 3. the adapter-B system, 4. the adapter-C system. The hub system has a small gRPC service that exposes a register method for the adapters to communicate their whereabouts (ActorRef registration).

The algorithm is that the gRPC call registers each adapter with an adapter-type-specific manager actor that keeps track of all registrations for that type of adapter. The gRPC service implementation, when receiving a registration request, sends it on to the correct manager actor, whose task is to maintain the set of active adapters. The manager starts watching the ActorRefs passed to it. Before sending a success response back to the adapter, an actual remoting call is used to validate the submitted ActorRef. If that call succeeds, the adapter is informed that the hub was able to communicate with it via Akka remoting (rough sketch after the observations below).
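
A rough sketch of the manager side, with all names and the Ping/Pong validation protocol invented for this example (the real code is driven by the gRPC service implementation):

import akka.actor.{ Actor, ActorRef, Terminated }
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._
import scala.util.Success

// Hypothetical registration protocol.
final case class RegisterAdapter(adapterRef: ActorRef)
case object Ping
case object Pong
final case class RegistrationResult(ok: Boolean)

// Keeps the set of active adapters of one type, watches every registered
// ref and validates it with an actual remoting round trip before
// acknowledging the registration.
class AdapterManager extends Actor {
  import context.dispatcher
  private implicit val timeout: Timeout = 5.seconds
  private var adapters = Set.empty[ActorRef]

  def receive: Receive = {
    case RegisterAdapter(ref) =>
      val replyTo = sender()
      context.watch(ref)
      adapters += ref
      (ref ? Ping).onComplete {                // remoting call to validate the ref
        case Success(Pong) => replyTo ! RegistrationResult(ok = true)
        case _             => replyTo ! RegistrationResult(ok = false)
      }
    case Terminated(ref) =>
      adapters -= ref
  }
}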

Observations:

  • all 3 adapters can successfully register themselves during the first run

  • however, if I shut down any of the adapter systems and restart them, the remoting validation times out.

  • the logs seem to indicate that:

  • when you shut down one of the adapter actor systems, that system is said to be quarantined by the hub system, but the log states that it will do this based on its UID (which we assume changes after each restart).
    sample message:
    15:22:34 INFO Association - Association to [akka://SAP@127.0.0.1:25521] having UID [-921023442906631572] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated

  • however, when I restart an adapter node, messaging attempts towards it are still dropped.
    The log on the adapter node clearly shows there is now a new UID:
    ArteryTransport - Remoting started with transport [Artery tcp]; listening on address [akka://SAP@127.0.0.1:25521] with UID [-8238620317667557772]
    but on the hub we see this log entry:
    DEBUG Association - Dropping message [com.example.actors.SimpleActor$Tick] from [Actor[akka://ProviderHub/user/sap-adapter-manager#220953438]] to [Actor[akka://SAP@127.0.0.1:25521/user/sap_adapter-router#-642669379]] due to quarantined system [akka://SAP@127.0.0.1:25521]

So it looks as if the quarantine is applied to the tuple ACTORSYSTEM@HOST:PORT and not only to the UID, which is instance/run specific.

Is this a bug, or should we review our configuration?

I am using the latest version of Akka in this setup:

  • Akka 2.6.17

Could my observation be related to this (pending?) issue: