
StackTrace on requesting new PageSession #45

Open
thelightcosine opened this issue Feb 19, 2020 · 8 comments
Labels
help wanted Extra attention is needed


@thelightcosine

We're having an issue with Chroxy whereby it eventually gets into a state where it fails to establish connections properly anymore.

We are running Chroxy locally on the box as a separate service.
We will make a call for a new page session like so:
ChroxyClient.page_session!(%{host: "localhost", port: 1330})

When everything is working fine the Chroxy logs look like this:

15:26:17.067 pid=<0.416.0> module=Plug.Logger [info]  GET /api/v1/connection
15:26:17.299 pid=<0.417.0> module=Chroxy.ChromeProxy [debug] Obtained new page from ChromeServer [#PID<0.301.0>]
15:26:17.299 pid=<0.406.0> module=Chroxy.ProxyRouter [debug] put object with key: 09641A45CD7EF339E90AC9536C1C6352 - value: #PID<0.417.0>
15:26:17.301 request_id=FfTq1rmOxGgNfRUAAABC pid=<0.416.0> module=Plug.Logger [info]  Sent 200 in 233ms
15:26:17.304 pid=<0.297.0> module=Chroxy.ProxyListener [info]  Connection accepted, spawning proxy server to manage connection
15:26:17.309 pid=<0.418.0> module=Chroxy.ProxyServer [debug] Resolved dynamic hook `%{mod: Chroxy.ChromeProxy, ref: #PID<0.417.0>}` for ProxyServer
15:26:17.311 pid=<0.418.0> module=Chroxy.ProxyServer [debug] Downstream connection established

But eventually it gets into a bad state, and the logs look like this:

21:25:41.828 pid=<0.9502.0> module=Plug.Logger [info]  GET /api/v1/connection
21:25:42.032 pid=<0.9503.0> module=Chroxy.ChromeProxy [debug] Obtained new page from ChromeServer [#PID<0.9043.0>]
21:25:42.032 pid=<0.1328.0> module=Chroxy.ProxyRouter [debug] put object with key: 4212795C766AB0B83DFC89E687F4A0CB - value: #PID<0.9503.0>
21:25:42.033 pid=<0.8763.0> module=Chroxy.ProxyListener [info]  Connection accepted, spawning proxy server to manage connection
21:25:42.033 request_id=FfTqzoTtW2ZgnYUAAAri pid=<0.9502.0> module=Plug.Logger [info]  Sent 200 in 205ms
21:25:42.033 pid=<0.9504.0> module=Chroxy.ProxyServer [debug] Resolved dynamic hook `%{mod: Chroxy.ChromeProxy, ref: #PID<0.9498.0>}` for ProxyServer
21:25:42.034 pid=<0.9504.0> module=Chroxy.ProxyServer [debug] Downstream connection established
21:25:42.034 pid=<0.9504.0> module=Chroxy.ProxyServer [debug] Upstream socket closed, terminating proxy
21:25:42.034 pid=<0.9498.0> module=Chroxy.ChromeProxy [info]  Proxy connection down - closing page
21:25:42.034 pid=<0.1328.0> module=Chroxy.ProxyRouter [debug] deleting object with key: C2230CCF04096F06B0AEF1A50F9AE6FA
21:25:42.034 pid=<0.1328.0> module=Chroxy.ProxyRouter [debug] received DOWN message for: #Reference<0.3639764170.3576692738.192501>

and the client code eventually gets a stack trace back:

21:25:42.034 [debug] WebSocket: ws://127.0.0.1:1331/devtools/page/4212795C766AB0B83DFC89E687F4A0CB
** (EXIT from #PID<0.2498.0>) shell process exited with reason: an exception was raised:
    ** (MatchError) no match of right hand side value: {:error, %WebSockex.ConnError{original: :timeout}}
        (chrome_remote_interface) lib/page_session.ex:123: ChromeRemoteInterface.PageSession.init/1
        (stdlib) gen_server.erl:374: :gen_server.init_it/2
        (stdlib) gen_server.erl:342: :gen_server.init_it/6
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

As soon as the downstream connection is established, we immediately get the message saying that the upstream socket closed, and then it shuts everything down. I've pored over the code, but I just don't have any idea what's causing this.
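For what it's worth, the `MatchError` in the trace comes from the WebSocket connect returning `{:error, %WebSockex.ConnError{}}` where an `{:ok, _}` is pattern-matched. As a stopgap we can rescue that in a small wrapper so the caller gets a tuple instead of crashing (a sketch; `SessionHelper` and `safe_page_session/1` are our own hypothetical names, not part of ChroxyClient):

```elixir
defmodule SessionHelper do
  # Sketch: ChroxyClient.page_session!/1 raises a MatchError when the
  # WebSocket connect fails, so convert that into an {:error, _} tuple
  # that callers can handle or retry on.
  def safe_page_session(opts \\ %{host: "localhost", port: 1330}) do
    {:ok, ChroxyClient.page_session!(opts)}
  rescue
    e in MatchError -> {:error, e}
  end
end
```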

If we shut down the entire Chroxy Application and restart it, then it goes back to working as normal. If we shut down and restart the client Application, it makes no difference.

I'm really up against a wall here so any help you could offer would be appreciated.

@thelightcosine
Author

@holsee Oh yes, and we are using:
Chroxy 0.6.3
ChroxyClient 0.3.0
ChromeRemoteInterface 0.3.0 (mandated by ChroxyClient, although ChromeRemoteInterface is now up to 0.4.1)
Google Chrome 80.0.3987.87 unknown

running on CentOS 7

@holsee holsee added the help wanted Extra attention is needed label Mar 4, 2020
@holsee
Owner

holsee commented Mar 4, 2020

Hi, thanks for reporting.

The logs suggest that the connection is being closed on the client-side (Upstream):

12:16:46.925 pid=<0.402.0> module=Chroxy.ProxyServer [debug] Upstream socket closed, terminating proxy

Which once detected, triggers the automatic clean-up of the Chrome Page:

12:16:46.925 pid=<0.398.0> module=Chroxy.ChromeProxy [info]  Proxy connection down - closing page

When the page is deleted, the ProxyRouter performs its clean-up:

Will monitor browser processes and automatically remove registrations if a browser process dies during the course of operation.

12:16:46.925 pid=<0.354.0> module=Chroxy.ProxyRouter [debug] deleting object with key: EC376F766AA221DADE3B42EBB0FBB697

12:16:46.925 pid=<0.354.0> module=Chroxy.ProxyRouter [debug] received DOWN message for: #Reference<0.4234221939.3870294018.89253>

I will spend some time investigating and refamiliarising myself, but on first inspection it would appear the client side is disconnecting, which results in the Chrome page being closed.

@holsee
Owner

holsee commented Mar 4, 2020

I have confirmed my hypothesis and created three videos that show how this was done. They are time-synced, so run them side by side:

Client Shell: https://asciinema.org/a/JV2lWPbVl7wj96WAMdFTjXrDm
Server Shell: https://asciinema.org/a/1jGmlUXh4ldfhjMFqWdTpADKo
Shell: https://asciinema.org/a/kS83YSYq4RERKuOplHxZ6fpa


In summary, the logs you have reported are consistent with the client socket connection closing. There are many reasons why this could happen, but none of them are within Chroxy's control, AFAIK.

Thanks for the report, closing this case.

@holsee holsee closed this as completed Mar 4, 2020
@holsee holsee added wontfix This will not be worked on and removed help wanted Extra attention is needed labels Mar 4, 2020
@thelightcosine
Author

@holsee Here's the problem with that analysis. When we get in this state, closing and restarting the client side does absolutely nothing for us. It comes right back up and continues to fail in the same way. Restarting the Chroxy server is the only thing that resolves the issue, until the next time we end up in that state.

@holsee
Owner

holsee commented Mar 19, 2020

I can only diagnose based on the information you provide, and what you provided was a clear code path in which the client fails to establish a connection; more specifically, it closes the connection. That part of the analysis is valid. Why it is happening is unclear.

  • How many concurrent connections are you establishing, and what size is your pool of Chrome instances (controlled by the ports configuration) set to?
  • What environment are you running this application in?
  • How many sessions have completed before this becomes an issue (roughly)?

Enabling Chrome-level logging would also help provide insight, in config/config.exs:

config :chroxy, Chroxy.ChromeServer,
  page_wait_ms: envar.("CHROXY_CHROME_SERVER_PAGE_WAIT_MS") || "200",
  crash_dumps_dir: envar.("CHROME_CHROME_SERVER_CRASH_DUMPS_DIR") || "/tmp",
  verbose_logging: 1 # <- set to 1

Another hypothesis would be that the connections are not being cleaned up, on the client side, the Chroxy side, or the Chrome side, and that after a certain number of connections no further ones can be established. But if you were hitting those limits, you would see emfile errors due to the system running out of file descriptors.

  • Can you provide a snippet of the client-side code which is used to manage a connection session, please?
  • Are you closing connections when you are done with them?
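On the emfile point, a quick check on the Chroxy host would compare the BEAM's open descriptors against the per-process limit (the `pgrep` pattern below is an assumption about how the release runs; adjust it to your deployment):

```shell
# Count open file descriptors held by the BEAM process running Chroxy,
# then print the per-process fd limit for comparison.
pid=$(pgrep -f beam.smp | head -n1)
if [ -n "$pid" ]; then
  echo "open fds: $(ls /proc/"$pid"/fd | wc -l)"
else
  echo "no beam process found"
fi
echo "fd limit: $(ulimit -n)"
```

If the open-fd count is climbing toward the limit across sessions, that would point at connections not being released.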

@holsee holsee reopened this Mar 19, 2020
@holsee holsee added help wanted Extra attention is needed and removed wontfix This will not be worked on labels Mar 19, 2020
@holsee
Owner

holsee commented Mar 19, 2020

I am performing a simple test to try and recreate the issue:

Set pool size to 10 instances of Chrome in Chroxy and start server:

export CHROXY_CHROME_PORT_FROM=9222
export CHROXY_CHROME_PORT_TO=9232
# start prod build with connected shell
MIX_ENV=prod iex -S mix

Running from a simple iex session to simulate a client:

url = "http://0.0.0.0:1000"
concurrent_tests = 20 # twice pool size, mitigates client timeouts on ws request
wait_ms = 0
total_tests = 1_000_000
error_report_process = self()

test_session = fn n ->
  try do
    IO.puts("#{n} started...")
    pid = ChroxyClient.page_session!()
    ChromeRemoteInterface.RPC.Page.navigate(pid, %{url: url})
    Process.sleep(wait_ms)
    ChromeRemoteInterface.PageSession.stop(pid)
    IO.puts("#{n} done!")
  rescue
    err ->
      IO.puts("#{n} failed with error: #{inspect(err)}")
      send(error_report_process, {:captured_error, n, err})
  end
end

# Create a test stream which will perform $total_tests tests, running $concurrent_tests at the same time.
test_stream = Task.async_stream(1..total_tests, test_session, max_concurrency: concurrent_tests, ordered: false, timeout: :infinity)

# Run the tests in another process so we can use the REPL to flush error reports
Task.start(fn -> Stream.run(test_stream) end)

# get reported client errors
flush()

Will report back later.

@holsee
Copy link
Owner

holsee commented Mar 19, 2020

I have been able to put the system into an error state after, on average, between 1.5k and 3.5k tests. This is nondeterministic, and the system recovers in most cases, allowing new connections after a short restart period.

The reason for the issue is not 100% clear, I will try and spend some more time on this over the coming days.

Another thing which is apparent is that the logs in this error state do not correspond to the ones you provided; as I previously stated, those you posted show normal operation.

Chrome itself also seems to be more stable now than it ever has been, which is good news given how problematic it has been in the past, and how much of what Chroxy offers is built around detecting and restarting broken Chrome processes.

So my next steps will be to add deeper debug logging, review the process structure within Chroxy, and reason about the restart logic in the OTP supervision tree. This will take some time, as I am full-time on other projects at the moment. I will report back in due course.

In the meantime, I have found serverless headless Chrome to be an interesting option; you might too. As for Chroxy, a periodic restart trigger plus retry logic should work around the issue until it can be identified and resolved.

@thelightcosine
Author

@holsee Wow, what a response! I apologise for how my last comment may have come across. You're correct that your analysis is valid; what I should have said is that, unfortunately, the conclusion doesn't add up for us.

Here are the contents of our entire config.exs for the Chroxy server (which is built into a release using Distillery and deployed along with the client application on the same server):

# This file is responsible for configuring your application
# and its dependencies with the aid of the Mix.Config module.
use Mix.Config

config :logger,
  :console,
  metadata: [:request_id, :pid, :module],
  level: :debug

config :chroxy,
  chrome_remote_debug_port_from: "9222",
  chrome_remote_debug_port_to:  "9223"

config :chroxy, Chroxy.ProxyListener,
  host: "127.0.0.1",
  port: "1331"

config :chroxy, Chroxy.ProxyServer, packet_trace: false

config :chroxy, Chroxy.Endpoint,
  scheme: :http,
  port: "1330"

config :chroxy, Chroxy.ChromeServer,
  page_wait_ms:  "500",
  crash_dumps_dir: "/tmp",
  verbose_logging: 1,
  chrome_flags: [
    "--user-agent=\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36\"",
    "--disable-dev-shm-usage",
    "--disable-features=VizDisplayCompositor",
    "--enable-features=NetworkService,NetworkServiceInProcess",
    "--headless",
    "--disable-gpu",
    "--no-sandbox",
    "--ignore-certificate-errors",
    "--window-size=1280,768",
    "--hide-scrollbars",
    "--disable-translate",
    "--disable-extensions",
    "--disable-background-networking",
    "--safebrowsing-disable-auto-update",
    "--disable-sync",
    "--disable-default-apps",
    "--disable-background-timer-throttling",
    "--disable-renderer-backgrounding",
    "--disable-device-discovery-notifications",
    "--incognito",
    "--allow-running-insecure-content",
    "--v8-cache-options=off",
    "--aggressive-cache-discard",
    "--disable-cache",
    "--disable-application-cache",
    "--disable-offline-load-stale-cache",
    "--no-first-run"
  ]

The code in our Client application that launches the session (which is where it fails immediately in the bad state) is here:

  def start_session do
    retry with: exponential_backoff() |> randomize |> expiry(10_000),
          rescue_only: [MatchError] do
      session = ChroxyClient.page_session!(%{host: "localhost", port: 1330})
      {:ok, session}
    after
      result -> result
    else
      error -> error
    end
  end

We do call PageSession.stop(pid) when we are done with any given operation. During the last time I had the issue active I also attached to the Chroxy instance and confirmed that we did not have zombie PageSessions sitting around filling up the pool.
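For reference, tallying live processes by their OTP `$initial_call` from a shell attached to the Chroxy node is one way to spot leaks like that (a generic OTP sketch, not a Chroxy API):

```elixir
# Tally live processes by the module they were started from; leaked
# Chroxy.ProxyServer or Chroxy.ChromeProxy processes would show up
# here as counts that keep growing across sessions.
Process.list()
|> Enum.map(&Process.info(&1, :dictionary))
|> Enum.flat_map(fn
  {:dictionary, dict} ->
    case dict[:"$initial_call"] do
      {mod, _fun, _arity} -> [mod]
      _ -> []
    end

  nil ->
    []
end)
|> Enum.reduce(%{}, fn mod, acc -> Map.update(acc, mod, 1, &(&1 + 1)) end)
```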

I've also watched as best I could when we have this problem. It gets all the way to establishing the connection with Chrome, and as soon as that completes, Chroxy claims the upstream socket closed, even though our client code hasn't closed anything; the client then reports a connection timeout once the underlying socket is closed out from under it.

I may try to grab a tcpdump, if I can, the next time we see this. It's a very frustrating issue, because Chroxy works great 99% of the time and then we hit this. Chroxy is far superior to many of the other things we've tried, such as Hound.
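The capture would be something along these lines (ports taken from our config above; the loopback interface and root access are assumptions). The resulting pcap should show which side sends the first FIN/RST:

```shell
# Capture command for the next failure window (needs root): record the
# Chroxy API (1330) and proxy (1331) ports on loopback to a pcap file.
CAPTURE_CMD="tcpdump -i lo -w chroxy_fail.pcap 'tcp port 1330 or tcp port 1331'"
echo "$CAPTURE_CMD"
```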
