
Istio standard metrics do not increase after 1 hour or so even though real-time traffic is flowing #51100

Open
PGpmg opened this issue May 13, 2024 · 10 comments · May be fixed by istio/proxy#5592

Comments

@PGpmg

PGpmg commented May 13, 2024

Hi,

We recently upgraded Istio from 1.16.4 to 1.19.1 and started noticing an issue where Istio standard metrics such as the istio_tcp_received_bytes_total and istio_tcp_sent_bytes_total counters stop increasing after an hour or so of deployment.

Once we restart the application K8s pods, or update the Istio configmap by removing and re-adding defaultProviders:metrics:prometheus, the Istio standard metrics start working again, but they stop after some time.
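
For context, the defaultProviders setting mentioned above lives in mesh config. A minimal sketch of re-applying it, assuming an IstioOperator-based install (adjust for other install methods; the file name is hypothetical):

# re-applies the default Prometheus metrics provider in mesh config
cat <<'EOF' > mesh-metrics.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultProviders:
      metrics:
      - prometheus
EOF
istioctl install -f mesh-metrics.yaml -y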

Current Istio version: 1.19.1
K8s version: 1.26

We have noticed the same issue with Istio 1.21.2 version in our environment.

Any leads to resolve this issue would be highly appreciated.

@SanjayaKumarSahoo

Requesting help on this, since the issue happens in a sporadic manner.

@kyessenov
Contributor

CC @zirain. This sounds like it is related to metric rotation? We need Prometheus to scrape often enough.

@zirain
Member

zirain commented May 15, 2024

Metric rotation is disabled by default.

@SanjayaKumarSahoo

SanjayaKumarSahoo commented May 16, 2024

Thanks for the input,

We tried setting the env variable METRIC_ROTATION_INTERVAL to 10s in the pilot config. Initially the metrics started flowing in, but then we observed that they stopped coming again, so we had to remove the env variable.
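
(For reference, a minimal sketch of one way to push this interval to the sidecars, via meshConfig.defaultConfig.proxyMetadata; the variable name is taken from the thread, but this placement is an assumption, since the comment above sets it in the pilot config:)

# assumes an IstioOperator-based install; file name is hypothetical
cat <<'EOF' > metric-rotation.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        METRIC_ROTATION_INTERVAL: "10s"
EOF
istioctl install -f metric-rotation.yaml -y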

When we port-forward to the pod's Envoy metrics endpoint ("http://{host}:15090/stats/prometheus"), we observe that the TCP sent/received bytes counters are not increasing, even though data processing is happening.
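
A hedged example of that check (pod and namespace are placeholders):

kubectl -n <namespace> port-forward pod/<pod-name> 15090:15090 &
# the counters below should move while traffic is flowing
curl -s http://localhost:15090/stats/prometheus | grep 'istio_tcp_.*_bytes_total'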

Requesting help on this.


@zirain
Member

zirain commented May 16, 2024

istio/api#3121 (comment)

@zirain zirain pinned this issue May 16, 2024
@zirain zirain unpinned this issue May 16, 2024
@zirain zirain transferred this issue from istio/istio.io May 16, 2024
@PGpmg
Author

PGpmg commented May 27, 2024

While investigating this issue in detail, we discovered a behavioural change between the 1.18.7 and 1.19.0 versions. We tested with a sample application which writes data to the DB continuously in a loop.

Observations with 1.18.7 and below: as the application writes data to the DB, we can see the Istio TCP standard metrics get populated immediately and keep increasing.

PG@C02G40YWMD6M istio-1.18.7 % bin/istioctl x es pvos-switch-state-publisher-c4f47bf8d-6r7qn.acp-system -oprom | grep _bytes_total
TYPE envoy_cluster_upstream_cx_rx_bytes_total counter
envoy_cluster_upstream_cx_rx_bytes_total{cluster_name="xds-grpc"} 102980
TYPE envoy_cluster_upstream_cx_tx_bytes_total counter
envoy_cluster_upstream_cx_tx_bytes_total{cluster_name="xds-grpc"} 38003
TYPE istio_tcp_received_bytes_total counter
istio_tcp_received_bytes_total{reporter="source",source_workload="pvos-switch-state-publisher",source_canonical_service="pvos-switch-state-publisher",source_canonical_revision="latest",source_workload_namespace="acp-system",source_principal="unknown",source_app="pvos-switch-state-publisher",source_version="",source_cluster="Kubernetes",destination_workload="cnx-arango-cluster-crdn-4e7dkxzr-53ae10",destination_workload_namespace="default",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="cnx-arango-cluster.default.svc.cluster.local",destination_canonical_service="arangodb",destination_canonical_revision="latest",destination_service_name="cnx-arango-cluster",destination_service_namespace="default",destination_cluster="Kubernetes",request_protocol="tcp",response_flags="-",connection_security_policy="unknown"} 81077
TYPE istio_tcp_sent_bytes_total counter
istio_tcp_sent_bytes_total{reporter="source",source_workload="pvos-switch-state-publisher",source_canonical_service="pvos-switch-state-publisher",source_canonical_revision="latest",source_workload_namespace="acp-system",source_principal="unknown",source_app="pvos-switch-state-publisher",source_version="",source_cluster="Kubernetes",destination_workload="cnx-arango-cluster-crdn-4e7dkxzr-53ae10",destination_workload_namespace="default",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="cnx-arango-cluster.default.svc.cluster.local",destination_canonical_service="arangodb",destination_canonical_revision="latest",destination_service_name="cnx-arango-cluster",destination_service_namespace="default",destination_cluster="Kubernetes",request_protocol="tcp",response_flags="-",connection_security_policy="unknown"} 110884

Observations with 1.19.0 and above, up to 1.22.0: as the application writes data to the DB, we see the Istio TCP standard metrics get populated only after the connection is added to the cleanup list (in the istio-proxy debug logs we can see the message "adding to cleanup list"). Until the connection is terminated, we don't see the Istio standard metrics.

PG@C02G40YWMD6M istio-1.19.0 % bin/istioctl x es pvos-switch-state-publisher-c4f47bf8d-l72wr.acp-system -oprom | grep _bytes_total
TYPE envoy_cluster_upstream_cx_rx_bytes_total counter
envoy_cluster_upstream_cx_rx_bytes_total{cluster_name="xds-grpc"} 167743
TYPE envoy_cluster_upstream_cx_tx_bytes_total counter
envoy_cluster_upstream_cx_tx_bytes_total{cluster_name="xds-grpc"} 41407
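
(For anyone reproducing this, the "adding to cleanup list" message mentioned above is a debug-level log; one way to surface it, assuming istioctl is available against the cluster, is to raise the sidecar's log level:)

istioctl proxy-config log <pod-name>.<namespace> --level debug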

NOTE: up to Istio 1.18.7, stats are captured properly in our clusters; there are no issues with metrics not being reported. But from 1.19.0 onwards, Istio stops reporting after an hour or so.

Could you please help us understand this behavioural change in detail? Could it be a concern for getting proper stats?

@SanjayaKumarSahoo

SanjayaKumarSahoo commented May 27, 2024

Hi @kyessenov, in 1.19 we can see there is a PR (istio/proxy#4887) on metadata exchange. Do you think it could have introduced the above behavior? Requesting your input on this.

@PGpmg
Author

PGpmg commented May 31, 2024

Found an issue in the istio proxy code. Please check.

Issue:
When updating the peerId in the metadata-not-found case, the key used is "wasm.envoy.wasm.metadata_exchange.peer_unknown" (the "wasm." prefix is added later, in the updatePeerId() function). But when fetching the peerInfo, the key "envoy.wasm.metadata_exchange.peer_unknown" is used, without the "wasm." prefix.

Updating peer id:
https://github.com/istio/proxy/blob/1.19.0/source/extensions/filters/network/metadata_exchange/metadata_exchange.cc#L314

Fetching the peerInfo:
https://github.com/istio/proxy/blob/1.19.0/source/extensions/filters/http/istio_stats/istio_stats.cc#L97

The "envoy.wasm.metadata_exchange.peer_unknown" change was added in 1.19.0; before that, the keys were either downstream_peer_id or upstream_peer_id. Hence, stats were reported up to 1.18.7 and stopped from 1.19.0 onwards.

As @SanjayaKumarSahoo pointed out, this got introduced as part of PR istio/proxy#4887.
@kyessenov, could you please help fix this issue ASAP?

@PGpmg
Author

PGpmg commented Jun 6, 2024

Any update on this issue, please?

@zirain
Member

zirain commented Jun 6, 2024

Kuat is out; I will take a look if I have bandwidth. I think it happens after 1h because the idle timeout defaults to 1h.
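
(One way to test whether the 1h idle timeout is the relevant variable might be to shorten the TCP idle timeout for the affected destination and check whether the point at which the counters change moves accordingly. A minimal sketch, assuming your Istio release supports connectionPool.tcp.idleTimeout in DestinationRule; check the reference docs for your version, and replace the placeholder host, name and namespace:)

kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: tcp-idle-timeout-test   # hypothetical name
  namespace: <namespace>
spec:
  host: <destination-service>.<namespace>.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        idleTimeout: 300s
EOF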

@zirain zirain linked a pull request Jun 6, 2024 that will close this issue