Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use flatware to parallelize CI specs #30284

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mjankowski
Copy link
Contributor

Previous versions of this:

Original PR - #27663 - has the most history of performance improvements here, but seems stuck on it's codecov/gh-actions integration ... I'm guessing this may be a result of the codecov action having been version bumped while the PR was open, and am going to replace that PR with this one as a final attempt here.

At this point I do see the same coverage percentages locally on main and this branch, and as far as I can tell the codecov integration should be working.

@mjankowski
Copy link
Contributor Author

CodeCov still showing a drop here. This is sort of hard to debug given how slow it is to iterate and that I’m not 100% sure how Codecov processes the uploaded files. One theory is that the formatter used is impacting the differences in reported results, and/or the combining of results. Did a bunch of local trial and error, and see:

Style Formatter LOC Results Percentage
Plain HTML 24858 / 28046 88.63%
Plain LCOV 25065 of 27769 90.3%
Plain XML 25065 / 27769 90.26%
Plain JSON 25065 / 27769 90.26%
Flatware HTML 24859 / 28046 88.64%
Flatware LCOV 25065 of 27769 90.3%
Flatware XML 24190 / 27769 87.11%
Flatware JSON 25065 / 27769 90.26%

Notably here, the local results (for HTML, JSON, LCOV) do seem to be combining correctly, with the XML format having a small drop off. I also experimented with disabling/enabling the branch coverage, but that doesn’t seem to impact.

When I download the generated file which is sent from GH actions to Codecov and run genhtml on it locally, I don’t see the same results as what Codecov shows in the UI.

Not sure how to proceed.

@mjankowski
Copy link
Contributor Author

@nschonni I think you originally added the codecov integration here - #23868 - any ideas on what we need to do here to get this output aligned with whatever it expects?

@nschonni
Copy link
Contributor

Not really sure, but it seems like they don't like the HTML reporter https://docs.codecov.com/docs/supported-report-formats

@mjankowski
Copy link
Contributor Author

Did more digging here. When I download the file attached to this commit - https://app.codecov.io/github/mastodon/mastodon/commit/cba103e23348888aed92057a185f957b2708ce13 - which was uploaded from gh actions to codecov, and I run it through genhtml locally to see the summary of the lcov data...

  • I see results locally of "82.1% (23069 of 28114 lines)"
  • But the CodeCov report on the PR is showing 76.78% coverage and calls that a loss of -5.28% from the BASE
  • Add those together and you get back to 82.06, which is presumably what I see (rounded) locally

In the current commit history on the PR - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - you can my initial commit (using LCOV) dropped the current project coverage (from https://app.codecov.io/gh/mastodon/mastodon showing just over 85%) down to 76.78%; then switching to JSON formatter bumped it up to 82.06%; then reverting that and going back to LCOV was once again 76.78%

I also wanted to rule out that the specs which ONLY run on CI (triggered by PAM/OIDC/etc env vars) were not relevant. When I set all those env vars locally, I see (using default LCOV format):

  • Main - lines.......: 90.6% (25150 of 27769 lines)
  • Flatware - lines.......: 90.6% (25150 of 27769 lines)

So it doesn't seem to be that either ... and then I also have no idea how this local 90.6% turns into the reported ~85% on codecov. Hmm.

@mjankowski
Copy link
Contributor Author

Unrelated to this flatware branch, when I download the coverage data for the current latest commit main - from https://app.codecov.io/gh/mastodon/mastodon/commit/3a7aec2807089a004db90851c66db0a007a18a48 - and then generate data for it locally...

  • Locally I see "90.6% (25148 of 27769 lines)"
  • CodeCov shows "24495 of 28736 lines covered" and 85.24%

All of which is to say -- I don't know what process codecov uses to convert the uploaded coverage data into the report it shows, and thus I have no idea how to debug this further.

If this PR is contingent on getting codecov to match, I'll likely just abandon it.

If we'd like to get the parallelism speedups and are comfortable with either a) this one-off codecov change that we can't explain, b) removing codecov integration -- that probably works. Or I guess if someone else has ideas here.

@renchap
Copy link
Sponsor Member

renchap commented May 14, 2024

Maybe you can open an issue here: https://github.com/codecov/feedback/issues ?
They seem to be responsive

@mjankowski
Copy link
Contributor Author

Good idea, opened one (linked).

@mjankowski mjankowski force-pushed the flatware-parallel-specs-rebased branch 6 times, most recently from 24b1d63 to 81b9dc0 Compare May 14, 2024 19:32
@mjankowski
Copy link
Contributor Author

RE: the source of the local differences with the HTML formatter vs any of the other formatters -- I think this comes down to the eager_load setting in test.rb environment (usually controlled by presence of CI env var). Running in this branch and doing some override of that setting, I see:

  • With html formatter and eager_load disabled — 24958 / 28030 LOC (89.04%) covered
  • With html formatter eager_load enabled — 25154 / 27774 LOC (90.57%) covered
  • With lcov formatter and eager_load enabled -- lines.......: 90.6% (25154 of 27774 lines)
  • With lcov and eager_load disabled -- lines.......: 89.0% (24958 of 28030 lines)

All of these have the full set of CI env vars enabled, hence the slightly higher numbers than typical local runs.

This is not super relevant to the codecov issues, but I wanted to get a good baseline to explain those differences so that I could be sure anything else that was changing was actually from the config, and not from formatter differences.

@mjankowski
Copy link
Contributor Author

RE: the seeming disconnect between codecov and local, I think this came down to a load order / config loading thing, and that the current codecov from this branch (after many force pushes with various things tried) does seem to match what I see locally - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - granted that is now HIGHER than the previous codecov number, rather than lower.

I'm going to pull out a separate branch which will include both an experiment re: that coverage number, and also just a useful extraction from this PR to get in first before the flatware changes. Will link to that.

@mjankowski
Copy link
Contributor Author

Other random note - the presence of tmp/rspec/examples.txt is used when present to inform how flatware splits the suite. Because I have this locally from multiple runs but CI doesn't have it, I think this was making the results I saw in terms of coverage more consistent, regardless of whether merging was working correctly or not. Made things harder to sort out.

@mjankowski
Copy link
Contributor Author

Ok, this is rebased against main after the simplecov config merged. I believe at this point I understand final discrepancy in the codecov changes -- there are a bunch of open issues/PRs against simplecov (which, side note, appears barely maintained) around the interaction of parallel testing and re-adding branch coverage (as opposed to line coverage). The current change on codecov is there because we still have line coverage but have dropped branch coverage (the setting is still on, but the results are just not generated fully) and thus codecov shows an effective coverage bump vs main because the C0 line coverage looks better than the full branch coverage.

In the previous simplecov config change commit I added a monkeypatch which prevents a nil error from being raised, but does NOT restore the branch coverage data correctly. As far I can tell, none of the open PRs there do this, and I'd have to dig much further than I really want to into simplecov to know where to start.

Options here:

  • Merge this as-is (or first more explicitly disable the branch coverage instead of just having it silently not work) -- we'll get the speedups, we'll preserve our codecov integration, we'll have this one-time seeming coverage bump (from ~85% to ~90%) which we now have an explanation for, and then we can monitor the relevant issues on simplecov repo and re-enable branch coverage if those are fixed
  • Pursue further, open our own simplecov issues, try to dig deeper for better monkey patch, etc ... and hold out for full branch coverage

Any feedback or other ideas appreciated.

@mjankowski mjankowski requested a review from a team May 24, 2024 14:40
@mjankowski mjankowski added performance Runtime performance testing Automated lint and test suites labels May 29, 2024
@mjankowski mjankowski force-pushed the flatware-parallel-specs-rebased branch from e5e8744 to f97f6b5 Compare May 31, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Runtime performance testing Automated lint and test suites
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants