
Reimplemented email analytics prioritizing email opens #20914

Merged: 15 commits into main, Sep 5, 2024

Conversation

@9larsons 9larsons commented Sep 3, 2024

ref #20835

  • reimplemented the email analytics changes that prioritize opened events over other events, in order to speed up open analytics
  • added db persistence to the fetch-missing job to ensure we re-fetch every window of events; this is especially important if we restart following a large email batch

We learned a few things from the previous trial run of this. Namely, event throughput is not as high as we initially saw in the data, particularly for large databases. This set of changes is more conservative, while a touch more complicated, in ensuring we capture edge cases for really large newsletter sends (100k+ members).

In general, we want to make sure we're fetching new open events at least every 5 mins, and often much faster than that, unless it's a quiet period (suggesting we haven't had a newsletter send or much outstanding event data).

If we figure a somewhat conservative ~2,500 events/min, and the job runs on a 5-minute cron, we should try to keep each loop cycle to ~5 minutes or less so that the fetch logic, which now prioritizes opens, can repeat promptly.

Open events are the only events that update stats, so if people are pressing refresh, we want to do our best to keep that data up to date.

Meanwhile, if we don't find opens, we can move on to filling in the other event data, collecting missed events, and scheduled backfills, in that order.
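That priority order could be sketched roughly as follows. This is a hypothetical outline only: the function names and the 10k-per-run cap (mentioned in review below) are illustrative, not the actual Ghost internals.

```javascript
// Rough sketch of the fetch priority described above.
// Function names and the cap are illustrative, not Ghost's real API.
const MAX_OPENED_EVENTS_PER_RUN = 10000;

async function runAnalyticsJob(fetchers) {
    // 1. Opens first: they are the only events that update stats.
    const openedCount = await fetchers.fetchLatestOpenedEvents(MAX_OPENED_EVENTS_PER_RUN);

    // 2. If opens were quiet (we didn't hit the cap), spend the rest of
    //    the cycle on lower-priority work, in order.
    if (openedCount < MAX_OPENED_EVENTS_PER_RUN) {
        await fetchers.fetchLatestNonOpenedEvents();
        await fetchers.fetchMissingEvents();
        await fetchers.fetchScheduledBackfills();
    }
    return openedCount;
}
```

The key property is that a busy period (cap hit) defers everything except opens to a later cycle, while a quiet period lets the job catch up on the other event types.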
This effectively forces the jobs to fetch the last-run timestamps to set in the db, because this was previously implemented for a period and we didn't null those values when reverting that commit.
…usly handles reboots to ensure we re-fetch every event window and capture missing events
@github-actions github-actions bot added the migration [pull request] Includes migration for review label Sep 3, 2024

github-actions bot commented Sep 3, 2024

It looks like this PR contains a migration 👀
Here's the checklist for reviewing migrations:

General requirements

  • Satisfies idempotency requirement (both up() and down())
  • Does not reference models
  • Filename is in the correct format (and correctly ordered)
  • Targets the next minor version
  • All code paths have appropriate log messages
  • Uses the correct utils
  • Contains a minimal changeset
  • Does not mix DDL/DML operations
  • Tested in MySQL and SQLite

Schema changes

  • Both schema change and related migration have been implemented
  • For index changes: has been performance tested for large tables
  • For new tables/columns: fields use the appropriate predefined field lengths
  • For new tables/columns: field names follow the appropriate conventions
  • Does not drop a non-alpha table outside of a major version

Data changes

  • Mass updates/inserts are batched appropriately
  • Does not loop over large tables/datasets
  • Defends against missing or invalid data
  • For settings updates: follows the appropriate guidelines

@9larsons 9larsons changed the title Reimplement prioritize email opens Reimplemented email analytics prioritizing email opens Sep 3, 2024
@9larsons 9larsons requested a review from cmraible September 3, 2024 20:32

9larsons commented Sep 3, 2024

I'm thinking we may want to include some sort of backstop for the jobs table data in case it isn't cleared during this migration, the migration isn't run, etc. I don't know what would be reasonable and I also don't want to obscure the existing code with weird handling of cases like this. Ideas?

- updated batch data to aggregate to one message with timings
- updated aggregate queries to log times so we don't need to calculate it
- removed trying to run aggregation when we don't have any events, which led to extra logging and is unnecessary
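The batch-aggregation change in the first bullet could look something like this single summary line; this is a hypothetical helper, not the actual Ghost logging code.

```javascript
// Hypothetical sketch: collapse per-batch log entries into one
// summary message that already includes timings, so we don't have
// to calculate them separately afterwards.
function summarizeBatches(batches) {
    const totalEvents = batches.reduce((n, b) => n + b.count, 0);
    const totalMs = batches.reduce((ms, b) => ms + b.durationMs, 0);
    return `Fetched ${totalEvents} events across ${batches.length} batches in ${totalMs}ms`;
}
```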

The logging in general is quite verbose and more like debug logs. I'm good with that for the moment while we're working on this, but we can still be a bit more concise.

cmraible commented Sep 3, 2024

I don't know what would be reasonable and I also don't want to obscure the existing code with weird handling of cases like this. Ideas?

Maybe a silly idea, but could we use a new name for the job, so it wouldn't even look at the same rows if the migration doesn't run/fails for some reason? Maybe we could introduce a concept of job version, so the rows would become e.g. email-analytics-latest-opened-v2 or something along those lines? And if we ever come across a similar challenge in the future where we change the functionality of the job, we apply the same pattern and increment to v3?
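The versioning idea could be as small as a naming helper like the sketch below. The `email-analytics-latest-opened` name appears in this thread; the wrapper itself is invented here for illustration.

```javascript
// Hypothetical helper: derive a versioned job name so a reworked job
// never reads rows written by an earlier, incompatible implementation.
// v1 keeps the bare name for backwards compatibility with existing rows.
function versionedJobName(baseName, version) {
    return version > 1 ? `${baseName}-v${version}` : baseName;
}

versionedJobName('email-analytics-latest-opened', 2);
// → 'email-analytics-latest-opened-v2'
```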


9larsons commented Sep 3, 2024

While that does work, I don't see a need to version, and versioning would make future migrations to clear this out extra messy. I'd personally rather avoid that.


cmraible commented Sep 4, 2024

I think I'm missing something here, maybe you can help me understand. I'm not sure I see how this avoids us missing events with fetchMissing. I see the main changes here are that we're fetching a max of 10k opened events, rather than infinity.

As I understand this logic, if we fetch 10k opened events, we'll stop there and wait for the job to kick off again. If we end up fetching 10k opened events for a couple iterations in a row, couldn't we fall behind on fetchMissing again?

Again, I'm probably just confused! Maybe we can get on a call tomorrow AM to go over it 😃

These call the aggregateStats query. There was a bug in the second email set, which was set to a count of 6 but has no fixture data for email recipients, so the count was wrong. This should probably be dynamic rather than a fixed value... though this part of the code base is a bit fraught as it is, and I'm reluctant to touch anything in this context.
9larsons and others added 2 commits September 5, 2024 07:52
…3-20-09-40-null-analytics-jobs-timings.js

Co-authored-by: Daniel Lockyer <[email protected]>
@9larsons 9larsons merged commit a47298a into main Sep 5, 2024
21 checks passed
@9larsons 9larsons deleted the reimplement-prioritize-email-opens branch September 5, 2024 13:10