Handling null ga_session_id in event_key creation #340
At first glance, I'm on board with this change. We would also need to filter out the formerly null session keys from the downstream session tables so that metrics don't shift suddenly, or add a configuration that defaults to filtering these out. For consistency's sake, we probably want to filter out the events whose event keys would formerly have been null as well and add a default configuration here too. Longer term, we should think about adding modeling options or instructions to the package, similar to GA4's data modeling, that would use these "null" key events and sessions. I don't actually know how to do the actual modeling, though. Is there a statistical function in BQ that does this already? We don't actually need to ship modeling in the package, but I do think the package should give people the option to model their data, and that means we need to handle this null issue with that in mind. @adamribaudo-velir, do you have any thoughts on handling null ga_session_ids?
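As a sketch of the "configuration that defaults to filtering these out" idea, something like this could gate the rows behind a variable; the variable name and model reference are assumptions, not existing package features:

```sql
-- Hypothetical: keep null-session rows out of downstream session models by default,
-- but let users opt in via a package variable. Names are placeholders.
with base_events as (
    select * from {{ ref('stg_ga4__events') }}  -- placeholder; substitute the package's staging model
)

select *
from base_events
{% if not var('include_missing_session_ids', false) %}
-- default behavior: drop rows whose ga_session_id (and therefore session key) was null
where ga_session_id is not null
{% endif %}
```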
Actually, it occurs to me that the modeled metrics would be a simple calculation: figure out the metric value per event or per session, multiply that value by the number of non-consented events or sessions, and then add that to the actual metric. This makes sense to do as a macro. Also, instead of having two staging models, like in my last comment, we include the consented and non-consented rows in the same model and then filter out the non-consented rows in our marts unless someone configures the filter variable to include non-consented data. I see this as being a two-stage change. In stage 1, we implement @EvavW's suggestion and add a filter variable. In stage 2, we create the modeling macro. I want to separate the second stage because we have a lot of logic in our session staging tables that filters out null ga_session_ids.
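A minimal sketch of what such a macro could look like, assuming BigQuery SQL and made-up argument names (this is not part of the package):

```sql
-- Hypothetical dbt macro: take the observed metric, derive its per-unit rate from the
-- consented events/sessions, scale that rate by the non-consented count, and add it back.
{% macro modeled_metric(metric, consented_units, non_consented_units) %}
    {{ metric }} + safe_divide({{ metric }}, {{ consented_units }}) * {{ non_consented_units }}
{% endmacro %}
```

A mart could then call something like {{ modeled_metric('purchases', 'consented_sessions', 'non_consented_sessions') }} as modeled_purchases, though all of those column names are placeholders and the real shape would depend on where the consented and non-consented counts live.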
Thanks for your response! I like the approach of configuring whether to include previously null sessions. On the question of naming, not every null ga_session_id is necessarily a non-consented session, so "non-consented" may not be the right label. Could you explain your thinking a bit more on why we would need a special macro to calculate metric values when including previously null sessions? For a given metric there should be values populated for previously null sessions, so why would we need to derive the metric value for previously null sessions from non-null session metric values?
Do you know what causes the null ga_session_ids? I was thinking of mimicking the behavioral modeling feature in GA4, but that's beyond the scope of the issue you've presented here. I'm thinking maybe a solution along these lines: replace the null IDs with explanatory values when building the client and session keys.
This way, data users can clearly identify the IDs that are null from the client and session key values. Then, we add a variable that controls whether to include these values. Do you want to code this up and get your name added to the contributor list, @EvavW? We'll keep any modeling stuff separate.
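For illustration only, the key construction being described might look something like this; the 'missing' label, the hashing, and the column names are assumptions rather than the actual proposal:

```sql
-- Hypothetical key construction: when ga_session_id is null, build an explanatory
-- session_key instead of letting the whole key become null.
select
    user_pseudo_id,
    ga_session_id,
    case
        when ga_session_id is null
            then concat('missing_ga_session_id-', user_pseudo_id)
        else to_base64(md5(concat(user_pseudo_id, cast(ga_session_id as string))))
    end as session_key
from source_events  -- placeholder table name
```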
Sure, I'd be happy to open a PR for this. I'm not sure yet why ga_session_id ends up null. I'm rethinking this after looking at your solution: populating client_key and session_key with explanatory values might be premature until we understand the cause. We could leave those keys null for now and just make sure event_key gets populated.
The various session models already include some logic that filters on these null values. My solution follows a best practice of replacing null values with explanatory values; however, the one that you showed does not require any meddling with the session models, and storing null values is free in BQ. I'm fine with either solution. There are good-enough reasons to choose either.

We need two types of logic when modeling the missing data: event-scope logic and session-scope logic. Writing the two out below, it seems like for event-scope logic we want to have the rows with null-replaced keys, but for sessions, we don't need rows with null-replaced keys.

Events
This model is built on top of the existing fact and dim tables, or possibly as a replacement for one of them.
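A sketch of that event-scope logic, assuming placeholder table names and that event_key is populated even when ga_session_id is null:

```sql
-- Event-scope sketch: unique events can still be counted because event_key is never null,
-- and rows with a missing ga_session_id can be split out for modeling.
-- "Consented" here simply means the row had a ga_session_id, per the thread's working assumption.
select
    event_date,
    event_name,
    count(distinct event_key) as unique_events,
    count(distinct if(ga_session_id is not null, event_key, null)) as consented_unique_events,
    count(distinct if(ga_session_id is null, event_key, null)) as non_consented_unique_events
from event_fact  -- placeholder for the existing fact/dim tables
group by event_date, event_name
```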
Sessions
For sessions, we just need to count consented sessions and count consented events to get an events_per_session, and then use the inverse of that to calculate a modeled session total from events_per_session and the total of consented and non-consented events. This model is built on top of the existing fact and dim tables and requires a new daily total metrics table for the above calculations.
The above code is rough and only meant to demonstrate how you would use the 'null' values in downstream models.
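A rough sketch of that session-scope calculation as a daily totals query; table and column names are assumptions, not the package's actual models:

```sql
-- Session-scope sketch: derive events-per-consented-session, then estimate a modeled
-- session total from all events (consented + non-consented).
with daily_totals as (
    select
        event_date,
        count(distinct if(ga_session_id is not null, session_key, null)) as consented_sessions,
        countif(ga_session_id is not null) as consented_events,
        count(*) as total_events
    from event_fact  -- placeholder for a new daily total metrics source
    group by event_date
)

select
    event_date,
    consented_sessions,
    safe_divide(consented_events, consented_sessions) as events_per_session,
    -- modeled sessions = total events divided by events per consented session
    safe_divide(total_events, safe_divide(consented_events, consented_sessions)) as modeled_sessions
from daily_totals
```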
Thanks for that explanation; I'll look out for other explanatory values in the package. On the investigative side, we're still looking into potential causes of the missing ga_session_id, but we tentatively believe that it is related to consent as well. Another note: at least in our case, we only have missing ga_session_ids in the Firebase stream. We are looking into the possibility that there is a better Firebase-specific parameter that shows up more consistently than ga_session_id. While we figure all that out, I'll open a PR for the solution that only fills in nulls in event_key. When we have some answers, I could check back in about a potential second PR to put in explanatory values for client_key and session_key that we feel confident about.
@dgitis Also, could you grant me permission to contribute to the repo?
Contributions are blocked so that changes can be reviewed properly. Just create the PR, run the various test suites, and then request a review from me and/or @adamribaudo-velir. Once the PR has been approved, Adam will merge it into main.
And I should point out that replacing null values with explanatory values is a best practice that we don't currently follow, which is one of the reasons I'm fine with you ignoring it, but it is also why I was asking which circumstances produce different combinations of null values. Ideally, we would have explanatory values that reflect those circumstances.
I'm just trying to push to a new branch, not to main
I haven't contributed to a public repo before, so I might be doing something wrong here.
@dgitis Any idea what I'm missing to be able to open a PR?
@EvavW you'll need to fork the repo into your account first, create a branch, and then when you create the PR you should have the option to submit it against this repo. Some notes here: https://stackoverflow.com/questions/77996542/submitting-a-pull-request-for-original-repository

I've had it on my TODO list to weigh in on this thread. Sorry for the delay.
@adamribaudo-velir thank you!
Opened #341
PR looks good. I think modeling should be outside the scope of this project, @dgitis; it's too opinionated. There's no "right" way to model missing data and it's very context-dependent. I'm open to adding anything to the 'analysis' folder (which is never actually run by dbt) as examples, though. BigQuery has some nice methods of generating ML models from SQL.
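As a point of reference, a minimal BigQuery ML sketch of the kind of thing that could live in the analysis folder as an example; the project, dataset, table, and column names are all placeholders:

```sql
-- Minimal BigQuery ML example (placeholder names): a simple regression trained directly
-- in SQL that could be used to estimate daily session totals from consented-event counts.
create or replace model `my_project.analytics.sessions_model`
options (
    model_type = 'linear_reg',
    input_label_cols = ['sessions']
) as
select
    consented_events,
    day_of_week,
    sessions
from `my_project.analytics.daily_session_totals`
where sessions is not null;
```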
* generate event_key even when session_key is null
* Revert "generate event_key even when session_key is null". This reverts commit 1753ca1.
* generate event_key even when session_key is null
* update yml
* fine-tune event key
* update comment
* update inline comment
When ga_session_id is null, so is event_key. This is problematic when using event_key to count unique events. Null event_keys could be avoided by using array_to_string to join keys rather than concat, for example as in the sketch below. This should generate the same keys as the current approach when ga_session_id is not null. The same approach could also handle null user_pseudo_ids.

Are there any concerns with this approach?
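A rough illustration of the difference; the exact key composition in the package may differ, and the column choices here are only for demonstration:

```sql
-- concat() returns NULL if any argument is NULL, so a null ga_session_id nulls the whole key.
-- array_to_string() skips NULL array elements, so the key survives a null ga_session_id.
-- With an empty delimiter, both expressions produce the same key when nothing is null.
select
    to_base64(md5(concat(
        user_pseudo_id,
        cast(ga_session_id as string),
        event_name,
        cast(event_timestamp as string)
    ))) as event_key_concat,
    to_base64(md5(array_to_string([
        user_pseudo_id,
        cast(ga_session_id as string),
        event_name,
        cast(event_timestamp as string)
    ], ''))) as event_key_array_to_string
from source_events  -- placeholder table name
```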