Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precision on date_group() is clamped at max value for unit instead of performing numerical conversion when n > max #359

Open
matiasandina opened this issue Aug 11, 2023 · 2 comments

Comments

@matiasandina
Copy link

matiasandina commented Aug 11, 2023

I am not sure if this is by design or not, but I really like the behavior of date_group() and it just exploded in my face when trying to use it with more than 60 minutes. Here's a reproducible example:

now <- Sys.time()
dts <- clock::date_seq(from = now, to =  now + lubridate::hours(5), by = 600)
tibble::tibble(dts = dts,
               min_60 = clock::date_group(dts, precision = "minute", n = 60),
               # gets clamped at 1 hour mark
               min_120 = clock::date_group(dts, precision = "minute", n = 120),
               h_2 = clock::date_group(dts, precision = "hour", n = 2)
)

# A tibble: 31 × 4
   dts                 min_60              min_120             h_2                
   <dttm>              <dttm>              <dttm>              <dttm>             
 1 2023-08-11 16:38:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 2 2023-08-11 16:48:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 3 2023-08-11 16:58:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 4 2023-08-11 17:08:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 5 2023-08-11 17:18:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 6 2023-08-11 17:28:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 7 2023-08-11 17:38:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 8 2023-08-11 17:48:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
 9 2023-08-11 17:58:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00
10 2023-08-11 18:08:29 2023-08-11 18:00:00 2023-08-11 18:00:00 2023-08-11 18:00:00
# ℹ 21 more rows
# ℹ Use `print(n = ...)` to see more rows

It would be nice for clock to automatically perform the conversion. If difficult to change, I think there should be some sort of verbose warning to the user, since the grouping gets performed silently and gives the illusion of it working properly.

This gets properly handled as I would expect with date_floor (conversion being made), so I'm not sure if date_group() is doing what it is supposed to be doing and I should be using date_floor instead.

tibble::tibble(dts = dts,
               min_60 = clock::date_group(dts, precision = "minute", n = 60),
               # gets clamped at 1 hour mark
               min_120 = clock::date_group(dts, precision = "minute", n = 120),
               h_2 = clock::date_group(dts, precision = "hour", n = 2),
               floor = clock::date_floor(dts, precision = "minute", n = 120)
)

# A tibble: 31 × 5
   dts                 min_60              min_120             h_2                 floor              
   <dttm>              <dttm>              <dttm>              <dttm>              <dttm>             
 1 2023-08-11 16:38:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 2 2023-08-11 16:48:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 3 2023-08-11 16:58:29 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 4 2023-08-11 17:08:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 5 2023-08-11 17:18:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 6 2023-08-11 17:28:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 7 2023-08-11 17:38:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 8 2023-08-11 17:48:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
 9 2023-08-11 17:58:29 2023-08-11 17:00:00 2023-08-11 17:00:00 2023-08-11 16:00:00 2023-08-11 16:00:00
10 2023-08-11 18:08:29 2023-08-11 18:00:00 2023-08-11 18:00:00 2023-08-11 18:00:00 2023-08-11 18:00:00
# ℹ 21 more rows
# ℹ Use `print(n = ...)` to see more rows
@DavisVaughan
Copy link
Member

The point of date_group() when used at "minute" precision is to group by n minutes within the current hour.

i.e. the "counter" which decides how to break up the groups resets at every hour.

Here is one of the Examples that tries to demonstrate this

x <- as.POSIXct("2019-01-01", "America/New_York")
x <- add_days(x, -3:5)

# Group by 2 days of the current month.
# Note that this resets at the beginning of the month, creating day groups
# of [29, 30] [31] [01, 02] [03, 04].
date_group(x, "day", n = 2)

If you just want to create buckets of 120 minutes anchored from an initial origin of 1970-01-01 00:00:00 then date_floor() is the right function for you.

date_floor() and date_group() work via very different algorithms for very different purposes
https://clock.r-lib.org/articles/clock.html#grouping-rounding-and-shifting

@matiasandina
Copy link
Author

Got it. My package function is a glorified wrapper of date_group() that basically bins before processing:

  data <- data %>%
    dplyr::mutate(bin = clock::date_group({{time_col}}, n = n, precision = precision))

I guess they should not change from minutes to hours like it's no big deal hoping for internal conversion. I am not sure whether edge cases should favor date_group() or floor_date() and it's a bit tough to think about the different use cases for the binning function (i.e., what do they expect when binning? should they try to use the same function to bin minutes, hours, and days?). It's likely that the floor_date() is the most intuitive behavior here, but I'm a bit uncertain because the design choice was to give freedom to the user (they can select whatever bin with one funtion). Unfortunately, they can come up with bins that produce different behavior:

x <- as.POSIXct("2019-01-01 23:00:00", "America/New_York")
x <- clock::add_minutes(x, seq(-119, 121, 18))

x
 [1] "2019-01-01 21:01:00 EST" "2019-01-01 21:19:00 EST" "2019-01-01 21:37:00 EST" "2019-01-01 21:55:00 EST"
 [5] "2019-01-01 22:13:00 EST" "2019-01-01 22:31:00 EST" "2019-01-01 22:49:00 EST" "2019-01-01 23:07:00 EST"
 [9] "2019-01-01 23:25:00 EST" "2019-01-01 23:43:00 EST" "2019-01-02 00:01:00 EST" "2019-01-02 00:19:00 EST"
[13] "2019-01-02 00:37:00 EST" "2019-01-02 00:55:00 EST"


# using `n` = 25 because 25 is not a divisor of 60 
# intuitively weird that the first block is 20:45 and not 21:00
clock::date_floor(x, "minute", n = 25)
 [1] "2019-01-01 20:45:00 EST" "2019-01-01 21:10:00 EST" "2019-01-01 21:35:00 EST" "2019-01-01 21:35:00 EST"
 [5] "2019-01-01 22:00:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 22:50:00 EST"
 [9] "2019-01-01 23:15:00 EST" "2019-01-01 23:40:00 EST" "2019-01-01 23:40:00 EST" "2019-01-02 00:05:00 EST"
[13] "2019-01-02 00:30:00 EST" "2019-01-02 00:55:00 EST"

# This is behavior might make more sense in this particular case
clock::date_group(x, "minute", n = 25)
 [1] "2019-01-01 21:00:00 EST" "2019-01-01 21:00:00 EST" "2019-01-01 21:25:00 EST" "2019-01-01 21:50:00 EST"
 [5] "2019-01-01 22:00:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 22:25:00 EST" "2019-01-01 23:00:00 EST"
 [9] "2019-01-01 23:25:00 EST" "2019-01-01 23:25:00 EST" "2019-01-02 00:00:00 EST" "2019-01-02 00:00:00 EST"
[13] "2019-01-02 00:25:00 EST" "2019-01-02 00:50:00 EST"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants