Feat/early stop #109
Conversation
Without
However, it induces a panic if I add
@dae, @nathanielsimard, could you help me?
Can you verify that the file actually exists? Early stopping is reading the file logged at
To speed up the training, dae disabled the logger. I think it's unsafe to read the file during training, and the file would also induce other errors. For example, our project trains the model concurrently, so the log file gets written by multiple loggers at once, which causes floating-point parsing errors.
@L-M-Sherlock Early stopping involves using metrics collected through loggers with an event store. If you are using
There is also an in-memory logger available, but it would consume a significant amount of RAM to store all events in memory. However, it can be a viable option if your intention is to only use
I'm going to enable that option with the builder!
In fact you can already enable it with the following:

```rust
builder
    .metric_loggers(InMemoryMetricLogger::default(), InMemoryMetricLogger::default())
```

Let me know if it helps!
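For context, here is roughly how the in-memory loggers and an early-stopping strategy could be combined on the builder. This is a minimal sketch assuming burn-train's public API at the time (`MetricEarlyStoppingStrategy`, `StoppingCondition`, `Aggregate`, `Direction`, `Split`, `InMemoryMetricLogger`); exact module paths and the surrounding model/optimizer setup are assumptions and may differ between burn versions:

```rust
use burn::train::{
    logger::InMemoryMetricLogger,
    metric::{store::{Aggregate, Direction, Split}, LossMetric},
    LearnerBuilder, MetricEarlyStoppingStrategy, StoppingCondition,
};

// `artifact_dir`, `model`, `optimizer`, and `lr_scheduler` come from the
// usual training setup and are assumed to exist here.
let learner = LearnerBuilder::new(artifact_dir)
    // Keep metric events in RAM so early stopping never reads the log files
    // that concurrent training runs might be writing to.
    .metric_loggers(InMemoryMetricLogger::default(), InMemoryMetricLogger::default())
    // Stop when the validation loss has not improved for one epoch.
    .early_stopping(MetricEarlyStoppingStrategy::new::<LossMetric<B>>(
        Aggregate::Mean,
        Direction::Lowest,
        Split::Valid,
        StoppingCondition::NoImprovementSince { n_epochs: 1 },
    ))
    .build(model, optimizer, lr_scheduler);
```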
@L-M-Sherlock we could probably solve the file corruption issue by using a separate artifact dir for each of the training runs we invoke, but I'd suggest we try the in-memory approach first, as I'd like to avoid file I/O if possible.
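For what it's worth, the per-run artifact dir fallback could be as simple as the sketch below; `base` and `run_id` are hypothetical names used only for illustration:

```rust
use std::path::PathBuf;

/// Give each concurrent training run its own artifact directory so the
/// file-based metric loggers never share (and corrupt) the same log files.
/// `base` and `run_id` are hypothetical parameters for this sketch.
fn artifact_dir_for_run(base: &str, run_id: usize) -> std::io::Result<PathBuf> {
    let dir = PathBuf::from(base).join(format!("run-{run_id}"));
    std::fs::create_dir_all(&dir)?;
    Ok(dir)
}
```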
It improves the model slightly: open-spaced-repetition/srs-benchmark@665a423
Curiously, I get different results when I rebuild the result files on my machine:
@dae If you use multiple threads to load the data, then the order in which the items are loaded isn't deterministic. It might explain the difference.
Sorry, I probably shouldn't have mentioned THREADS; we use that for processing different files in parallel, so it shouldn't affect things. But it does appear we have some non-determinism somewhere in our code, as back-to-back runs are producing different values.
If you use multiple workers for the dataloader, you have non-determinism!
We use a single worker for the dataloader. We're running multiple training runs in parallel, with separate data, and then taking the mean of the outputs. @L-M-Sherlock maybe it's caused by the use of a hashmap in filter_outlier?
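To illustrate the suspicion (purely a sketch, not the actual filter_outlier code): iterating a std `HashMap` yields a different key order on every run because of randomized hashing, so any computation that depends on that order becomes non-deterministic; sorting the keys (or switching to a `BTreeMap`) restores a stable order:

```rust
use std::collections::HashMap;

// Order-sensitive processing over a HashMap is non-deterministic:
// the key order changes from run to run due to randomized hashing.
fn keys_in_hash_order(groups: &HashMap<u32, Vec<f32>>) -> Vec<u32> {
    groups.keys().copied().collect()
}

// Sorting the keys first makes any downstream computation reproducible.
fn keys_in_stable_order(groups: &HashMap<u32, Vec<f32>>) -> Vec<u32> {
    let mut keys: Vec<u32> = groups.keys().copied().collect();
    keys.sort_unstable();
    keys
}
```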
It is only used by the pretrain step.
Waiting for benchmarks.