Feat/early stop #109

L-M-Sherlock · 2023-10-23T08:35:15Z

Waiting for benchmarks.

L-M-Sherlock · 2023-10-23T10:10:05Z

Without .metric_valid_numeric(LossMetric::new()), early stopping is disabled:

2023-10-23T10:07:13.474593Z  WARN burn_train::learner::early_stopping: Can't find metric for early stopping.

However, adding .metric_valid_numeric(LossMetric::new()) causes a panic:

thread '<unnamed>' panicked at /Users/jarrettye/.cargo/git/checkouts/burn-acfbee6a141c1b41/d263968/burn-train/src/logger/file.rs:26:14:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/jarrettye/.cargo/git/checkouts/burn-acfbee6a141c1b41/d263968/burn-train/src/metric/store/client.rs:35:56:
called `Result::unwrap()` on an `Err` value: SendError { .. }
thread '<unnamed>' panicked at /Users/jarrettye/.cargo/git/checkouts/burn-acfbee6a141c1b41/d263968/burn-train/src/metric/store/client.rs:142:40:
called `Result::unwrap()` on an `Err` value: SendError { .. }
stack backtrace:
   0:        0x100f3c698 - std::backtrace_rs::backtrace::libunwind::trace::h815672f5996c9e34
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x100f3c698 - std::backtrace_rs::backtrace::trace_unsynchronized::h2e094b2d8c270437
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x100f3c698 - std::sys_common::backtrace::_print_fmt::hae24b71afea95695
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x100f3c698 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6d4268b2ed62fb94
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x100f5b0b4 - core::fmt::rt::Argument::fmt::h835101b6de0df17f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:        0x100f5b0b4 - core::fmt::write::h5d55d44549819258
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:        0x100f3967c - std::io::Write::write_fmt::h1da98de0250c868b
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:        0x100f3c4d8 - std::sys_common::backtrace::_print::h5844dab5cfa39fca
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x100f3c4d8 - std::sys_common::backtrace::print::h2c300c1ebedfc73c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x100f3e0e0 - std::panicking::default_hook::{{closure}}::h0aa9be5c44269370
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:        0x100f3dd68 - std::panicking::default_hook::h2c0ef097934ee9e6
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:287:9
  11:        0x100d34f40 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb681ba0e6970f985
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  12:        0x100d34f40 - test::test_main::{{closure}}::h139556dd544e4d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/test/src/lib.rs:136:21
  13:        0x100f3e760 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h9b865254accc03e0
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  14:        0x100f3e760 - std::panicking::rust_panic_with_hook::h84c8637cb6e56008
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:711:13
  15:        0x100f3e534 - std::panicking::begin_panic_handler::{{closure}}::h25482adda06c7b7f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:599:13
  16:        0x100f3cb24 - std::sys_common::backtrace::__rust_end_short_backtrace::h0c6f3beb22324a29
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  17:        0x100f3e298 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  18:        0x100f7b978 - core::panicking::panic_fmt::h9072a0246ecafd14
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
  19:        0x100f7bca0 - core::result::unwrap_failed::hd7600d03a3086be4
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
  20:        0x100d677c4 - <burn_train::metric::store::client::EventStoreClient as core::ops::drop::Drop>::drop::hac3de293bc98f87c
  21:        0x100cc20c0 - alloc::sync::Arc<T,A>::drop_slow::h4ad631dee4a040b0
  22:        0x100ceeed0 - burn_train::learner::train_val::<impl burn_train::learner::base::Learner<LC>>::fit::h68ffa49241ff58b3
  23:        0x100cca358 - fsrs::training::train::h5bb166eb7e87e2a1
  24:        0x100cd66a4 - core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut::hdc95c68496afae6a
  25:        0x100d098cc - rayon::iter::plumbing::Folder::consume_iter::h212f6420163efdbc
  26:        0x100d08eec - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  27:        0x100c09980 - rayon_core::join::join_context::{{closure}}::h6a27a073252e59e7
  28:        0x100d08f9c - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  29:        0x100c56fd0 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb04848c2d7031caa
  30:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  31:        0x100c09364 - rayon_core::join::join_context::{{closure}}::h5a47a8a9661a6cda
  32:        0x100d08bf4 - rayon::iter::plumbing::bridge_producer_consumer::helper::h6742b0a3a0de3f6b
  33:        0x100c56da8 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hae76227955ff1fe8
  34:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  35:        0x100f13854 - rayon_core::registry::ThreadBuilder::run::hac19021355221cc4
  36:        0x100f15b74 - std::sys_common::backtrace::__rust_begin_short_backtrace::h12f5dea2e54a2c2d
  37:        0x100f18b1c - core::ops::function::FnOnce::call_once{{vtable.shim}}::hac809be33ae58ba8
  38:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hee56534d4cc78b31
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  39:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hf1a328a6507f3fa7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  40:        0x100f42344 - std::sys::unix::thread::Thread::new::thread_start::h6244aa8f646b01ac
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  41:        0x1959a7fa8 - __pthread_joiner_wake
thread '<unnamed>' panicked at library/core/src/panicking.rs:126:5:
panic in a function that cannot unwind
stack backtrace:
   0:        0x100f3c698 - std::backtrace_rs::backtrace::libunwind::trace::h815672f5996c9e34
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x100f3c698 - std::backtrace_rs::backtrace::trace_unsynchronized::h2e094b2d8c270437
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x100f3c698 - std::sys_common::backtrace::_print_fmt::hae24b71afea95695
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x100f3c698 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6d4268b2ed62fb94
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x100f5b0b4 - core::fmt::rt::Argument::fmt::h835101b6de0df17f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:        0x100f5b0b4 - core::fmt::write::h5d55d44549819258
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:        0x100f39768 - std::io::Write::write_fmt::hc515897f91abd6cf
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:        0x100f3c4d8 - std::sys_common::backtrace::_print::h5844dab5cfa39fca
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x100f3c4d8 - std::sys_common::backtrace::print::h2c300c1ebedfc73c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x100f3e0e0 - std::panicking::default_hook::{{closure}}::h0aa9be5c44269370
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:        0x100f3de0c - std::panicking::default_hook::h2c0ef097934ee9e6
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:290:9
  11:        0x100d34f40 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb681ba0e6970f985
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  12:        0x100d34f40 - test::test_main::{{closure}}::h139556dd544e4d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/test/src/lib.rs:136:21
  13:        0x100f3e760 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h9b865254accc03e0
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  14:        0x100f3e760 - std::panicking::rust_panic_with_hook::h84c8637cb6e56008
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:711:13
  15:        0x100f3e4f8 - std::panicking::begin_panic_handler::{{closure}}::h25482adda06c7b7f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:597:13
  16:        0x100f3cb24 - std::sys_common::backtrace::__rust_end_short_backtrace::h0c6f3beb22324a29
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  17:        0x100f3e298 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  18:        0x100f7b9a8 - core::panicking::panic_nounwind_fmt::ha4f6bbf8fe9fb5ae
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:96:14
  19:        0x100f7ba24 - core::panicking::panic_nounwind::ha756411eedbc9594
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:126:5
  20:        0x100f7ba94 - core::panicking::panic_cannot_unwind::h496677c8c27b4397
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:189:5
  21:        0x100ceeedc - burn_train::learner::train_val::<impl burn_train::learner::base::Learner<LC>>::fit::h68ffa49241ff58b3
  22:        0x100cca358 - fsrs::training::train::h5bb166eb7e87e2a1
  23:        0x100cd66a4 - core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut::hdc95c68496afae6a
  24:        0x100d098cc - rayon::iter::plumbing::Folder::consume_iter::h212f6420163efdbc
  25:        0x100d08eec - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  26:        0x100c09980 - rayon_core::join::join_context::{{closure}}::h6a27a073252e59e7
  27:        0x100d08f9c - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  28:        0x100c56fd0 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb04848c2d7031caa
  29:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  30:        0x100c09364 - rayon_core::join::join_context::{{closure}}::h5a47a8a9661a6cda
  31:        0x100d08bf4 - rayon::iter::plumbing::bridge_producer_consumer::helper::h6742b0a3a0de3f6b
  32:        0x100c56da8 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hae76227955ff1fe8
  33:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  34:        0x100f13854 - rayon_core::registry::ThreadBuilder::run::hac19021355221cc4
  35:        0x100f15b74 - std::sys_common::backtrace::__rust_begin_short_backtrace::h12f5dea2e54a2c2d
  36:        0x100f18b1c - core::ops::function::FnOnce::call_once{{vtable.shim}}::hac809be33ae58ba8
  37:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hee56534d4cc78b31
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  38:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hf1a328a6507f3fa7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  39:        0x100f42344 - std::sys::unix::thread::Thread::new::thread_start::h6244aa8f646b01ac
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  40:        0x1959a7fa8 - __pthread_joiner_wake
thread caused non-unwinding panic. aborting.
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/Users/jarrettye/Codes/open-spaced-repetition/fsrs-rs/target/release/deps/fsrs-b7858974c2ace16a` (signal: 6, SIGABRT: process abort signal)

@dae, @nathanielsimard, could you help me?

nathanielsimard · 2023-10-23T12:35:36Z

@dae, @nathanielsimard, could you help me?

Can you verify that the file actually exists? Early stopping reads the file logged at {{artifact}}/valid/epoch-1/Loss.log. The error message is missing information; I'll try to improve that!

L-M-Sherlock · 2023-10-23T13:05:08Z

Can you verify that the file actually exists? Early stopping reads the file logged at {{artifact}}/valid/epoch-1/Loss.log.

To speed up training, dae disabled the logger. I think it's unsafe to read a file during training, and the file would also cause other errors. For example, our project trains models concurrently, so the log file would be written by multiple loggers, which causes float parsing errors.

nathanielsimard · 2023-10-23T13:38:12Z

@L-M-Sherlock Early stopping involves using metrics collected through loggers with an event store. If you are using burn-train to collect metrics, there should be no concurrency errors, as each logger operates on its own thread and communicates through a channel. This setup makes it unlikely to slow down the training process. Parsing errors may occur, but they are generally only a concern if there is file corruption, which is rare since the file is always accessed by the same thread.
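The thread-plus-channel arrangement described above can be sketched with the standard library alone. This is a minimal illustration, not burn-train's actual implementation: the `Event` enum, `spawn_logger`, `log_losses`, and `send_after_logger_exit` are hypothetical names. The last function shows that once the logger thread has exited, `send` returns an error, which is one plausible reading of the `SendError` panics that follow the file error in the trace above.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical event type; a real event store carries more information.
enum Event {
    Loss(f64),
    Stop,
}

/// Spawns a logger thread that owns all recorded values. Returns the sending
/// half of its channel and a handle for collecting what was recorded.
fn spawn_logger() -> (Sender<Event>, JoinHandle<Vec<f64>>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut losses = Vec::new();
        // Only this thread touches `losses`, so no lock is needed.
        for event in rx {
            match event {
                Event::Loss(value) => losses.push(value),
                Event::Stop => break,
            }
        }
        losses
    });
    (tx, handle)
}

/// Sends loss values to a logger thread and returns what it recorded.
fn log_losses(values: &[f64]) -> Vec<f64> {
    let (tx, handle) = spawn_logger();
    for &value in values {
        tx.send(Event::Loss(value)).expect("logger thread has exited");
    }
    tx.send(Event::Stop).expect("logger thread has exited");
    handle.join().expect("logger thread panicked")
}

/// Returns true if sending fails after the logger thread has finished,
/// because the receiving half of the channel has been dropped.
fn send_after_logger_exit() -> bool {
    let (tx, handle) = spawn_logger();
    tx.send(Event::Stop).expect("logger thread has exited");
    handle.join().expect("logger thread panicked");
    tx.send(Event::Loss(0.0)).is_err()
}
```

Because the training thread only sends messages, it never waits on file I/O; the cost is that a failure inside the logger thread surfaces later as a send error rather than at the point of failure.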

There is also an in-memory logger available, but it would consume a significant amount of RAM to store all events in-memory. However, it can be a viable option if your intention is to only use LossMetric.

I'm going to enable that option with the builder!
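For the loss-only case, an in-memory early-stopping check needs very little state. The sketch below is an assumption-laden illustration in plain Rust, not burn-train's strategy type: `EarlyStopper`, `patience`, and `stopping_epoch` are hypothetical names, and the rule used is "stop after `patience` consecutive epochs without a new best validation loss".

```rust
/// Hypothetical in-memory early-stopping tracker (not a burn-train type).
struct EarlyStopper {
    best: f64,
    epochs_without_improvement: usize,
    patience: usize,
}

impl EarlyStopper {
    fn new(patience: usize) -> Self {
        Self {
            best: f64::INFINITY,
            epochs_without_improvement: 0,
            patience,
        }
    }

    /// Records one epoch's validation loss and returns true when the loss
    /// has not improved for `patience` consecutive epochs.
    fn should_stop(&mut self, valid_loss: f64) -> bool {
        if valid_loss < self.best {
            self.best = valid_loss;
            self.epochs_without_improvement = 0;
        } else {
            self.epochs_without_improvement += 1;
        }
        self.epochs_without_improvement >= self.patience
    }
}

/// Returns the 1-based epoch at which training would stop, if any.
fn stopping_epoch(losses: &[f64], patience: usize) -> Option<usize> {
    let mut stopper = EarlyStopper::new(patience);
    losses
        .iter()
        .position(|&loss| stopper.should_stop(loss))
        .map(|index| index + 1)
}
```

Only the best loss and a counter are kept, so memory use stays constant regardless of how many epochs run; storing every event, as a general in-memory logger would, is what drives up RAM.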

nathanielsimard · 2023-10-23T13:57:37Z

@L-M-Sherlock

In fact you can already enable it with the following:

builder
        .metric_loggers(InMemoryMetricLogger::default(), InMemoryMetricLogger::default())

Let me know if it helps!

dae · 2023-10-23T21:41:18Z

@L-M-Sherlock we could probably solve the file corruption issue by using a separate artifact dir for each of the training runs we invoke, but I'd suggest we try the in-memory approach first, as I'd like to avoid file I/O if possible.
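If the file-based route were taken, giving each run its own directory could look roughly like this. It is only a sketch: `artifact_dir_for_run`, `base`, and `run_index` are hypothetical names, and how fsrs-rs would actually identify a run is not specified here.

```rust
use std::path::{Path, PathBuf};

/// Builds a distinct artifact directory for each training run so that
/// concurrent runs never write to the same log file.
/// `base` and `run_index` are hypothetical parameters for illustration.
fn artifact_dir_for_run(base: &Path, run_index: usize) -> PathBuf {
    base.join(format!("run-{run_index}"))
}
```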

src/training.rs

open-spaced-repetition/fsrs-rs#109

L-M-Sherlock · 2023-10-24T02:46:39Z

It improves the model slightly: open-spaced-repetition/srs-benchmark@665a423

dae · 2023-10-24T23:51:05Z

Curiously, I get different results when I rebuild the result files on my machine:

dae@dtop:~/fsrs-benchmark% grep 'FSRS-rs ' ~/eval.1 # based on the files you checked in
FSRS-rs mean: 0.3851
FSRS-rs mean: 0.3324
FSRS-rs mean: 0.0577
dae@dtop:~/fsrs-benchmark% grep 'FSRS-rs ' ~/eval.2 # after removing the files and rebuilding them using latest burn-rs, THREADS=32 FSRS_RS=1
FSRS-rs mean: 0.3858
FSRS-rs mean: 0.3326
FSRS-rs mean: 0.0583

nathanielsimard · 2023-10-24T23:59:35Z

@dae If you use multiple threads to load the data, then the order in which the items are loaded isn't deterministic. It might explain the difference.

dae · 2023-10-25T01:30:20Z

Sorry, I probably shouldn't have mentioned THREADS - we use that for processing different files in parallel, so it shouldn't affect things. But it does appear we have some non-determinism somewhere in our code, as back-to-back runs are producing different values.

nathanielsimard · 2023-10-25T13:00:28Z

If you use multiple workers for the dataloader, you have non-determinism!

dae · 2023-10-25T23:47:22Z

We use a single worker for the dataloader. We're running multiple training runs in parallel, with separate data, and then determining the mean values of the outputs.

@L-M-Sherlock maybe it's caused by the use of a hashmap in filter_outlier?
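As background for that suspicion: Rust's default `HashMap` uses a randomly seeded hasher, so its iteration order can differ between runs, and floating-point sums depend on the order of accumulation. A minimal sketch of one way to fix the order is shown below; `sorted_keys` and `sum_in_key_order` are hypothetical helpers, not code from filter_outlier.

```rust
use std::collections::{BTreeMap, HashMap};

/// Returns the map's keys in ascending order, independent of hash seed.
fn sorted_keys(groups: &HashMap<u32, f32>) -> Vec<u32> {
    let mut keys: Vec<u32> = groups.keys().copied().collect();
    keys.sort_unstable();
    keys
}

/// Sums the values in ascending key order. A BTreeMap always iterates in
/// sorted key order, so the accumulation order is the same on every run.
fn sum_in_key_order(groups: &HashMap<u32, f32>) -> f32 {
    let ordered: BTreeMap<&u32, &f32> = groups.iter().collect();
    ordered.values().copied().sum()
}
```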

L-M-Sherlock · 2023-10-26T02:01:09Z

maybe it's caused by the use of a hashmap in filter_outlier?

It is only used during pretraining.

L-M-Sherlock added 2 commits October 23, 2023 16:29

Feat/early stop

fad0393

cargo clippy --fix

73ba294

L-M-Sherlock added the enhancement New feature or request label Oct 23, 2023

early stop based on testset

9bab383

dae reviewed Oct 23, 2023


use in-memory logger & fix typo

3012af3

L-M-Sherlock added a commit to open-spaced-repetition/srs-benchmark that referenced this pull request Oct 24, 2023

fsrs-rs early stop

665a423

open-spaced-repetition/fsrs-rs#109

L-M-Sherlock requested a review from dae October 24, 2023 02:46

dae approved these changes Oct 24, 2023

dae merged commit ce6a911 into main Oct 24, 2023
3 checks passed

dae deleted the Feat/early-stop branch October 24, 2023 03:18
