Feat/early stop #109

Merged
4 commits merged into main from Feat/early-stop
Oct 24, 2023

Conversation

L-M-Sherlock
Member

Waiting for benchmarks.

@L-M-Sherlock L-M-Sherlock added the enhancement New feature or request label Oct 23, 2023
@L-M-Sherlock
Member Author

Without .metric_valid_numeric(LossMetric::new()), early stopping is disabled:

2023-10-23T10:07:13.474593Z  WARN burn_train::learner::early_stopping: Can't find metric for early stopping.    

However, adding .metric_valid_numeric(LossMetric::new()) causes a panic:

thread '<unnamed>' panicked at /Users/jarrettye/.cargo/git/checkouts/burn-acfbee6a141c1b41/d263968/burn-train/src/logger/file.rs:26:14:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/jarrettye/.cargo/git/checkouts/burn-acfbee6a141c1b41/d263968/burn-train/src/metric/store/client.rs:35:56:
called `Result::unwrap()` on an `Err` value: SendError { .. }
thread '<unnamed>' panicked at /Users/jarrettye/.cargo/git/checkouts/burn-acfbee6a141c1b41/d263968/burn-train/src/metric/store/client.rs:142:40:
called `Result::unwrap()` on an `Err` value: SendError { .. }
stack backtrace:
   0:        0x100f3c698 - std::backtrace_rs::backtrace::libunwind::trace::h815672f5996c9e34
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x100f3c698 - std::backtrace_rs::backtrace::trace_unsynchronized::h2e094b2d8c270437
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x100f3c698 - std::sys_common::backtrace::_print_fmt::hae24b71afea95695
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x100f3c698 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6d4268b2ed62fb94
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x100f5b0b4 - core::fmt::rt::Argument::fmt::h835101b6de0df17f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:        0x100f5b0b4 - core::fmt::write::h5d55d44549819258
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:        0x100f3967c - std::io::Write::write_fmt::h1da98de0250c868b
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:        0x100f3c4d8 - std::sys_common::backtrace::_print::h5844dab5cfa39fca
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x100f3c4d8 - std::sys_common::backtrace::print::h2c300c1ebedfc73c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x100f3e0e0 - std::panicking::default_hook::{{closure}}::h0aa9be5c44269370
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:        0x100f3dd68 - std::panicking::default_hook::h2c0ef097934ee9e6
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:287:9
  11:        0x100d34f40 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb681ba0e6970f985
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  12:        0x100d34f40 - test::test_main::{{closure}}::h139556dd544e4d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/test/src/lib.rs:136:21
  13:        0x100f3e760 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h9b865254accc03e0
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  14:        0x100f3e760 - std::panicking::rust_panic_with_hook::h84c8637cb6e56008
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:711:13
  15:        0x100f3e534 - std::panicking::begin_panic_handler::{{closure}}::h25482adda06c7b7f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:599:13
  16:        0x100f3cb24 - std::sys_common::backtrace::__rust_end_short_backtrace::h0c6f3beb22324a29
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  17:        0x100f3e298 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  18:        0x100f7b978 - core::panicking::panic_fmt::h9072a0246ecafd14
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
  19:        0x100f7bca0 - core::result::unwrap_failed::hd7600d03a3086be4
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
  20:        0x100d677c4 - <burn_train::metric::store::client::EventStoreClient as core::ops::drop::Drop>::drop::hac3de293bc98f87c
  21:        0x100cc20c0 - alloc::sync::Arc<T,A>::drop_slow::h4ad631dee4a040b0
  22:        0x100ceeed0 - burn_train::learner::train_val::<impl burn_train::learner::base::Learner<LC>>::fit::h68ffa49241ff58b3
  23:        0x100cca358 - fsrs::training::train::h5bb166eb7e87e2a1
  24:        0x100cd66a4 - core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut::hdc95c68496afae6a
  25:        0x100d098cc - rayon::iter::plumbing::Folder::consume_iter::h212f6420163efdbc
  26:        0x100d08eec - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  27:        0x100c09980 - rayon_core::join::join_context::{{closure}}::h6a27a073252e59e7
  28:        0x100d08f9c - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  29:        0x100c56fd0 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb04848c2d7031caa
  30:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  31:        0x100c09364 - rayon_core::join::join_context::{{closure}}::h5a47a8a9661a6cda
  32:        0x100d08bf4 - rayon::iter::plumbing::bridge_producer_consumer::helper::h6742b0a3a0de3f6b
  33:        0x100c56da8 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hae76227955ff1fe8
  34:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  35:        0x100f13854 - rayon_core::registry::ThreadBuilder::run::hac19021355221cc4
  36:        0x100f15b74 - std::sys_common::backtrace::__rust_begin_short_backtrace::h12f5dea2e54a2c2d
  37:        0x100f18b1c - core::ops::function::FnOnce::call_once{{vtable.shim}}::hac809be33ae58ba8
  38:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hee56534d4cc78b31
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  39:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hf1a328a6507f3fa7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  40:        0x100f42344 - std::sys::unix::thread::Thread::new::thread_start::h6244aa8f646b01ac
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  41:        0x1959a7fa8 - __pthread_joiner_wake
thread '<unnamed>' panicked at library/core/src/panicking.rs:126:5:
panic in a function that cannot unwind
stack backtrace:
   0:        0x100f3c698 - std::backtrace_rs::backtrace::libunwind::trace::h815672f5996c9e34
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x100f3c698 - std::backtrace_rs::backtrace::trace_unsynchronized::h2e094b2d8c270437
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x100f3c698 - std::sys_common::backtrace::_print_fmt::hae24b71afea95695
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x100f3c698 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6d4268b2ed62fb94
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x100f5b0b4 - core::fmt::rt::Argument::fmt::h835101b6de0df17f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:        0x100f5b0b4 - core::fmt::write::h5d55d44549819258
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:        0x100f39768 - std::io::Write::write_fmt::hc515897f91abd6cf
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:        0x100f3c4d8 - std::sys_common::backtrace::_print::h5844dab5cfa39fca
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x100f3c4d8 - std::sys_common::backtrace::print::h2c300c1ebedfc73c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x100f3e0e0 - std::panicking::default_hook::{{closure}}::h0aa9be5c44269370
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:        0x100f3de0c - std::panicking::default_hook::h2c0ef097934ee9e6
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:290:9
  11:        0x100d34f40 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb681ba0e6970f985
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  12:        0x100d34f40 - test::test_main::{{closure}}::h139556dd544e4d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/test/src/lib.rs:136:21
  13:        0x100f3e760 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h9b865254accc03e0
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2021:9
  14:        0x100f3e760 - std::panicking::rust_panic_with_hook::h84c8637cb6e56008
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:711:13
  15:        0x100f3e4f8 - std::panicking::begin_panic_handler::{{closure}}::h25482adda06c7b7f
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:597:13
  16:        0x100f3cb24 - std::sys_common::backtrace::__rust_end_short_backtrace::h0c6f3beb22324a29
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  17:        0x100f3e298 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  18:        0x100f7b9a8 - core::panicking::panic_nounwind_fmt::ha4f6bbf8fe9fb5ae
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:96:14
  19:        0x100f7ba24 - core::panicking::panic_nounwind::ha756411eedbc9594
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:126:5
  20:        0x100f7ba94 - core::panicking::panic_cannot_unwind::h496677c8c27b4397
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:189:5
  21:        0x100ceeedc - burn_train::learner::train_val::<impl burn_train::learner::base::Learner<LC>>::fit::h68ffa49241ff58b3
  22:        0x100cca358 - fsrs::training::train::h5bb166eb7e87e2a1
  23:        0x100cd66a4 - core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut::hdc95c68496afae6a
  24:        0x100d098cc - rayon::iter::plumbing::Folder::consume_iter::h212f6420163efdbc
  25:        0x100d08eec - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  26:        0x100c09980 - rayon_core::join::join_context::{{closure}}::h6a27a073252e59e7
  27:        0x100d08f9c - rayon::iter::plumbing::bridge_producer_consumer::helper::h8e4caaae11adfbdb
  28:        0x100c56fd0 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb04848c2d7031caa
  29:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  30:        0x100c09364 - rayon_core::join::join_context::{{closure}}::h5a47a8a9661a6cda
  31:        0x100d08bf4 - rayon::iter::plumbing::bridge_producer_consumer::helper::h6742b0a3a0de3f6b
  32:        0x100c56da8 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hae76227955ff1fe8
  33:        0x100f778a4 - rayon_core::registry::WorkerThread::wait_until_cold::hd933cbd892a425c5
  34:        0x100f13854 - rayon_core::registry::ThreadBuilder::run::hac19021355221cc4
  35:        0x100f15b74 - std::sys_common::backtrace::__rust_begin_short_backtrace::h12f5dea2e54a2c2d
  36:        0x100f18b1c - core::ops::function::FnOnce::call_once{{vtable.shim}}::hac809be33ae58ba8
  37:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hee56534d4cc78b31
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  38:        0x100f42344 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hf1a328a6507f3fa7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  39:        0x100f42344 - std::sys::unix::thread::Thread::new::thread_start::h6244aa8f646b01ac
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  40:        0x1959a7fa8 - __pthread_joiner_wake
thread caused non-unwinding panic. aborting.
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/Users/jarrettye/Codes/open-spaced-repetition/fsrs-rs/target/release/deps/fsrs-b7858974c2ace16a` (signal: 6, SIGABRT: process abort signal)

@dae, @nathanielsimard, could you help me?

@nathanielsimard

@dae, @nathanielsimard, could you help me?

Can you verify that the file actually exists? Early stopping reads the file logged at {{artifact}}/valid/epoch-1/Loss.log. The error message is missing information; I'll try to improve that!
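
A minimal sketch of the check suggested above, using only the standard library. The path layout (artifact directory, then valid/epoch-1/Loss.log) comes from the comment; the helper names `valid_loss_log` and `read_loss_log` are hypothetical and are not part of burn-train. The idea is to return an error that names the missing path instead of panicking through `unwrap()`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds `<artifact_dir>/valid/epoch-<epoch>/Loss.log`, the location
/// mentioned in the comment above. Hypothetical helper, not burn-train API.
fn valid_loss_log(artifact_dir: &Path, epoch: usize) -> PathBuf {
    artifact_dir
        .join("valid")
        .join(format!("epoch-{epoch}"))
        .join("Loss.log")
}

/// Reads the log file. On failure, the returned error keeps the original
/// error kind and adds the path, so "No such file or directory" is actionable.
fn read_loss_log(artifact_dir: &Path, epoch: usize) -> io::Result<String> {
    let path = valid_loss_log(artifact_dir, epoch);
    fs::read_to_string(&path)
        .map_err(|e| io::Error::new(e.kind(), format!("{}: {e}", path.display())))
}
```

Calling `read_loss_log` on an artifact directory where logging is disabled would yield a `NotFound` error whose message includes the full path to Loss.log.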

@L-M-Sherlock
Member Author

L-M-Sherlock commented Oct 23, 2023

Can you verify that the file actually exists? Early stopping reads the file logged at {{artifact}}/valid/epoch-1/Loss.log.

To speed up training, dae disabled the logger. I think it's unsafe to read a file during training, and the file would also cause other errors. For example, our project trains models concurrently, so the log file would be written by multiple loggers, which causes float parsing errors.

@nathanielsimard

@L-M-Sherlock Early stopping involves using metrics collected through loggers with an event store. If you are using burn-train to collect metrics, there should be no concurrency errors, as each logger operates on its own thread and communicates through a channel. This setup makes it unlikely to slow down the training process. Parsing errors may occur, but they are generally only a concern if there is file corruption, which is rare since the file is always accessed by the same thread.
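
The thread-plus-channel arrangement described above can be sketched with the standard library alone. This is an illustration of the pattern, not burn-train's implementation: `LossEvent`, `spawn_logger`, and `demo` are hypothetical names, and the sink is a `Vec` rather than a file.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical metric event; burn-train's real event types differ.
struct LossEvent {
    epoch: usize,
    loss: f64,
}

/// Spawns a logger thread that is the only owner of the sink (here a Vec).
/// Returns the sending half and a handle that yields the collected lines.
fn spawn_logger() -> (mpsc::Sender<LossEvent>, thread::JoinHandle<Vec<String>>) {
    let (tx, rx) = mpsc::channel::<LossEvent>();
    let handle = thread::spawn(move || {
        let mut lines = Vec::new();
        // The loop ends once every sender has been dropped.
        for event in rx {
            lines.push(format!("{},{}", event.epoch, event.loss));
        }
        lines
    });
    (tx, handle)
}

/// Sends a few events from the training side, then waits for the logger.
fn demo() -> Vec<String> {
    let (tx, handle) = spawn_logger();
    for (epoch, loss) in [(1, 0.5), (2, 0.25)] {
        tx.send(LossEvent { epoch, loss }).expect("logger thread is alive");
    }
    drop(tx); // close the channel so the logger thread can finish
    handle.join().expect("logger thread panicked")
}
```

Because only the logger thread touches the sink, writes from one learner cannot interleave. In this sketch, `send` returns an error once the receiving thread has exited, which resembles the `SendError` panics that follow the file-open panic in the trace above, though the exact cause inside burn-train is not confirmed here.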

There is also an in-memory logger available, but storing all events in memory would consume a significant amount of RAM. However, it can be a viable option if you only intend to use LossMetric.

I'm going to enable that option with the builder!

@nathanielsimard

nathanielsimard commented Oct 23, 2023

@L-M-Sherlock

In fact you can already enable it with the following:

builder
        .metric_loggers(InMemoryMetricLogger::default(), InMemoryMetricLogger::default())

Let me know if it helps!
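
To illustrate why keeping only the loss history in memory is enough for early stopping, here is a minimal patience-based sketch. It is not burn-train's early-stopping strategy; `EarlyStopping`, `should_stop`, and `stop_epoch` are hypothetical names, and the rule (stop after `patience` epochs without a new best validation loss) is an assumption made for illustration.

```rust
/// Hypothetical in-memory early stopping on validation loss.
struct EarlyStopping {
    patience: usize,
    best: f64,
    epochs_without_improvement: usize,
}

impl EarlyStopping {
    fn new(patience: usize) -> Self {
        Self {
            patience,
            best: f64::INFINITY,
            epochs_without_improvement: 0,
        }
    }

    /// Records one epoch's validation loss; returns true when training
    /// should stop because the loss has not improved for `patience` epochs.
    fn should_stop(&mut self, valid_loss: f64) -> bool {
        if valid_loss < self.best {
            self.best = valid_loss;
            self.epochs_without_improvement = 0;
        } else {
            self.epochs_without_improvement += 1;
        }
        self.epochs_without_improvement >= self.patience
    }
}

/// Returns the 1-based epoch at which training would stop, if any.
fn stop_epoch(losses: &[f64], patience: usize) -> Option<usize> {
    let mut es = EarlyStopping::new(patience);
    losses.iter().position(|&l| es.should_stop(l)).map(|i| i + 1)
}
```

Only the best loss and a counter are stored here, so no file I/O is involved and the memory cost is constant per training run.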

@dae
Collaborator

dae commented Oct 23, 2023

@L-M-Sherlock we could probably solve the file corruption issue by using a separate artifact dir for each of the training runs we invoke, but I'd suggest we try the in-memory approach first, as I'd like to avoid file I/O if possible.
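
For reference, the separate-artifact-dir alternative mentioned above could look roughly like this. The `run-<index>` layout and the `create_run_dirs` name are hypothetical; the point is only that each concurrent run gets its own directory, so no two loggers write to the same file.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Creates `<base>/run-<index>` for each training run and returns the paths.
/// Hypothetical layout, chosen only for this sketch.
fn create_run_dirs(base: &Path, runs: usize) -> io::Result<Vec<PathBuf>> {
    let mut dirs = Vec::with_capacity(runs);
    for i in 0..runs {
        let dir = base.join(format!("run-{i}"));
        // create_dir_all succeeds if the directory already exists.
        fs::create_dir_all(&dir)?;
        dirs.push(dir);
    }
    Ok(dirs)
}
```

Each returned path would then be passed as the artifact directory of one training run.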

L-M-Sherlock added a commit to open-spaced-repetition/srs-benchmark that referenced this pull request Oct 24, 2023
@L-M-Sherlock
Member Author

It improves the model slightly: open-spaced-repetition/srs-benchmark@665a423

@L-M-Sherlock L-M-Sherlock requested a review from dae October 24, 2023 02:46
@dae dae merged commit ce6a911 into main Oct 24, 2023
3 checks passed
@dae dae deleted the Feat/early-stop branch October 24, 2023 03:18
@dae
Collaborator

dae commented Oct 24, 2023

Curiously, I get different results when I rebuild the result files on my machine:

dae@dtop:~/fsrs-benchmark% grep 'FSRS-rs ' ~/eval.1 # based on the files you checked in
FSRS-rs mean: 0.3851
FSRS-rs mean: 0.3324
FSRS-rs mean: 0.0577
dae@dtop:~/fsrs-benchmark% grep 'FSRS-rs ' ~/eval.2 # after removing the files and rebuilding them using latest burn-rs, THREADS=32 FSRS_RS=1
FSRS-rs mean: 0.3858
FSRS-rs mean: 0.3326
FSRS-rs mean: 0.0583

@nathanielsimard

@dae If you use multiple threads to load the data, then the order in which the items are loaded isn't deterministic. It might explain the difference.

@dae
Collaborator

dae commented Oct 25, 2023

Sorry, I probably shouldn't have mentioned THREADS - we use that for processing different files in parallel, so it shouldn't affect things. But it does appear we have some non-determinism somewhere in our code, as back-to-back runs are producing different values.

@nathanielsimard

If you use multiple workers for the dataloader, you have non-determinism!

@dae
Collaborator

dae commented Oct 25, 2023

We use a single worker for the dataloader. We're running multiple training runs in parallel, with separate data, and then determining the mean values of the outputs.

@L-M-Sherlock maybe it's caused by the use of a hashmap in filter_outlier?
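
As background for the hashmap hypothesis: Rust's standard `HashMap` uses a randomly seeded hasher by default, so its iteration order can change between runs, and floating-point addition is not associative, so accumulating values in iteration order can change the last digits of a result. Whether filter_outlier is actually affected is not established here. A minimal sketch of one way to remove that source of variation, with the hypothetical helper `deterministic_sum`:

```rust
use std::collections::HashMap;

/// Sums values in ascending key order, so the floating-point result does not
/// depend on the map's randomized iteration order.
fn deterministic_sum(map: &HashMap<u32, f64>) -> f64 {
    let mut keys: Vec<&u32> = map.keys().collect();
    keys.sort_unstable();
    keys.into_iter().map(|k| map[k]).sum()
}
```

Using a `BTreeMap` instead of a `HashMap` would give the same ordered iteration without the explicit sort.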

@L-M-Sherlock
Member Author

maybe it's caused by the use of a hashmap in filter_outlier?

It is only used during pretraining.
