
[Bug] Doris be restarts every day from 0 am to 1 am #46432

Open · 2 of 3 tasks

Ak47biubiuaaa opened this issue Jan 6, 2025 · 4 comments
Search before asking

  • I had searched in the issues and found no similar issues.

Version

2.1.7

What's Wrong?

I have three BEs, and every day between 0 am and 1 am one of them restarts.
The log is as follows; no abnormal behavior or out-of-memory condition shows up in the monitoring data.
The restart happens right after the log line at I20250106 00:01:03.237107.
log:
I20250106 00:01:02.727793 8210 query_context.cpp:188] Query bfc4d656ef82409a-b590bdb78c6f7de3 deconstructed, , deregister query/load memory tracker, queryId=bfc4d656ef82409a-b590bdb78c6f7de3, Limit=2.00 GB, CurrUsed=35.25 KB, PeakUsed=8.94 MB
I20250106 00:01:02.735662 7699 fragment_mgr.cpp:778] query_id: bbec01208ed45d5-af4e282d731740d7, coord_addr: TNetworkAddress(hostname=10.16.10.184, port=9020), total fragment num on current host: 4, fe process uuid: 1735867264753, query type: SELECT, report audit fe:TNetworkAddress(hostname=10.16.10.184, port=9020)
I20250106 00:01:02.735759 7699 fragment_mgr.cpp:819] Query/load id: bbec01208ed45d5-af4e282d731740d7, use workload group: TG[id = 1, name = normal, cpu_share = 1024, memory_limit = 8.44 GB, enable_memory_overcommit = true, version = 0, cpu_hard_limit = -1, scan_thread_num = 48, max_remote_scan_thread_num = 512, min_remote_scan_thread_num = 8, spill_low_watermark=50, spill_high_watermark=80, is_shutdown=false, query_num=3, read_bytes_per_second=-1, remote_read_bytes_per_second=-1], is pipeline: 1
I20250106 00:01:02.735778 7699 fragment_mgr.cpp:830] Register query/load memory tracker, query/load id: bbec01208ed45d5-af4e282d731740d7 limit: 0
I20250106 00:01:02.735824 7699 pipeline_x_fragment_context.cpp:207] PipelineXFragmentContext::prepare|query_id=bbec01208ed45d5-af4e282d731740d7|fragment_id=0|pthread_id=140121769277184
I20250106 00:01:02.742755 8209 fragment_mgr.cpp:730] Removing query bbec01208ed45d5-af4e282d731740d7 instance bbec01208ed45d5-af4e282d731740d8, all done? true
I20250106 00:01:02.742787 8209 fragment_mgr.cpp:730] Removing query bbec01208ed45d5-af4e282d731740d7 instance bbec01208ed45d5-af4e282d731740d9, all done? true
I20250106 00:01:02.742794 8209 fragment_mgr.cpp:730] Removing query bbec01208ed45d5-af4e282d731740d7 instance bbec01208ed45d5-af4e282d731740da, all done? true
I20250106 00:01:02.742800 8209 fragment_mgr.cpp:730] Removing query bbec01208ed45d5-af4e282d731740d7 instance bbec01208ed45d5-af4e282d731740db, all done? true
I20250106 00:01:02.742810 8209 fragment_mgr.cpp:736] Query bbec01208ed45d5-af4e282d731740d7 finished
I20250106 00:01:02.743234 8209 query_context.cpp:156] Query bbec01208ed45d5-af4e282d731740d7 deconstructed, , deregister query/load memory tracker, queryId=bbec01208ed45d5-af4e282d731740d7, Limit=2.00 GB, CurrUsed=453.50 KB, PeakUsed=9.72 MB
I20250106 00:01:02.743286 8209 query_context.cpp:188] Query bbec01208ed45d5-af4e282d731740d7 deconstructed, , deregister query/load memory tracker, queryId=bbec01208ed45d5-af4e282d731740d7, Limit=2.00 GB, CurrUsed=453.50 KB, PeakUsed=9.72 MB
I20250106 00:01:02.746565 7678 fragment_mgr.cpp:778] query_id: 20017784664247c2-855d4c4d960c8abc, coord_addr: TNetworkAddress(hostname=10.16.10.184, port=9020), total fragment num on current host: 4, fe process uuid: 1735867264753, query type: SELECT, report audit fe:TNetworkAddress(hostname=10.16.10.184, port=9020)
I20250106 00:01:02.746655 7678 fragment_mgr.cpp:819] Query/load id: 20017784664247c2-855d4c4d960c8abc, use workload group: TG[id = 1, name = normal, cpu_share = 1024, memory_limit = 8.44 GB, enable_memory_overcommit = true, version = 0, cpu_hard_limit = -1, scan_thread_num = 48, max_remote_scan_thread_num = 512, min_remote_scan_thread_num = 8, spill_low_watermark=50, spill_high_watermark=80, is_shutdown=false, query_num=3, read_bytes_per_second=-1, remote_read_bytes_per_second=-1], is pipeline: 1
I20250106 00:01:02.746672 7678 fragment_mgr.cpp:830] Register query/load memory tracker, query/load id: 20017784664247c2-855d4c4d960c8abc limit: 0
I20250106 00:01:02.746686 7678 pipeline_x_fragment_context.cpp:207] PipelineXFragmentContext::prepare|query_id=20017784664247c2-855d4c4d960c8abc|fragment_id=0|pthread_id=140121945523968
I20250106 00:01:02.761847 8207 fragment_mgr.cpp:730] Removing query 20017784664247c2-855d4c4d960c8abc instance 20017784664247c2-855d4c4d960c8abd, all done? true
I20250106 00:01:02.761895 8207 fragment_mgr.cpp:730] Removing query 20017784664247c2-855d4c4d960c8abc instance 20017784664247c2-855d4c4d960c8abe, all done? true
I20250106 00:01:02.761906 8207 fragment_mgr.cpp:730] Removing query 20017784664247c2-855d4c4d960c8abc instance 20017784664247c2-855d4c4d960c8abf, all done? true
I20250106 00:01:02.761912 8207 fragment_mgr.cpp:730] Removing query 20017784664247c2-855d4c4d960c8abc instance 20017784664247c2-855d4c4d960c8ac0, all done? true
I20250106 00:01:02.761919 8207 fragment_mgr.cpp:736] Query 20017784664247c2-855d4c4d960c8abc finished
I20250106 00:01:02.762681 8207 query_context.cpp:156] Query 20017784664247c2-855d4c4d960c8abc deconstructed, , deregister query/load memory tracker, queryId=20017784664247c2-855d4c4d960c8abc, Limit=2.00 GB, CurrUsed=81.38 KB, PeakUsed=8.32 MB
I20250106 00:01:02.762732 8207 query_context.cpp:188] Query 20017784664247c2-855d4c4d960c8abc deconstructed, , deregister query/load memory tracker, queryId=20017784664247c2-855d4c4d960c8abc, Limit=2.00 GB, CurrUsed=81.38 KB, PeakUsed=8.32 MB
I20250106 00:01:03.237107 8165 daemon.cpp:221] os physical memory 31.26 GB. process memory used 3.28 GB(= 3.69 GB[vm/rss] - 415.25 MB[tc/jemalloc_cache] + 0[reserved] + 0B[waiting_refresh]), limit 28.13 GB, soft limit 25.32 GB. sys available memory 22.96 GB(= 22.96 GB[proc/available] - 0[reserved] - 0B[waiting_refresh]), low water mark 1.56 GB, warning water mark 3.13 GB.
I20250106 00:01:09.373319 29186 doris_main.cpp:382] version doris-2.1.7-rc03(AVX2) RELEASE (build git://vm-36@443e87e)
Built on Wed, 06 Nov 2024 15:34:46 CST by vm-36
I20250106 00:01:11.235632 29186 doris_main.cpp:490] Doris backend JNI is initialized.
I20250106 00:01:11.236301 29186 mem_info.cpp:361] Physical Memory: 33565720576, BE Available Physical Memory(consider cgroup): 33565720576, Mem Limit: 28.13 GB, origin config value: 90%, System Mem Available Min Reserve: 1.56 GB, Vm Min Free KBytes: 66.00 MB, Vm Overcommit Memory: 0
I20250106 00:01:11.236337 29186 doris_main.cpp:508] Cpu Info:
Model: Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Cores: 8
Max Possible Cores: 8
L1 Cache: 32.00 KB (Line: 64.00 B)
L2 Cache: 1.00 MB (Line: 64.00 B)
L3 Cache: 30.25 MB (Line: 64.00 B)
Hardware Supports:
ssse3
sse4_1
sse4_2
popcnt
avx
avx2
Numa Nodes: 1
Numa Nodes of Cores: 0->0 | 1->0 | 2->0 | 3->0 | 4->0 | 5->0 | 6->0 | 7->0 |

What You Expected?

Help locating the cause of this anomaly.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

XLPE (Contributor) commented Jan 6, 2025

You can check the be.out log

Ak47biubiuaaa (Author)

> You can check the be.out log

*** Query id: 77d107fac80347d9-b07bc57eb6d9ea79 ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1736092862 (unix time) try "date -d @1736092862" if you are using GNU date ***
*** Current BE git commitID: 443e87e ***
*** SIGSEGV address not mapped to object (@0x2e0) received by PID 6716 (TID 8204 OR 0x7f6fbc6c3700) from PID 736; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
1# os::Linux::chained_handler(int, siginfo_t*, void*) in /opt/doris/java8/jre/lib/amd64/server/libjvm.so
2# JVM_handle_linux_signal in /opt/doris/java8/jre/lib/amd64/server/libjvm.so
3# signalHandler(int, siginfo_t*, void*) in /opt/doris/java8/jre/lib/amd64/server/libjvm.so
4# 0x00007F72D8A89400 in /lib64/libc.so.6
5# doris::vectorized::ColumnStr::insert_indices_from(doris::vectorized::IColumn const&, unsigned int const*, unsigned int const*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/columns/column_string.cpp:203
6# doris::vectorized::ColumnNullable::insert_indices_from(doris::vectorized::IColumn const&, unsigned int const*, unsigned int const*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/columns/column_nullable.cpp:309
7# doris::vectorized::MutableBlock::add_rows(doris::vectorized::Block const*, unsigned int const*, unsigned int const*, std::vector<int, std::allocator<int> > const*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/core/block.cpp:1014
8# doris::vectorized::BlockSerializer<doris::pipeline::ExchangeSinkLocalState>::next_serialized_block(doris::vectorized::Block*, doris::PBlock*, int, bool*, bool, std::vector<unsigned int, std::allocator<unsigned int> > const*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/vdata_stream_sender.cpp:888
9# doris::vectorized::PipChannel<doris::pipeline::ExchangeSinkLocalState>::add_rows(doris::vectorized::Block*, std::vector<unsigned int, std::allocator<unsigned int> > const&, bool) at /home/zcp/repo_center/doris_release/doris/be/src/vec/sink/vdata_stream_sender.h:539
10# doris::Status doris::pipeline::ExchangeSinkOperatorX::channel_add_rows_with_idx<std::vector<doris::vectorized::PipChannel<doris::pipeline::ExchangeSinkLocalState>, std::allocator<doris::vectorized::PipChannel<doris::pipeline::ExchangeSinkLocalState> > > >(doris::RuntimeState*, std::vector<doris::vectorized::PipChannel<doris::pipeline::ExchangeSinkLocalState>, std::allocator<doris::vectorized::PipChannel<doris::pipeline::ExchangeSinkLocalState> > >&, int, std::vector<std::vector<unsigned int, std::allocator<unsigned int> >, std::allocator<std::vector<unsigned int, std::allocator<unsigned int> > > >&, doris::vectorized::Block*, bool) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/exchange_sink_operator.cpp:713
11# doris::pipeline::ExchangeSinkOperatorX::sink(doris::RuntimeState*, doris::vectorized::Block*, bool) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/exec/exchange_sink_operator.cpp:586
12# doris::pipeline::PipelineXTask::execute(bool*) in /opt/doris/be/lib/doris_be
13# doris::pipeline::TaskScheduler::_do_work(unsigned long) at /home/zcp/repo_center/doris_release/doris/be/src/pipeline/task_scheduler.cpp:347
14# doris::ThreadPool::dispatch_thread() in /opt/doris/be/lib/doris_be
15# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
16# start_thread in /lib64/libpthread.so.0
17# clone in /lib64/libc.so.6

StdoutLogger 2025-01-06 00:01:09,215 Start time: Mon Jan 6 00:01:09 CST 2025
INFO: java_cmd /opt/doris/java8/bin/java
INFO: jdk_version 8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/doris/be/lib/java_extensions/preload-extensions/preload-extensions-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/doris/be/lib/java_extensions/java-udf/java-udf-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/doris/be/lib/hadoop_hdfs/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
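
The top of the stack is an insert_indices_from-style row copy in the string column. As a rough illustration only (my own simplified sketch, not the actual Doris source; the struct layout and names are assumptions), this kind of copy reads the source column through a cast to the expected concrete type, so a source column of an unexpected type, or a block that has already been freed, makes it read through a bad pointer and can fault at a small address like the 0x2e0 reported above:

// Hypothetical, simplified sketch of an insert_indices_from-style copy.
// NOT the Doris source; names and layout are assumptions for discussion.
#include <cassert>
#include <cstdint>
#include <vector>

struct IColumn {
    virtual ~IColumn() = default;
};

// A string column stored as concatenated chars plus per-row end offsets.
struct ColumnStr : IColumn {
    std::vector<uint8_t>  chars;
    std::vector<uint32_t> offsets;  // offsets[i] = end of row i in `chars`

    // Copy the rows named by [indices_begin, indices_end) from `src` into this column.
    void insert_indices_from(const IColumn& src,
                             const uint32_t* indices_begin,
                             const uint32_t* indices_end) {
        // The cast assumes `src` really is a ColumnStr with the same layout.
        // If the source is a different concrete column type (or already freed),
        // the reads below walk through unrelated memory and can fault at a
        // small offset, which is roughly what a SIGSEGV at 0x2e0 looks like.
        const auto& s = static_cast<const ColumnStr&>(src);
        for (const uint32_t* it = indices_begin; it != indices_end; ++it) {
            uint32_t row = *it;
            size_t start = row == 0 ? 0 : s.offsets[row - 1];
            size_t end   = s.offsets[row];
            chars.insert(chars.end(), s.chars.begin() + start, s.chars.begin() + end);
            offsets.push_back(static_cast<uint32_t>(chars.size()));
        }
    }
};

int main() {
    ColumnStr dst, src;
    src.chars = {'a', 'b', 'c'};
    src.offsets = {1, 3};              // rows: "a", "bc"
    uint32_t idx[] = {1, 0};
    dst.insert_indices_from(src, idx, idx + 2);
    assert(dst.offsets.size() == 2);   // copied "bc" then "a"
    return 0;
}

The real column code obviously differs (template parameters, null maps, etc.); the sketch is only meant to show which kind of reads happen at the crash site.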

Ak47biubiuaaa (Author)

> You can check the be.out log

I found a SQL execution error: (1105, 'errCode = 2, detailMessage = Backend process epoch changed, previous 1736012512920 now 1736092876420, means this be has already restarted, should cancel this coordinator,query id 77d107fac80347d9-b07bc57eb6d9ea79'). Retrying this SQL 30 seconds later succeeded.
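
(For anyone hitting the same error from scheduled jobs: below is only a rough sketch of what such a retry could look like, not what we actually run; run_query is a hypothetical stand-in for whatever client call executes the SQL, and the 30-second delay just mirrors what happened to work here.)

// Illustration of retrying a query after a "Backend process epoch changed"
// failure (surfaced as MySQL error 1105). Only the retry/delay logic is shown.
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

bool run_with_retry(const std::function<bool()>& run_query,
                    int max_attempts = 3,
                    std::chrono::seconds delay = std::chrono::seconds(30)) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (run_query()) {
            return true;  // query succeeded
        }
        if (attempt < max_attempts) {
            std::cerr << "attempt " << attempt << " failed, retrying in "
                      << delay.count() << "s\n";
            std::this_thread::sleep_for(delay);
        }
    }
    return false;  // all attempts failed
}

int main() {
    bool ok = run_with_retry([] {
        // placeholder: pretend the first run fails (e.g. error 1105) and the retry succeeds
        static int calls = 0;
        return ++calls > 1;
    }, 3, std::chrono::seconds(1));
    std::cout << (ok ? "query succeeded after retry\n" : "query failed\n");
    return 0;
}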

Ak47biubiuaaa (Author) commented Jan 6, 2025

The same SQL and scheduled tasks did not throw this exception on version 2.1.6; it has only appeared since the upgrade to 2.1.7.
