-
Notifications
You must be signed in to change notification settings - Fork 355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revelator darkmod AVX* changes #631
base: master
Are you sure you want to change the base?
Conversation
Adds some additional math used for shadow generation / interaction using Advanced Vector Extensions if Cpu supports it falls back to generic or SSE if not.
A good deal of formatting differences make these a bit larger than they have to be but i cannot do much about that. |
neo/idlib/math/Simd_AVX.cpp
Outdated
return "MMX & SSE & SSE2 & SSE3 & AVX"; | ||
} | ||
|
||
#elif defined(_MSC_VER) && defined(_M_IX86) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this code (and the AVX2 code) only enabled when using MSVC (VisualC++) on 32bit(!) x86?
Did you do any benchmarks to see how much this improves performance? |
ah damn forgot to add the enums for the new cpu types (slaps head) :/ just forgot to add them for the other targets in the middle of correcting that :) i did do a simple benchmark with vsync off in sikkmod with all the gruesome applied i got 130 fps in 4k. |
i dont currently have a linux box on hand to test for mistakes but i guess the buildbot in git does have one. |
forgot these in the last pull request doh...
and how much without these changes? |
@@ -332,7 +327,9 @@ void TestAdd( void ) { | |||
} | |||
idLib::common->Printf( "====================================\n" ); | |||
|
|||
// ====================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of all these changes?
neo/idlib/math/Simd_AVX.cpp
Outdated
@@ -35,19 +35,6 @@ If you have questions concerning this license or the applicable additional terms | |||
// | |||
//=============================================================== | |||
|
|||
#if defined(__GNUC__) && defined(__SSE3__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing all this probably won't work either, esp. on non-x86 CPUs
ALIGNTYPE16
hmm compiles here on x64 atleast on msvc but try it on a linux build and let me know if it fails. forgot to add the type for ALIGNTYPE16 its basically the same as ALIGN16 with a small difference
for msvc
for gnu and probably other unixes. without the avx code changes sikkmod could barely reach above 60 fps with the heaviest settings (was playable but it did show) |
happy i did not also add the GLSL renderer yet im a bit rusty in handling pull requests and coding in general, been retired for some time. Lets hope the last change clears up the build errors. :) |
looking at it now i see i could just have written it as |
I think these "inlining failed in call to ‘always_inline’ ‘int _mm256_movemask_ps(__m256)’: target specific option mismatch" errors mean that the source file it's in needs to be built with the corresponding CPU extension enabled, so Simd_AVX.cpp (and only that source file!) must be built with |
#define ID_MAX_ALLOCA_SIZE 1048576 // 1MB | ||
// Linux has a 8MB stack by default, and so does macOS, at least for the main thread | ||
// anyway, a 2MB limit alloca should be safe even when using it multiple times in the same function | ||
#define ID_MAX_ALLOCA_SIZE 2097152 // 2MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a specific reason you increased this to 2MB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont remember changing that ? maybe something from TDM carried over :S
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no just checked this flag does not exist there and neither anywhere else...
no idea tbh.
I recommand using the |
From a TDM perspective, I think the bulk of the AVX changes happened in 2.06 and 2.07 but it's hard to discern how much performance impact they had since we also went "Multi-Core" ( SMP ) with game code vs the renderer in 2.06. I can say that in 2.06 the 32-bit executable was at least 25% slower than the 64-bit because we hadn't yet ported SIMD optimizations to 32-bit. Though I think 32-bit lost about 10% performance by virtue of going to a Dhewm3 codebase with legacy ( Pentium 4 ) SIMD features removed. ( I know that AMD 64 players instantly pounced on 2.06 as a long awaited vindication that their hardware was always better than the Pentium 4.) So I guess if this is successfully back-ported, we should expect a 15 to 20% performance depending on the scene. ( This presumes that other engine optimizations like our new light culling system (etc), SMP, vertex buffer changes, aren't also included in here somewhere. ). Maybe Rev will open a new PR to submit our asset loading optimizations? ( SIMD vectorized mipmap generation was one of the major wins in TDM 2.10 for loading time improvements. ) |
more optimizations from TDM are definitely welcome, preferably with one PR per feature :) |
inline problem with linux and mac intrinsics i see argh... definatly need some more work for those but atleast the windows builds seem to go through now. |
ahh yes need to add the corresponding instruction set with gcc / clang. |
By the way, you can also test building with GCC on Windows by using MinGW32-w64. |
i do have msys2 with the mingw compilers on another pc. |
calling it a night for now, its getting quite late here in denmark :) |
to late actually i have to go to the dentist in an hour xD. reverting some macros for the SSE simd instructions so that it checks for the SSE flag and adding a few new ones for the AVX* ones checking for the AVX and AVX2 flag on gcc / clang since they are the ones making trouble atm. for avx / avx2 the addition of -mavx -mavx2 might also be needed for gcc / clang i reckon unless checking for those flags are enough. ill try a test build with gcc when i get back :) |
Except for shader gamma it works just fine with SDL2, might become a problem in SDL3 though unless gamma code and shader is added to the GLSL interaction routine.
optimized VBO code from michael hiney (InsideQC)
I'd prefer to have the GLSL code in a separate PR |
Added SSE intrinsics to R_LocalPointToGlobal and myGlMultiMatrix and changed name of myGlMultiMatrix to R_MatrixMultiply so it does'nt feel out of place.
neo/renderer/tr_main.cpp
Outdated
|
||
/* fast path use sse */ | ||
if ( cpuid & CPUID_SSE ) { | ||
__m128 B0x = _mm_loadu_ps( &b[0] ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this won't even compile on non-x86 (or x86_64) machines, so it must be behind some kind of #if
or #ifdef
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be fixed now havent comitted the new changes yet though (different machine so i need to setup a few things).
it does compile with mingw-w64 and clang now but i noticed i forgot to change the SSE2 macro to SSE which caused it to fail.
might need another guard for unix / mac lets see.
|
||
for ( i = 0 ; i < numPlanes ; i++ ) { | ||
frust = planes + i; | ||
d = frust->Distance( worldOrigin ); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't commit all these unrelated whitespace changes. use git gui
or similar to only commit the relevant changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry been hammering away at getting it to compile on gcc / clang.
mingw-w64 builds work now as well as windows clang builds.
there was one sorry bugger with current mingw-w64-abi the __cpuid function now uses the same syntax as msvc's so i had to change the #if defined(GNUC) && (defined(i386) || defined (x86_64)) guard into one specific for unix so the mingw builds dont use that one.
well atleast it was informal i reckon id make a new repo after getting the last build bugs out and start slowly adding in the changes, that way we can avoid all the whitespace changes from me mucking around in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now lets use it as a draft to iron out build bugs on different arch's, bit easier than having multiple operating systems
:) when it is all up and running ill retire the fork and start from a fresh one adding only the changes nessesary without the formatting. also gives you the option to chime in before we make changes that may make or break something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok seems to build on ubuntu now atleast in the one in wsl.
running it on wsl does not seem to work though the game window opens it crashes shortly after with a crash in imgui (not sure wsl is the best testing environment).
i packed up the binaries here https://sourceforge.net/projects/cbadvanced/files/Game%20Projects/dhewm3-ubuntu.7z/download
could you try them out on a real linux box maybe ?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://sourceforge.net/projects/cbadvanced/files/Game%20Projects/dhewm3-git.tar.zst/download
my current source code ignore whitespace it is not ment to be merged as is only for testing if it works :).
https://sourceforge.net/projects/cbadvanced/files/Game%20Projects/dhewm3.tar.zst/download
i forgot linux sets some flags that are probably lost when zipping it with a windows tool so heres a tarball of the game executables made in ubuntu, i used the linux dist dir in the source as install target (not sure if correct ?).
i added some test builds made with mingw-w64 and clang here -> loads of dependencies due to msys2 wanting to have everything linked to dll's xD |
do not commit!
do not commit only for testing build errors
thought i allready uploaded that one...
ok most whitespace changes are now gone in my local build tree. one question though __cpuid on mac does something odd to the cpu registers i guess that was why your version exchanged the ebx value for esi, i had to revert it partially because even though the version from darkmod worked in linux it broke when trying to compile it for mac. i hope to god once i uploaded the change to cpu.cpp that all build error's are gone so i can start seperating the different changes in a new fork. |
void R_LocalPointToGlobal( const float modelMatrix[16], const idVec3 &in, idVec3 &out ) { | ||
#if !( defined(MACOS_X) || defined(__APPLE__) ) | ||
#if !( defined(__ppc__) || defined(MACOS_X) || defined(__APPLE__) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a very good check for this purpose. It doesn't do what you want on
- x86 Macs (there it should be enabled)
- non-x86 CPUs that aren't PPC (unless they happen to run OSX)
Can't you just use #ifdef __SSE__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i newer coded for MAC before so was a bit unsure how to proceed :) ill see if just using SSE helps.
|
||
// Revelator: these work whether in gcc clang or msvc in x86 or x64 (no inline assembly used) | ||
#if defined(_MSC_VER) && ( defined(_M_X64) || defined(_M_IX86) ) || \ | ||
defined(__GNUC__) && ( defined(__i386__) || defined (__x86_64__) ) && defined(__AVX__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With GCC, this currently disabled the code because __AVX__
is only defined if -mavx
is used - which would have to be set in CMakeLists.txt, only for this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, I guess __AVX__
is only set if __i386__
or __x86_64__
is set, so those checks aren't necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i been working on some cmake changes to probe the users cpu and set the correct flags for gnu, still not ready but ill post the change when it is.
but is it only avx that needs a flag set or do we also need it for every SSE type ? like -sse -sse2 -sse3 -ssse3 -mavx -mavx2 etc. if so that would suck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i been working on some cmake changes to probe the users cpu and set the correct flags for gnu
that is not the right approach - we want binaries that work on all x86 (_64) systems supporting at least SSE or SSE2 or whatever the current baseline is.
so almost all code of dhewm3 should be compiled just like before - only the Simd_FOO.cpp
files should get their corresponding -mfoo
flag.
That's what I meant with "only for this file".
At least for x86_64 (x64), SSE and SSE2 are always available, so -msse
and -msse2
aren't needed (and neither -mmmx
).
For SSE3 it would be needed in theory, but in practice Simd_SSE3.cpp uses MSVC-specific inline-assembly that doesn't work with GCC anyway - at least the version currently in dhewm3, don't know if TDM has switched that code to use the more portable intrinsics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm maybe this could also be handled in the source files instead of CMake, like
// add this at the beginning of the source file, under the big comment block with the license
#if defined(__i386__) || defined (__x86_64__)
#ifdef __GNUC__
#pragma GCC target ("avx")
#endif
#endif // // x86 or x86_64
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#index-target-function-attribute-5 has a list of valid target arguments
not sure if Clang supports this GCC extension, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree i just thought you keept the SDL1 code for posterity or possibly some ancient machine ;)
guess it should be removed at some point.
hmm SDL_win32_main.cpp also only seems to be pulled in if you specifically request SDL1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i foresee one problem atleast with my hybrid glsl/arb renderer since SDL3 no longer supports the old hardware gamma we would need shader code for the glsl interactions for gamma as well when moving on to SDL3.
but lets get the intrinsics working first :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm SDL_win32_main.cpp also only seems to be pulled in if you specifically request SDL1.
oh right, now I remember.
yes, that atexit(cleanup_output);
line can be deleted, cleanup_output()
has been moved to win_main.cpp and atexit(cleanup_output);
is now called from SDL_main()
in win_main.cpp (for all SDL versions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I want GLSL at all at this point, but if it's added it definitely needs to support shader gamma.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normally would not be a problem but because of the way it is loaded directly in RB_ARB2_DrawInteraction it causes problems for my renderer because it does the housekeeping itself if told this one goes to arb then it goes to the arb side and vice versa but since it is loaded directly in the interaction renderer it bonks out if glsl is selected.
as an example i can run sikkmod with glsl interactions and still use all the arb effects :)
I created a short testcase for D3_PUSH_X86_TARGETS(): https://gist.github.com/DanielGibson/4e35947c548523e2ac42047a76df9239 it needs to be implemented like in that file, the code I suggested earlier didn't work |
processorString[9] = pstring[9]; | ||
processorString[10] = pstring[10]; | ||
processorString[11] = pstring[11]; | ||
processorString[12] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the point of copying pstring to processorString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure either it's from darkmod stgatilov might know ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null terminating pstring ? but there are better ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry had to get some shuteye it was allmost 2.00 in the night here :)
not a young man anymore sadly xD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for terminating some say 0 is sufficient if using C strings though i hear quite a lot prefer the special '\0' character.
tbh in a c++ project std::string would probably have been better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how a std::string would help here, but for 0-termination pstring[12] = 0;
should suffice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this code is directly from the original Doom3 GPL release: https://github.com/TTimo/doom3.gpl/blob/aaa855815ab484d5bd095f347163194ac569abcc/neo/sys/win32/win_cpu.cpp#L199-L223
still no idea why they did that, and I also think that it doesn't really matter if the users CPU is from AMD (original Doom3 used that to decide what clockrate a CPU needs to qualify for medium/high quality, but dhwem3 doesn't do that. https://github.com/TTimo/doom3.gpl/blob/aaa855815ab484d5bd095f347163194ac569abcc/neo/framework/Common.cpp#L2746-L2749)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm could be right i admit i did not look at the original code release.
not sure myself why it was done like that myself.
yeah it probably makes less sense with the check today seing as ryzen supports anything intel does, not even sure if needed at the time of 3DNow since that was more like an extension of the MMX model.
probably could be used for determining unsupported cpu types but since there are few (none) of those left and the ones left dont support playing doom3 anyway it might be prudent to remove it unless users actually want to look at cpu information in the doom3 console :)
Patch for CMakeLists.txt, so it uses (demands) SSE2 on 32bit x86 when building with GCC/Clang (it was already the default for Visual C++): diff --git a/neo/CMakeLists.txt b/neo/CMakeLists.txt
index b9805579..c97dca69 100644
--- a/neo/CMakeLists.txt
+++ b/neo/CMakeLists.txt
@@ -331,9 +331,22 @@ if(D3_COMPILER_IS_GCC_OR_CLANG)
if(NOT CMAKE_CROSSCOMPILING AND ONATIVE)
add_compile_options(-march=native)
elseif(NOT APPLE AND cpu STREQUAL "x86")
+ # TODO: set this at all? has worked for years so it should be safe, though now that
+ # we enable SSE2 by default, pentium4 would in theory be more suitable..
+ # however, according to the GCC manpage -march is kinda dangerous and the binaries
+ # may not work on any CPU besides the one specified..
add_compile_options(-march=pentium3)
endif()
+ if(cpu STREQUAL "x86")
+ # use SSE and SSE2 on 32bit x86 CPUs (on 64bit x86 CPUs they are used by default)
+ # NOTE: All our Windows release binaries of the last years have been built with VS2015
+ # or VS2017 which enable SSE2 by default for 32bit x86 and no one ever complained,
+ # so it seems like demanding SSE2 is fine (it's supported by Athlon64 and newer
+ # and Pentium4 and newer)
+ add_compile_options(-msse2)
+ endif()
+
if(FORCE_COLORED_OUTPUT)
if(CMAKE_COMPILER_IS_GNUCC)
add_compile_options (-fdiagnostics-color=always)
@@ -446,6 +459,12 @@ if(D3_COMPILER_IS_GCC_OR_CLANG)
set(sys_libs ${sys_libs} ${EXECINFO_LIBRARIES})
endif()
elseif(MSVC)
+
+ # NOTE: At least since VS2012, /arch:SSE2 is the default when targeting x86
+ # see https://learn.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2012/7t5yh4fd(v=vs.110)
+ # so don't set it here - that would be problematic anyway, because we can't detect the
+ # target CPU architecture when using MSVC, and /arch:SSE2 is invalid for ARM or even x64
+
add_compile_options(/MP) # parallel build (use all cores, or as many as configured in VS)
add_compile_options(/W3) # TODO: was /W4, caused trouble with VS2019 (and/or its integrated CMake? or only HarrieVG's setup?) |
changes for adding DarkMods AVX* based culling and cpu detection as well as the changes nessesary to use them in the renderer.