-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASTC weights SIMD encoding #298
Open
ronanbel
wants to merge
1
commit into
BinomialLLC:master
Choose a base branch
from
ronanbel:ASTC_SIMD
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Thank you - this is great. I normally shy away from merging code that I can't easily maintain, but let me see what I can do. How much does this help encoding perf.? |
ssse3 (I5 6300) : 163 => 136 ms arm (A53) : 340 => 282 ms I moved the block weight transform code in a single function : pack_astc_block_weights you can enable/disable the SIMD code with a define BASISD_ASTC_SIMD All the simd code is annotated tested x86_64 on windows, compiled with VS2019 and clang 11 tested arm & arm64 on android, compiled with latest NDK (clang11) if needed, you can get in touch at : [email protected] [email protected] fix previous issue + optimize unpack
Hi Richard.
I did a second submit today (there was an error in the first one)
with additional optimization (uastc_unpack)
on my test device (arm a53), it lasts 200ms from 340ms (base code) to
transcode the uastc to astc (the same on an A57)
it's also faster on PC
I'm using a big texture for the tests, do a CRC32 at the end to detect if
the decoding gives the same result.
If you need some other optims or modifications, don't hesitate to let me
know.
And if you have a unit-test asset (a test.basis) which has all modes &
subsets & anchor patterns, it would help me to track my modifications are
valid
have a nice day
Le jeu. 12 mai 2022 à 09:00, Rich Geldreich ***@***.***> a
écrit :
… Thank you - this is great. I normally shy away from merging code that I
can't easily maintain, but let me see what I can do. How much does this
help encoding perf.?
—
Reply to this email directly, view it on GitHub
<#298 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACOL7CCQAODQFKKYWU5QX2LVJSUBBANCNFSM5VIMSKDQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
re-hi (from home after a long day at work)
#define BASISD_ASTC_SIMD // 287 => 240
#define BASISD_UASTC_SIMD // 240 => 222 (231 without
BASISD_USE_UNALIGNED_WORD_READS)
#define BASISD_ASTC_24WRITE
here are the switches to enable most of my modifications (some like a hint
for the compiler to do a fast div3/5 are not between #if)
I'm using this version of the code on the project I'm working on, at least
on this dataset, all is fine.
I may do some other modifications (if needed, if I have some spare time
because I did most of this during the week-end)
I understand what you mean by "I can't easily maintain", I would do the
same.
All I can say is that I will continue to work in the game industry for at
least 15/20 years (probably not at Ubisoft), so I think you can ask me for
maintenance for the next decade :)
(anyway, you may leave this between #if defined and leave the user the
ability to use the simd version if he needs some speedup)
(there is not so much simd code, most added lines are here to setup tables
for further simd processing)
Le jeu. 12 mai 2022 à 20:06, ronan bel ***@***.***> a écrit :
… Hi Richard.
I did a second submit today (there was an error in the first one)
with additional optimization (uastc_unpack)
on my test device (arm a53), it lasts 200ms from 340ms (base code) to
transcode the uastc to astc (the same on an A57)
it's also faster on PC
I'm using a big texture for the tests, do a CRC32 at the end to detect if
the decoding gives the same result.
If you need some other optims or modifications, don't hesitate to let me
know.
And if you have a unit-test asset (a test.basis) which has all modes &
subsets & anchor patterns, it would help me to track my modifications are
valid
have a nice day
Le jeu. 12 mai 2022 à 09:00, Rich Geldreich ***@***.***> a
écrit :
> Thank you - this is great. I normally shy away from merging code that I
> can't easily maintain, but let me see what I can do. How much does this
> help encoding perf.?
>
> —
> Reply to this email directly, view it on GitHub
> <#298 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ACOL7CCQAODQFKKYWU5QX2LVJSUBBANCNFSM5VIMSKDQ>
> .
> You are receiving this because you authored the thread.Message ID:
> ***@***.***>
>
|
How much does this help encoding perf
sorry I didn't answer your questions (12h day at work today, I'm tired)
I don't think it helps the encoder
it's only in the transcoder
one part in the uastc_unpack
one part in the transcoding to astc (the one I needed)
all is for the runtime client code (parsing uastc, then converting the astc)
Le jeu. 12 mai 2022 à 20:51, ronan bel ***@***.***> a écrit :
… re-hi (from home after a long day at work)
#define BASISD_ASTC_SIMD // 287 => 240
#define BASISD_UASTC_SIMD // 240 => 222 (231 without
BASISD_USE_UNALIGNED_WORD_READS)
#define BASISD_ASTC_24WRITE
here are the switches to enable most of my modifications (some like a hint
for the compiler to do a fast div3/5 are not between #if)
I'm using this version of the code on the project I'm working on, at least
on this dataset, all is fine.
I may do some other modifications (if needed, if I have some spare time
because I did most of this during the week-end)
I understand what you mean by "I can't easily maintain", I would do the
same.
All I can say is that I will continue to work in the game industry for at
least 15/20 years (probably not at Ubisoft), so I think you can ask me for
maintenance for the next decade :)
(anyway, you may leave this between #if defined and leave the user the
ability to use the simd version if he needs some speedup)
(there is not so much simd code, most added lines are here to setup tables
for further simd processing)
Le jeu. 12 mai 2022 à 20:06, ronan bel ***@***.***> a écrit :
> Hi Richard.
> I did a second submit today (there was an error in the first one)
> with additional optimization (uastc_unpack)
> on my test device (arm a53), it lasts 200ms from 340ms (base code) to
> transcode the uastc to astc (the same on an A57)
> it's also faster on PC
> I'm using a big texture for the tests, do a CRC32 at the end to detect if
> the decoding gives the same result.
> If you need some other optims or modifications, don't hesitate to let me
> know.
> And if you have a unit-test asset (a test.basis) which has all modes &
> subsets & anchor patterns, it would help me to track my modifications are
> valid
>
> have a nice day
>
>
> Le jeu. 12 mai 2022 à 09:00, Rich Geldreich ***@***.***> a
> écrit :
>
>> Thank you - this is great. I normally shy away from merging code that I
>> can't easily maintain, but let me see what I can do. How much does this
>> help encoding perf.?
>>
>> —
>> Reply to this email directly, view it on GitHub
>> <#298 (comment)>,
>> or unsubscribe
>> <https://github.com/notifications/unsubscribe-auth/ACOL7CCQAODQFKKYWU5QX2LVJSUBBANCNFSM5VIMSKDQ>
>> .
>> You are receiving this because you authored the thread.Message ID:
>> ***@***.***>
>>
>
|
Akaricchi
reviewed
Apr 10, 2023
Comment on lines
+12580
to
+12581
uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), bitNum0 ); // bitMask = (1U << n) - 1U | ||
uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), bitNum1 ); // bitMask = (1U << n) - 1U |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails on GCC without -flax-vector-conversions
:
../subprojects/basis_universal/transcoder/basisu_transcoder.cpp: In function 'void basist::pack_astc_block_weights(uint8_t*, const uint8_t*, int, int)':
../subprojects/basis_universal/transcoder/basisu_transcoder.cpp:12010:80: note: use '-flax-vector-conversions' to permit conversions between vectors with differing element types or numbers of subparts
12010 | uint8x8_t rev8 = vqmovn_u16( vcombine_u16( rev8lohi, vdup_n_u8(0) ) ); // 8bits in 4 u8 (clear lower 32)
| ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
../subprojects/basis_universal/transcoder/basisu_transcoder.cpp:12010:101: error: cannot convert 'uint8x8_t' to 'uint16x4_t'
12010 | uint8x8_t rev8 = vqmovn_u16( vcombine_u16( rev8lohi, vdup_n_u8(0) ) ); // 8bits in 4 u8 (clear lower 32)
| ~~~~~~~~~^~~
| |
| uint8x8_t
An explicit cast fixes it:
Suggested change
uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), bitNum0 ); // bitMask = (1U << n) - 1U | |
uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), bitNum1 ); // bitMask = (1U << n) - 1U | |
uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), (int16x8_t)bitNum0 ); // bitMask = (1U << n) - 1U | |
uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), (int16x8_t)bitNum1 ); // bitMask = (1U << n) - 1U |
Hi, yes you can do a cast on the shift(always left) to satisfy GCC
and you can replace the vdup_n_u8(0) by a vdup_16(0), logically it's the
same (I'm only using clang & msvc)
(sorry I don't have git on this PC)
Le lun. 10 avr. 2023 à 05:26, Andrei Alexeyev ***@***.***> a
écrit :
… ***@***.**** commented on this pull request.
------------------------------
In transcoder/basisu_transcoder.cpp
<#298 (comment)>
:
> + uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), bitNum0 ); // bitMask = (1U << n) - 1U
+ uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), bitNum1 ); // bitMask = (1U << n) - 1U
This fails on GCC without -flax-vector-conversions:
../subprojects/basis_universal/transcoder/basisu_transcoder.cpp: In function 'void basist::pack_astc_block_weights(uint8_t*, const uint8_t*, int, int)':
../subprojects/basis_universal/transcoder/basisu_transcoder.cpp:12010:80: note: use '-flax-vector-conversions' to permit conversions between vectors with differing element types or numbers of subparts
12010 | uint8x8_t rev8 = vqmovn_u16( vcombine_u16( rev8lohi, vdup_n_u8(0) ) ); // 8bits in 4 u8 (clear lower 32)
| ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
../subprojects/basis_universal/transcoder/basisu_transcoder.cpp:12010:101: error: cannot convert 'uint8x8_t' to 'uint16x4_t'
12010 | uint8x8_t rev8 = vqmovn_u16( vcombine_u16( rev8lohi, vdup_n_u8(0) ) ); // 8bits in 4 u8 (clear lower 32)
| ~~~~~~~~~^~~
| |
| uint8x8_t
An explicit cast fixes it:
⬇️ Suggested change
- uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), bitNum0 ); // bitMask = (1U << n) - 1U
- uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), bitNum1 ); // bitMask = (1U << n) - 1U
+ uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), (int16x8_t)bitNum0 ); // bitMask = (1U << n) - 1U
+ uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), (int16x8_t)bitNum1 ); // bitMask = (1U << n) - 1U
—
Reply to this email directly, view it on GitHub
<#298 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACOL7CARDZELDN4VTHPMGJDXAN4YBANCNFSM5VIMSKDQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
ssse3 (I5 6300) : 163 => 136 ms
arm (A53) : 340 => 282 ms
I moved the block weight transform code in a single function : pack_astc_block_weights
you can enable/disable the SIMD code with a define
BASISD_ASTC_SIMD
All the simd code is annotated
tested x86_64 on windows, compiled with VS2019 and clang 11
tested arm & arm64 on android, compiled with latest NDK (clang11)
if needed, you can get in touch at :
[email protected]
[email protected]