Not the original author here, but my specific use case is linear RNNs. These have an initial hidden state $h_0$. When an implementation does not allow providing an initial state explicitly, as is the case for this one, $h_0$ can be supplied as the first token with gate[0] = 0. However, this shortens the remaining actual sequence to $2^{n}-1$ tokens. This constraint may seem odd to an outsider who is not familiar with the details of the underlying parallel scan implementation.
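To make the workaround concrete, here is a minimal pure-Python sketch. It assumes the recurrence is $h_t = g_t h_{t-1} + x_t$ and uses a sequential loop as a stand-in for the parallel scan; `linear_scan` and `scan_with_prepended_state` are hypothetical names for illustration, not part of this library's API.

```python
def linear_scan(gates, tokens):
    """Sequential reference for h_t = gates[t] * h_{t-1} + tokens[t], starting from 0."""
    h = 0.0
    out = []
    for g, x in zip(gates, tokens):
        h = g * h + x
        out.append(h)
    return out


def scan_with_prepended_state(gates, tokens, h0):
    """Emulate an initial state by prepending h0 as a token with gate 0."""
    # With gate 0 the first step computes 0 * h + h0 = h0,
    # so every later step sees h0 as its previous state.
    out = linear_scan([0.0] + list(gates), [h0] + list(tokens))
    return out[1:]  # drop the slot that only carried h0


gates = [0.5, 0.25, 2.0]
tokens = [1.0, 2.0, 3.0]
states = scan_with_prepended_state(gates, tokens, h0=4.0)
```

The prepended element occupies one slot of the scan, which is why a kernel limited to length $2^{n}$ leaves only $2^{n}-1$ positions for real tokens.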
I would therefore suggest two solutions for improving this situation:
1. Allow providing an initial element explicitly. From my understanding of the underlying CUDA implementation, this should be relatively easy to do.
2. Drop the power-of-two sequence-length constraint. This is probably more difficult to implement and may come with a slight performance penalty, but it would also cover more use cases, such as variable-length inference.
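Until either change lands, a caller-side workaround for arbitrary lengths is to pad the inputs up to the next power of two. The sketch below again assumes the recurrence $h_t = g_t h_{t-1} + x_t$ and uses a sequential loop in place of the real kernel; all function names are hypothetical. Padding with gate 1 and token 0 simply carries the last state forward, and because the recurrence is causal, the first `length` outputs are unaffected.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pad_to_power_of_two(gates, tokens):
    """Pad with identity steps (gate 1, token 0) up to a power-of-two length."""
    length = len(tokens)
    pad = next_power_of_two(length) - length
    return list(gates) + [1.0] * pad, list(tokens) + [0.0] * pad, length


def reference_scan(gates, tokens):
    """Sequential stand-in for the parallel scan: h_t = g_t * h_{t-1} + x_t."""
    h, out = 0.0, []
    for g, x in zip(gates, tokens):
        h = g * h + x
        out.append(h)
    return out


gates = [0.5] * 5
tokens = [1.0] * 5
padded_gates, padded_tokens, length = pad_to_power_of_two(gates, tokens)
padded_out = reference_scan(padded_gates, padded_tokens)
result = padded_out[:length]  # discard the padded positions
```

This costs up to almost twice the work in the worst case (a length just above a power of two), which is part of why native support would be preferable.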