This patch removes the clever subgroup parallel prefix computation and computes the prefix inline instead. That cuts the number of dispatches substantially and gives a ~12x speedup (2.5 fps to 30 fps on a 7900XTX, 2.1 fps to 24 fps on an Ada). Patch attached.
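For readers who have not seen the patch, here is a rough CPU-side sketch of the trade-off, not the shader code itself. It assumes each doubling step of a parallel scan costs one dispatch, and uses a Hillis-Steele scan as a stand-in for the subgroup version; the function names `prefix_multipass` and `prefix_inline` are hypothetical.

```python
def prefix_multipass(values):
    """Inclusive prefix sum, Hillis-Steele style: one pass per doubling step.

    Each pass stands in for a separate GPU dispatch (an assumption made for
    illustration; the removed shader code may be structured differently).
    Returns (prefix, number_of_passes).
    """
    data = list(values)
    passes = 0
    offset = 1
    while offset < len(data):
        # Every element at index i >= offset adds the element `offset` slots back.
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(len(data))
        ]
        offset *= 2
        passes += 1
    return data, passes


def prefix_inline(values):
    """Inclusive prefix sum computed in a single sequential loop.

    This models computing the prefix inline where it is consumed,
    with no extra passes.
    """
    running = 0
    out = []
    for v in values:
        running += v
        out.append(running)
    return out
```

Both functions return the same prefix, but the multi-pass version needs about log2(n) passes while the inline version needs none beyond the loop that consumes the result. When dispatch overhead dominates, the less parallel version can come out ahead, which is consistent with the numbers above.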