I am attempting to translate an AVX routine into CUDA, and most of the effort is very straightforward. There are, however, two pieces of this translation that elude me for lack of simple examples.

1. How do I perform arbitrary permutations of a register float variable (always of length 32)? I have seen suggestions that `__shfl_sync` will do this, but no example showing it. A numpy version of a simple case of what I want to do with a length-8 array:

```
a = some float32 array of length 8
# specific patterns will always cycle mod 4
b = a[...]
```

2. How do I merge pieces of two register floats into a single register float? In numpy, a simple example would be:

```
a = some float32 array of length 8
b = some other float32 array of length 8
# specific patterns will always cycle mod 4
```

For anyone who knows AVX intrinsics, question 1 relates to translation of `_mm256_permute_ps`, and question 2 pertains to translation of `_mm256_shuffle_ps`.

---

If your interest is to do a robust implementation of a 4x4 batched matrix inverse, I would encourage you to use CUBLAS. If this is for a learning exercise, great.

For question 1, one way to think about a 32-item float array is to have the array "spread" across a warp, one element per thread. I believe this gives the most correspondence to AVX-style processing. A permutation is then a single warp shuffle:

```
$ cat t1486.cu
...
float my_val = (float)threadIdx.x + 0.1f;
my_val = __shfl_sync(0xFFFFFFFF, my_val, pattern);
printf("warp lane: %d, val: %f\n", threadIdx.x&31, my_val);
```

If we follow that, then the only thing I can come up with for question 2 seems trivial:

```
$ cat t1487.cu
...
float my_val = my_vals[...];
```
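The mod-4 cycling described above can be sketched in numpy. The control pattern `[2, 3, 0, 1]` below is an assumed example (the actual indices were not given), chosen to mirror what `_mm256_permute_ps` and `_mm256_shuffle_ps` do within each 4-element lane:

```python
import numpy as np

a = np.arange(8, dtype=np.float32)      # [0..7]
b = np.arange(8, 16, dtype=np.float32)  # [8..15]

pattern = [2, 3, 0, 1]  # assumed mod-4 control; repeats in every 4-element group

# Question 1 (_mm256_permute_ps-like): permute a within each group of 4.
idx = np.array([base + p for base in (0, 4) for p in pattern])
permuted = a[idx]
# → [2, 3, 0, 1, 6, 7, 4, 5]

# Question 2 (_mm256_shuffle_ps-like): per 4-element group, the low two
# result elements come from a and the high two from b, selected by the
# same control.
merged = np.empty(8, dtype=np.float32)
for base in (0, 4):
    merged[base:base + 2] = a[base + np.array(pattern[:2])]
    merged[base + 2:base + 4] = b[base + np.array(pattern[2:])]
# → [2, 3, 8, 9, 6, 7, 12, 13]
```

On the GPU side, with one element per lane, the question-1 permutation corresponds to each thread calling `__shfl_sync(0xFFFFFFFF, my_val, (threadIdx.x & ~3) + pattern[threadIdx.x & 3])`.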