* support non-contiguous i32 to i32 copy
* add tests
* rename cpy_flt to cpy_scalar and reindent params