[core] use swap_tensors in group offloading where possible
#13751
+41 −44