torch.cuda.comm.reduce_add_coalesced

torch.cuda.comm.reduce_add_coalesced(inputs, destination=None, buffer_size=10485760)

Sum tensors from multiple GPUs. Small tensors are first coalesced into a buffer to reduce the number of synchronizations.

Parameters
    inputs (Iterable[Iterable[Tensor]]) – an iterable of iterables, where each inner iterable contains the tensors from a single device.
    destination (int, optional) – the device on which the output will be placed (default: current device).
    buffer_size (int) – maximum size, in bytes, of the buffer used for coalescing (default: 10485760, i.e. 10 MiB).

Returns
    A tuple of tensors containing the elementwise sum of each group of inputs, placed on the destination device.
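
The following is a minimal usage sketch, not an official example. It assumes a machine with at least two CUDA devices and skips the call otherwise. The helpers `make_inputs` and `reference_sums` are hypothetical names introduced here for illustration; only `reduce_add_coalesced` itself comes from the API described above. The sketch builds one list of tensors per device, reduces them onto device 0, and compares each result with a sum computed on the CPU.

```python
import torch
import torch.cuda.comm as comm


def make_inputs(devices):
    """Hypothetical helper: one list of tensors per device.

    Every list has the same length, and tensors at the same position
    share a shape, so they can be summed elementwise across devices.
    """
    return [
        [
            torch.full((2, 3), float(i + 1), device=dev),
            torch.full((4,), 10.0 * (i + 1), device=dev),
        ]
        for i, dev in enumerate(devices)
    ]


def reference_sums(inputs):
    """Hypothetical helper: CPU reference for the expected result.

    Sums the k-th tensor of every device list, for each position k.
    """
    return tuple(
        torch.stack([t.cpu() for t in group]).sum(dim=0)
        for group in zip(*inputs)
    )


if torch.cuda.device_count() >= 2:
    inputs = make_inputs(["cuda:0", "cuda:1"])
    # destination=0 is the device of one of the input groups.
    sums = comm.reduce_add_coalesced(inputs, destination=0)
    for got, want in zip(sums, reference_sums(inputs)):
        assert got.device == torch.device("cuda", 0)
        assert torch.equal(got.cpu(), want)
else:
    print("Skipping: this sketch needs at least two CUDA devices.")
```

In this sketch the returned tuple has one entry per position in the per-device lists, which mirrors the "each group of inputs" wording in the Returns description.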