A distributed architecture uses several separate compute entities (nodes).
Nodes can be PCs, workstations, or nodes of a cluster.
They can be connected via TCP/IP or other interconnects such as InfiniBand.
Such setups are sometimes also called heterogeneous systems.

BART uses the Message Passing Interface (MPI) [1] to work on distributed systems.

# Basic Requirements
1.) Install (Open-)MPI on all nodes
2.) Make sure the bart executable works on all nodes (compiling on each node may be necessary, especially if the nodes run different operating systems)
3.) Build bart with MPI=1
4.) Set up ssh connections between the nodes (requires an ssh server on each node as well)
5.) Set up a file share between the nodes (for reading and writing files)
6.) Run bart with mpirun and the parallel flags (bart -p)

# BART command line interface (no MPI required)
- bart -p <pflags> [-s dim0 ... dimN] [-e maxdim0 ... maxdimN] [-S] tool <args>
  - -p <pflags>: bitmask of the dimensions to loop over; these loops are parallelized
  - -s: gives the start index in each looped dimension or, if -e is not given, the slice that should be processed
  - -e: gives the end index in each looped dimension, exclusive (imagine a for loop with i < maxdim)
  - Example 1: loop in dimension 13 from item 1 to 3
    - bart -p $(bart bitmask 13) -s 1 -e 4 fft 3 example_file.ra
  - Example 2: process only slice 2 in a 3D stack
    - bart -p $(bart bitmask 3) -s 2 fft 1 example_file.ra
  - Example 3: loop in dimension 13 over all items (this example assumes 3 items in this dimension)
    - bart -p $(bart bitmask 13) -e 3 fft 1 example_file.ra
  - Example 4: loop in dimensions 12 and 13 over the index ranges [2, 3) and [3, 5)
    - bart -p $(bart bitmask 12 13) -s 2:3 -e 3:5 fft 1 example_file.ra
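
The flag values and loop ranges used in the examples above can be checked without bart. The following is a minimal sketch in plain shell; the helpers bitmask and loop_indices are hypothetical stand-ins written for illustration and are not part of bart. bitmask sets one bit per dimension index, which is the value 'bart bitmask' prints, and loop_indices follows the "i < maxdim" reading of -e.

```shell
#!/bin/sh
# Hypothetical helper (not part of bart): sets one bit per given dimension
# index, mirroring the value printed by `bart bitmask`.
bitmask() {
    mask=0
    for dim in "$@"; do
        mask=$(( mask | (1 << dim) ))
    done
    echo "$mask"
}

# Hypothetical helper: lists the indices visited in one looped dimension,
# assuming -s is the first index and -e is an exclusive upper bound.
loop_indices() {
    i=$1
    end=$2
    out=""
    while [ "$i" -lt "$end" ]; do
        out="${out:+$out }$i"
        i=$(( i + 1 ))
    done
    echo "$out"
}

bitmask 13          # dimension 13 only: prints 8192
bitmask 12 13       # dimensions 12 and 13: prints 12288
loop_indices 1 4    # -s 1 -e 4: prints "1 2 3"
```

With these values, 'bart -p 8192 ...' is equivalent to 'bart -p $(bart bitmask 13) ...'.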

# Useful commands
- [Setup ssh key]: ssh-keygen -t rsa -b 2048 -f ~/.ssh/mpi
   - generates an RSA key with a bit length of 2048 in the file ~/.ssh/mpi
- [Copy ssh key]: ssh-copy-id -i ~/.ssh/mpi.pub <user>@<node>
   - copies your public key to the account <user> on <node>; 'ssh-copy-id' may have to be installed separately
   - Advantage: you cannot accidentally share your private key
- [Setup file share over ssh]: sshfs <user>@<node>:<wd> <wd_on_localhost>
  - mounts <wd> of <node> at <wd_on_localhost>
  - <user> is the user on <node>. Note: if ssh is set up correctly, you don't need the same user on each node
- [Unmount file share over ssh]: fusermount -u <abs_path>
- [Run bart with MPI]: mpirun -n <x> --host <host>:<slots> -x <env_var> -wdir <wd> bart -p <pflags> [-s dim0 ... dimN] [-e max0 ... maxN] [-S] tool <args> : ...
   - -n <x>: number of processes to start; each process occupies one slot
   - --host <host>:<slots>: IP address or, better, ssh alias of the node; <slots> is the maximum number of slots available on it and has to be consistent with <x>
   - -x <env_var>: exports an environment variable to the nodes if needed (e.g. useful for DEBUG_LEVEL)
   - -wdir <wd>: working directory, may differ between nodes
   - bart: use the absolute path to the bart executable on each node, unless the path is the same on all nodes or the executable is in <pwd>
   - ':' separates different node configurations (repeat everything for each node with adapted parameters; if the configurations are the same, you don't have to separate them, just use --hostfile <list of compute entities>)
- [Setup symbolic link]: ln -s <source> <target>
   - Set up a symbolic link to create the same file structure on all nodes, which avoids the need for '-wdir'
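
When all nodes share the same configuration, the --hostfile variant mentioned above can be prepared as in the following dry-run sketch. It only writes a hostfile and prints the mpirun call instead of executing it. The helpers write_hostfile and print_mpirun, the node names node1 and node2, and the directory /mnt/bart_wd are hypothetical; the "name slots=N" line format is the one Open MPI uses for hostfiles.

```shell
#!/bin/sh
# Hypothetical helper: writes an Open MPI hostfile with one "name slots=N"
# line per argument of the form name:N.
write_hostfile() {
    file=$1
    shift
    : > "$file"
    for entry in "$@"; do
        echo "${entry%%:*} slots=${entry##*:}" >> "$file"
    done
}

# Hypothetical helper: prints (does not run) an mpirun call for the given
# process count, hostfile, working directory, and bart command line.
print_mpirun() {
    nprocs=$1
    hostfile=$2
    wdir=$3
    shift 3
    echo "mpirun -n $nprocs --hostfile $hostfile -wdir $wdir $*"
}

hostfile=$(mktemp)
write_hostfile "$hostfile" node1:2 node2:2   # assumed node names and slot counts
cat "$hostfile"
# 8192 is the bitmask for dimension 13; the bart arguments mirror Example 1.
print_mpirun 4 "$hostfile" /mnt/bart_wd bart -p 8192 -s 1 -e 4 fft 3 example_file.ra
rm -f "$hostfile"
```

Printing the command first makes it easy to check that the process count matches the total number of slots before anything is started on the nodes.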


# Useful files
- ~/.ssh/config:
"Host <node_name other nodes should refer to>
  User <user of this node>
  HostName <ip of other node>
  IdentityFile <private key that belongs to the copied public key>"
  - alternatively, set up '/etc/hosts' to use names for the nodes
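
A filled-in entry for ~/.ssh/config could look like the output of the sketch below. The helper ssh_config_entry, the alias node1, the user alice, and the address 192.168.0.11 are hypothetical values chosen for illustration; ~/.ssh/mpi is the key file created with ssh-keygen in "Useful commands".

```shell
#!/bin/sh
# Hypothetical helper: prints one ~/.ssh/config entry for a node.
# Arguments: alias, user on that node, IP address, key file.
ssh_config_entry() {
    printf 'Host %s\n  User %s\n  HostName %s\n  IdentityFile %s\n' \
        "$1" "$2" "$3" "$4"
}

# The key path is quoted so that "~" is written literally into the entry.
ssh_config_entry node1 alice 192.168.0.11 '~/.ssh/mpi'
```

Appending the printed entry to ~/.ssh/config lets 'ssh node1' and 'mpirun --host node1:<slots>' use the alias instead of the IP address.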



# Troubleshooting
- Q: Nothing happens at all after running 'mpirun'
  A: Check the MPI version on each node; the versions should be the same or at least share the same major version
     'mpirun --version'

- Q: "No protocol specified":
  A: This is a warning/error from ssh. It means that no protocol for X is specified

- Q: "ssh: connect to host <ip addr> port 22: Connection refused"
  A: Most likely you have to install openssh-server on host

- Q: "WARNING: Open MPI failed to TCP connect to a peer MPI process."
  A: Most likely ssh is not set up correctly; check each node connection again.
     You should be able to connect from each node to every other node

- Q: "/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33'
      /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34'"
  A: Most likely your nodes use different OS (distributions) with different versions of GLIBC.
     Compile bart on each node separately.
     Use the absolute path to this binary in your call of 'mpirun'; different nodes can be separated by ':'
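
The version check from the first question can be scripted. The sketch below compares the major versions of two 'mpirun --version' outputs. It assumes the first line of the output looks like "mpirun (Open MPI) 4.1.2"; the helpers mpi_major and same_major and the sample version strings are hypothetical. On a real setup the strings would come from running 'mpirun --version' locally and via ssh on each node.

```shell
#!/bin/sh
# Hypothetical helper: extracts the major version from `mpirun --version`
# output, assuming the first line ends with a version such as 4.1.2.
mpi_major() {
    echo "$1" | head -n 1 | awk '{ print $NF }' | cut -d. -f1
}

# Hypothetical helper: succeeds if both outputs report the same major version.
same_major() {
    [ "$(mpi_major "$1")" = "$(mpi_major "$2")" ]
}

# Sample strings for illustration only.
local_version="mpirun (Open MPI) 4.1.2"
remote_version="mpirun (Open MPI) 4.0.3"

if same_major "$local_version" "$remote_version"; then
    echo "major versions match"
else
    echo "major versions differ"
fi
```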

[1] https://mpitutorial.com/
