# Implementing HULA

## Introduction

The objective of this exercise is to implement a simplified version of
[HULA](http://web.mit.edu/anirudh/www/hula-sosr16.pdf).

In contrast to ECMP, which selects the next hop randomly, HULA load balances
flows over multiple paths to a destination ToR based on the queue occupancy of
the switches along each path. Thus, it can use the whole bisection bandwidth.
To keep the example simple, we implement it on top of the source routing
exercise.

Here is how HULA works:

- Each ToR switch generates a HULA packet to each other ToR switch to probe the
  condition of every path between the source and the destination ToR. Each HULA
  packet is forwarded to the destination ToR (forward path), collects the
  maximum queue length it observes while being forwarded, and finally delivers
  that information to the destination ToR. Based on the congestion information
  collected via probes, each destination ToR can then maintain the current best
  path (i.e., the least congested path) from each source ToR. To share the best
  path information with the source ToRs so that they can use it for new flows,
  the destination ToRs notify source ToRs of the current best path by returning
  the HULA probe back to the source ToR (reverse path), but only if the current
  best path changes. The probe packets include a HULA header and a list of
  ports for source routing. We describe the elements of the HULA header later
  (see the sketch after this list).
- In the forward path:
  - Each hop updates the queue length field in the HULA header if the local
    queue depth observed by the HULA packet is larger than the maximum queue
    depth recorded in the probe packet. Thus, when the packet reaches the
    destination ToR, the queue length field holds the maximum queue length
    observed on the forward path.
  - At the destination ToR:
    1. Find the queue length of the current best path from the source ToR.
    2. If the new path is better, update the queue length and best path, and
       return the HULA probe to the source ToR. This is done by setting the
       direction field in the HULA header and sending the packet back out of
       its ingress port.
    3. If the probe came through the current best path, the destination ToR
       just updates the stored value. This is needed to detect that the best
       path got worse and hence allow other paths to replace it later.

    It is inefficient to save the whole path ID (i.e., the sequence of switch
    IDs) and compare it in the data plane; note that P4 doesn't have a loop
    construct. Instead, we keep a 32-bit digest of the path in the HULA header.
    Each destination ToR only saves and compares the digest of the best path
    along with its queue length. The `hula.digest` field is set by the source
    ToR upon creating the HULA packet and does not change along the path.
- In the reverse path:
  - Each hop updates the "routing next hop" to the destination ToR based on the
    port on which it received the HULA packet (as it arrived on the best path).
    Then it forwards the packet to the next hop in the reverse path based on
    source routing.
  - The source ToR also drops the packet.
- Now, for each data packet:
  - Each hop hashes the flow header fields and looks into a "flow table".
  - If it doesn't find a next hop for the flow, it looks into the "routing next
    hop" to find the next hop for the destination ToR. We assume each ToR
    serves a /24 IP prefix. The switch also updates the "flow table". The flow
    table keeps the path of a flow from changing, in order to avoid packet
    re-ordering and path oscillation while next hops are being updated.
  - Otherwise, each hop just uses that next hop.
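The exact header layout is part of `hula.p4`, but for orientation, a probe
header carrying the three fields just described (direction, maximum queue
depth, and path digest) could look like the following P4_16 sketch. Apart from
the 32-bit digest mentioned above, the field widths here are illustrative
assumptions, not a specification of the skeleton.

```p4
/* A minimal sketch of a HULA probe header. Field widths other than the
   32-bit digest are assumptions for illustration. */
header hula_t {
    bit<1>  dir;     /* 0: forward path (probe), 1: reverse path (best-path update) */
    bit<15> qdepth;  /* maximum queue depth observed so far on the forward path */
    bit<32> digest;  /* digest of the source-routing port list; set by the source ToR */
}
```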
Your switch will have multiple tables, which the control plane will populate
with static rules. We have already defined the control plane rules, so you only
need to implement the data plane logic of your P4 program.

> **Spoiler alert:** There is a reference solution in the `solution`
> sub-directory. Feel free to compare your implementation to the reference.

## Step 1: Run the (incomplete) starter code

The directory with this README also contains a skeleton P4 program, `hula.p4`,
which initially drops all packets. Your job (in the next step) will be to
extend it to properly update HULA packets and forward data packets.

Before that, let's compile the incomplete `hula.p4` and bring up a switch in
Mininet to test its behavior.

1. In your shell, run:
   ```bash
   ./run.sh
   ```
   This will:
   * compile `hula.p4`, and
   * start a Mininet instance with three ToR switches (`s1`, `s2`, `s3`) and
     two spine switches (`s11`, `s22`).
   * The hosts (`h1`, `h2`, `h3`) are assigned the IPs `10.0.1.1`, `10.0.2.2`,
     and `10.0.3.3`.

2. You should now see a Mininet command prompt. Just ping `h2` from `h1`:
   ```bash
   mininet> h1 ping h2
   ```
   It doesn't work, as no path is set.

3. Type `exit` to close the Mininet command line.

The message was not received because each switch is programmed with `hula.p4`,
which drops all data packets. Your job is to extend this file.

### A note about the control plane

P4 programs define a packet-processing pipeline, but the rules governing packet
processing are inserted into the pipeline by the control plane. When a rule
matches a packet, its action is invoked with parameters supplied by the control
plane as part of the rule.

In this exercise, the control plane logic has already been implemented. As part
of bringing up the Mininet instance, the `run.sh` script will install
packet-processing rules in the tables of each switch. These are defined in the
`sX-commands.txt` files, where `X` corresponds to the switch number.

**Important:** A P4 program also defines the interface between the switch
pipeline and the control plane. The `sX-commands.txt` files contain lists of
commands for the BMv2 switch API. These commands refer to specific tables,
keys, and actions by name, and any changes in the P4 program that add or rename
tables, keys, or actions will need to be reflected in these command files.
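For orientation, the rules in the `sX-commands.txt` files use the BMv2
`simple_switch_CLI` `table_add` syntax, which names a table, an action, the
match key(s), and the action parameters. The entry below is a made-up example
for illustration only; the actual table names, keys, and values come from the
exercise's own command files.

```bash
# General form of a BMv2 simple_switch_CLI rule:
#   table_add <table name> <action name> <match fields> => <action parameters> [priority]
# Hypothetical example (table, action, and values are illustrative only):
table_add dmac set_dmac 2 => 00:00:00:00:02:02
```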
## Step 2: Implement Hula

The `hula.p4` file contains a skeleton P4 program with key pieces of logic
replaced by `TODO` comments. These should guide your implementation: replace
each `TODO` with logic implementing the missing piece.

A complete `hula.p4` will contain the following components:

1. Header type definitions for Ethernet (`ethernet_t`), HULA (`hula_t`),
   source routing (`srcRoute_t`), IPv4 (`ipv4_t`), and UDP (`udp_t`).
2. Parsers for the above headers.
3. Registers:
   - `srcindex_qdepth_reg`: at the destination ToR, saves the queue length of
     the best path from each source ToR.
   - `srcindex_digest_reg`: at the destination ToR, saves the digest of the
     best path from each source ToR.
   - `dstindex_nhop_reg`: at each hop, saves the next hop to reach each
     destination ToR.
   - `flow_port_reg`: at each hop, saves the next hop for each flow.
4. `hula_fwd` table: looks at the destination IP of a HULA packet. If this
   switch is the destination ToR, it runs the `hula_dst` action to set the
   `meta.index` field based on the source IP (source ToR). The index is used
   later to find the queue depth and digest of the current best path from that
   source ToR. Otherwise, this table just runs `srcRoute_nhop` to perform
   source routing.
5. `hula_bwd` table: on the reverse path, updates the next hop to the
   destination ToR using the `hula_set_nhop` action. The action updates the
   `dstindex_nhop_reg` register.
6. `hula_src` table: checks the source IP address of a HULA packet on the
   reverse path. If this switch is the source, this is the end of the reverse
   path, so it drops the packet. Otherwise it uses the `srcRoute_nhop` action
   to continue source routing along the reverse path.
7. `hula_nhop` table: for data packets, reads the destination IP /24 prefix to
   get an index. It uses the index to read the `dstindex_nhop_reg` register and
   get the best next hop to the destination ToR.
8. `dmac` table: just updates the Ethernet destination address based on the
   next hop.
9. An apply block with the following logic (a sketch of the data-packet branch
   appears after this list):
   * If the packet has a HULA header:
     * On the forward path (`hdr.hula.dir == 0`):
       * Apply the `hula_fwd` table to check whether this switch is the
         destination ToR.
       * If this switch is the destination ToR (the `hula_dst` action ran and
         set `meta.index` based on the source IP address):
         * Read `srcindex_qdepth_reg` to get the queue length of the current
           best path from the source ToR.
         * If the new queue length is better, update the entry in
           `srcindex_qdepth_reg` and save the path digest in
           `srcindex_digest_reg`. Then return the HULA packet to the source ToR
           by sending it to its ingress port and setting `hula.dir = 1`
           (reverse path).
         * Else, if this HULA packet came through the current best path
           (`hula.digest` is equal to the value in `srcindex_digest_reg`),
           update its queue length in `srcindex_qdepth_reg`. In this case we
           don't need to send the HULA packet back, so drop the packet.
     * On the reverse path (`hdr.hula.dir == 1`):
       * Apply `hula_bwd` to update the HULA next hop to the destination ToR.
       * Apply the `hula_src` table to drop the packet if this switch is the
         source ToR of the HULA packet.
   * If it is a data packet:
     * Compute the hash of the flow.
     * **TODO:** Read the next-hop port from `flow_port_reg` into a temporary
       variable, say `port`.
     * **TODO:** If no entry is found (`port == 0`), find the next hop by
       applying the `hula_nhop` table. Then save the value into `flow_port_reg`
       for later packets.
     * **TODO:** If it is found, save `port` into
       `standard_metadata.egress_spec` to finish routing.
     * Apply the `dmac` table to update `ethernet.dstAddr`. This is necessary
       for the links that send packets to hosts; otherwise, the host NIC will
       drop the packets.
     * Update the TTL.
10. **TODO:** An egress control that, for HULA packets on the forward path
    (`hdr.hula.dir == 0`), compares `standard_metadata.deq_qdepth` to
    `hdr.hula.qdepth` in order to keep the maximum in `hdr.hula.qdepth` (also
    sketched after this list).
11. A deparser that selects the order in which fields are inserted into the
    outgoing packet.
12. A `package` instantiation supplied with the parser, control, checksum
    verification and recomputation, and deparser.
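To make the `TODO` items above concrete, here is a rough P4_16 (v1model) sketch
of the data-packet branch of the ingress apply block and of the egress control.
Treat it as an illustration, not the reference solution: the register size,
field widths, hash inputs, the `meta.flow_hash` field (assumed here to be a
`bit<32>` metadata field), and the control name and signature are assumptions
layered on the table and register names described above.

```p4
/* --- Inside the ingress control (sketch; the skeleton already declares
       flow_port_reg, so the width and size here are only assumptions). --- */
register<bit<16>>(65536) flow_port_reg;   /* cached next hop per flow hash */

apply {
    if (hdr.hula.isValid()) {
        /* ... HULA probe processing described in items 9 above ... */
    } else if (hdr.ipv4.isValid()) {
        /* Hash the flow's header fields into an index for flow_port_reg. */
        hash(meta.flow_hash, HashAlgorithm.crc16, (bit<16>)0,
             { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.udp.srcPort,
               hdr.udp.dstPort, hdr.ipv4.protocol },
             (bit<32>)65536);

        bit<16> port;
        flow_port_reg.read(port, meta.flow_hash);
        if (port == 0) {
            /* New flow: pick the current best next hop toward the
               destination ToR, then pin the flow to that port. */
            hula_nhop.apply();
            flow_port_reg.write(meta.flow_hash,
                                (bit<16>)standard_metadata.egress_spec);
        } else {
            /* Known flow: keep using its pinned port. */
            standard_metadata.egress_spec = (bit<9>)port;
        }

        dmac.apply();
        hdr.ipv4.ttl = hdr.ipv4.ttl - 1;
    }
}

/* --- Egress control: record the largest queue depth seen by forward-path
       probes. The cast assumes the queue depth fits the header field. --- */
control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        if (hdr.hula.isValid() && hdr.hula.dir == 0) {
            if ((bit<15>)standard_metadata.deq_qdepth > hdr.hula.qdepth) {
                hdr.hula.qdepth = (bit<15>)standard_metadata.deq_qdepth;
            }
        }
    }
}
```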
## Step 3: Run your solution

1. Run Mininet as in Step 1.
2. Open a separate terminal, go to `exercises/hula`, and run
   `sudo ./generatehula.py`. This Python script makes each ToR switch generate
   one HULA probe for each other ToR, through each separate forward path. For
   example, `s1` first probes `s2` via `s11` and then via `s22`. Then `s1`
   probes `s3`, again first via `s11` and then via `s22`. `s2` does the same
   thing to probe paths to `s1` and `s3`, and so does `s3`.
3. Now run `h1 ping h2`. The ping should work if you have completed the ingress
   control block in `hula.p4`. Note that at this point every ToR considers all
   paths equal, because there isn't any congestion in the network.

Now we are going to test a more complex scenario. We first create two iperf
sessions: one from `h1` to `h3`, and the other from `h2` to `h3`. Since both
`s1` and `s2` currently think their best paths to `s3` go through `s11`, the
two connections will use the same spine switch (`s11`). Note that we throttled
the links from the spine switches to `s3` down to 1 Mbps. Hence, each of the
two connections achieves only ~512 Kbps. Let's confirm this by taking the
following steps.

1. Open a terminal window on `h1`, `h2`, and `h3`:
   ```bash
   xterm h1 h2 h3
   ```
2. Start an iperf server at `h3`:
   ```bash
   iperf -s -u -i 1
   ```
3. Run an iperf client at `h1`:
   ```bash
   iperf -c 10.0.3.3 -t 30 -u -b 2m
   ```
4. Run an iperf client at `h2`. Try to do steps 3 and 4 simultaneously.
   ```bash
   iperf -c 10.0.3.3 -t 30 -u -b 2m
   ```

While the connections are running, watch the iperf server's output at `h3`.
Although there are two completely non-overlapping paths for `h1` and `h2` to
reach `h3`, both `h1` and `h2` end up using the same spine, and hence the
aggregate throughput of the two connections is capped at 1 Mbps. You can
confirm this by watching the performance of each connection.

Our goal is to allow the two connections to use two different spine switches
and hence achieve 1 Mbps each. We can do this by first causing congestion on
one of the spines. More specifically, we'll create congestion at the queue in
`s11` facing the link `s11-to-s3` by running a long-running connection (an
elephant flow) from `s1` to `s3` through `s11`. Once the queue builds up due to
the elephant, we'll let `s2` generate HULA probes several times so that it can
learn to avoid forwarding new flows destined to `s3` through `s11`. The
following steps achieve this.

1. Open a terminal window on `h1`, `h2`, and `h3`. (By the way, if you have
   already closed Mininet, you need to re-run the Mininet test and run
   `generatehula.py` first to set up the initial routes.)
   ```bash
   xterm h1 h2 h3
   ```
2. Start an iperf server at `h3`:
   ```bash
   iperf -s -u -i 1
   ```
3. Create a long-running, full-demand connection from `h1` to `h3` through
   `s11`. You can do this by running the following at `h1`:
   ```bash
   iperf -c 10.0.3.3 -t 3000 -u -b 2m
   ```
4. Outside Mininet (in a separate terminal), go to `exercises/hula` and run the
   following several (5 to 10) times:
   ```bash
   sudo ./generatehula.py
   ```
   This should let `s2` know that the path through `s11` to `s3` is congested
   and that the best path is now through the uncongested spine, `s22`.
5. Now run an iperf client at `h2`:
   ```bash
   iperf -c 10.0.3.3 -t 30 -u -b 2m
   ```

You will be able to confirm that both iperf sessions achieve 1 Mbps because
they go through two different spines.

### Food for thought

* How can we implement flowlet routing (as opposed to flow routing), say based
  on the timestamps of packets? (A rough sketch of one approach follows this
  list.)
* In the ingress control logic, the destination ToR always sends a HULA packet
  back on the reverse path if the queue length is better. But this is not
  necessary if it came from the best path. Can you improve the code?
* The HULA packets on the congested path may get dropped or extremely delayed,
  so the destination ToR would not be aware of the worsened condition of the
  current best path. A solution could be for the destination ToR to use a
  timeout mechanism to ignore the current best path if it doesn't receive a
  HULA packet through it for a long time. How can you implement this inside the
  data plane?
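For the first question, one well-known approach (not part of this exercise's
skeleton) is flowlet switching: treat a flow as starting a new flowlet whenever
the gap since its previous packet exceeds a threshold, and only allow the
cached next hop to change at flowlet boundaries. Below is a rough P4_16
(v1model) sketch; every register, constant, and size here is invented for
illustration.

```p4
/* Illustrative only: register sizes, the gap threshold, and all names below
   are assumptions layered on the exercise, not part of its skeleton. */
const bit<48> FLOWLET_GAP = 48w50000;        /* 50 ms (BMv2 timestamps are in microseconds) */

/* Declared in the ingress control: */
register<bit<48>>(65536) flowlet_time_reg;   /* last packet time per flow hash */
register<bit<16>>(65536) flowlet_port_reg;   /* pinned port per flow hash */

/* Inside the ingress apply block, after computing meta.flow_hash as before: */
bit<48> last_seen;
bit<16> port;
flowlet_time_reg.read(last_seen, meta.flow_hash);
flowlet_port_reg.read(port, meta.flow_hash);

if (port == 0 ||
    standard_metadata.ingress_global_timestamp - last_seen > FLOWLET_GAP) {
    /* New flowlet: re-select the current best next hop and re-pin it. */
    hula_nhop.apply();
    flowlet_port_reg.write(meta.flow_hash,
                           (bit<16>)standard_metadata.egress_spec);
} else {
    /* Same flowlet: keep the pinned port to avoid reordering. */
    standard_metadata.egress_spec = (bit<9>)port;
}

/* Refresh the last-seen timestamp on every packet of the flow. */
flowlet_time_reg.write(meta.flow_hash,
                       standard_metadata.ingress_global_timestamp);
```

Because the pinned port can only change when the inter-packet gap is large,
packets within a flowlet stay on one path (no reordering), while new flowlets
are free to follow the latest best-path information.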
### Troubleshooting

There are several ways that problems might manifest:

1. `hula.p4` fails to compile. In this case, `run.sh` will report the error
   emitted from the compiler and stop.
2. `hula.p4` compiles but does not support the control plane rules in the
   `sX-commands.txt` files that `run.sh` tries to install using the BMv2 CLI.
   In this case, `run.sh` will report these errors to `stderr`. Use these error
   messages to fix your `hula.p4` implementation.
3. `hula.p4` compiles, and the control plane rules are installed, but the
   switch does not process packets in the desired way. The per-switch log files
   under `build/logs/` contain trace messages describing how each switch
   processes each packet. The output is detailed and can help pinpoint logic
   errors in your implementation. The per-interface pcap files under `build/`
   contain the packets seen on each interface; use `tcpdump -r <pcap file> -xxx`
   to print a hexdump of the packets.

#### Cleaning up Mininet

In the latter two cases above, `run.sh` may leave a Mininet instance running in
the background. Use the following command to clean up these instances:

```bash
mn -c
```

## Next Steps

Congratulations, your implementation works!