# Implementing HULA

## Introduction

The objective of this exercise is to implement a simplified version of
[HULA](http://web.mit.edu/anirudh/www/hula-sosr16.pdf).

In contrast to ECMP, which selects the next hop randomly, HULA load balances
flows over multiple paths to a destination ToR based on the queue occupancy of
the switches along each path. Thus, it can use the whole bisection bandwidth.
To keep the example simple, we implement it on top of the source routing
exercise.

Here is how HULA works:

- Each ToR switch generates a HULA packet to each other ToR switch to probe the
  condition of every path between the source and the destination ToR. Each HULA
  packet is forwarded to the destination ToR (forward path), collects the
  maximum queue length it observes while being forwarded, and finally delivers
  that information to the destination ToR. Based on the congestion information
  collected via probes, each destination ToR can then maintain the current best
  path (i.e., the least congested path) from each source ToR. To share the best
  path information with the source ToRs so that they can use it for new flows,
  the destination ToRs notify source ToRs of the current best path by returning
  the HULA probe back to the source ToR (reverse path), but only if the current
  best path changes. The probe packets include a HULA header and a list of
  ports for source routing. We describe the elements of the HULA header later
  (see the sketch after this list).
- In the forward path:
  - Each hop updates the queue length field in the HULA header if the local
    queue depth observed by the HULA packet is larger than the maximum queue
    depth recorded in the probe packet. Thus, when the packet reaches the
    destination ToR, the queue length field holds the maximum queue length
    observed on the forward path.
  - At the destination ToR:
    1. Find the queue length of the current best path from the source ToR.
    2. If the new path is better, update the queue length and best path, and
       return the HULA probe to the source ToR. This is done by setting the
       direction field in the HULA header and sending the packet back out of
       its ingress port.
    3. If the probe came through the current best path, the destination ToR
       just updates the stored value. This is needed to detect that the best
       path got worse and hence allow other paths to replace it later.

    It is inefficient to save the whole path ID (i.e., the sequence of switch
    IDs) and compare it in the data plane; note that P4 doesn't have a loop
    construct. Instead, we keep a 32-bit digest of the path in the HULA header.
    Each destination ToR only saves and compares the digest of the best path
    along with its queue length. The `hula.digest` field is set by the source
    ToR upon creating the HULA packet and does not change along the path.
- In the reverse path:
  - Each hop updates the "routing next hop" to the destination ToR based on the
    port on which it received the HULA packet (as it arrived on the best path).
    Then it forwards the packet to the next hop in the reverse path based on
    source routing.
  - The source ToR also drops the packet.
- Now, for each data packet:
  - Each hop hashes the flow header fields and looks into a "flow table".
  - If it doesn't find a next hop for the flow, it looks into the "routing next
    hop" to find the next hop for the destination ToR. We assume each ToR
    serves a /24 IP prefix. The switch also updates the "flow table". The flow
    table keeps the path of a flow from changing, in order to avoid packet
    re-ordering and path oscillation while next hops are being updated.
  - Otherwise, each hop just uses that next hop.
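The exact header layout is part of `hula.p4`, but for orientation, a probe
header carrying the three fields just described (direction, maximum queue
depth, and path digest) could look like the following P4_16 sketch. Apart from
the 32-bit digest mentioned above, the field widths here are illustrative
assumptions, not a specification of the skeleton.

```p4
/* A minimal sketch of a HULA probe header. Field widths other than the
   32-bit digest are assumptions for illustration. */
header hula_t {
    bit<1>  dir;     /* 0: forward path (probe), 1: reverse path (best-path update) */
    bit<15> qdepth;  /* maximum queue depth observed so far on the forward path */
    bit<32> digest;  /* digest of the source-routing port list; set by the source ToR */
}
```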
Your switch will have multiple tables, which the control plane will populate
with static rules. We have already defined the control plane rules, so you only
need to implement the data plane logic of your P4 program.

> **Spoiler alert:** There is a reference solution in the `solution`
> sub-directory. Feel free to compare your implementation to the reference.

## Step 1: Run the (incomplete) starter code

The directory with this README also contains a skeleton P4 program, `hula.p4`,
which initially drops all packets. Your job (in the next step) will be to
extend it to properly update HULA packets and forward data packets.

Before that, let's compile the incomplete `hula.p4` and bring up a switch in
Mininet to test its behavior.

1. In your shell, run:
   ```bash
   ./run.sh
   ```
   This will:
   * compile `hula.p4`, and
   * start a Mininet instance with three ToR switches (`s1`, `s2`, `s3`) and
     two spine switches (`s11`, `s22`).
   * The hosts (`h1`, `h2`, `h3`) are assigned the IPs `10.0.1.1`, `10.0.2.2`,
     and `10.0.3.3`.

2. You should now see a Mininet command prompt. Just ping `h2` from `h1`:
   ```bash
   mininet> h1 ping h2
   ```
   It doesn't work, as no path is set.

3. Type `exit` to close the Mininet command line.

The message was not received because each switch is programmed with `hula.p4`,
which drops all data packets. Your job is to extend this file.

### A note about the control plane

P4 programs define a packet-processing pipeline, but the rules governing packet
processing are inserted into the pipeline by the control plane. When a rule
matches a packet, its action is invoked with parameters supplied by the control
plane as part of the rule.

In this exercise, the control plane logic has already been implemented. As part
of bringing up the Mininet instance, the `run.sh` script will install
packet-processing rules in the tables of each switch. These are defined in the
`sX-commands.txt` files, where `X` corresponds to the switch number.

**Important:** A P4 program also defines the interface between the switch
pipeline and the control plane. The `sX-commands.txt` files contain lists of
commands for the BMv2 switch API. These commands refer to specific tables,
keys, and actions by name, and any changes in the P4 program that add or rename
tables, keys, or actions will need to be reflected in these command files.
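For orientation, the rules in the `sX-commands.txt` files use the BMv2
`simple_switch_CLI` `table_add` syntax, which names a table, an action, the
match key(s), and the action parameters. The entry below is a made-up example
for illustration only; the actual table names, keys, and values come from the
exercise's own command files.

```bash
# General form of a BMv2 simple_switch_CLI rule:
#   table_add <table name> <action name> <match fields> => <action parameters> [priority]
# Hypothetical example (table, action, and values are illustrative only):
table_add dmac set_dmac 2 => 00:00:00:00:02:02
```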
## Step 2: Implement Hula

The `hula.p4` file contains a skeleton P4 program with key pieces of logic
replaced by `TODO` comments. These should guide your implementation: replace
each `TODO` with logic implementing the missing piece.

A complete `hula.p4` will contain the following components:

1. Header type definitions for Ethernet (`ethernet_t`), HULA (`hula_t`),
   source routing (`srcRoute_t`), IPv4 (`ipv4_t`), and UDP (`udp_t`).
2. Parsers for the above headers.
3. Registers:
   - `srcindex_qdepth_reg`: at the destination ToR, saves the queue length of
     the best path from each source ToR.
   - `srcindex_digest_reg`: at the destination ToR, saves the digest of the
     best path from each source ToR.
   - `dstindex_nhop_reg`: at each hop, saves the next hop to reach each
     destination ToR.
   - `flow_port_reg`: at each hop, saves the next hop for each flow.
4. `hula_fwd` table: looks at the destination IP of a HULA packet. If this
   switch is the destination ToR, it runs the `hula_dst` action to set the
   `meta.index` field based on the source IP (source ToR). The index is used
   later to find the queue depth and digest of the current best path from that
   source ToR. Otherwise, this table just runs `srcRoute_nhop` to perform
   source routing.
5. `hula_bwd` table: on the reverse path, updates the next hop to the
   destination ToR using the `hula_set_nhop` action. The action updates the
   `dstindex_nhop_reg` register.
6. `hula_src` table: checks the source IP address of a HULA packet on the
   reverse path. If this switch is the source, this is the end of the reverse
   path, so it drops the packet. Otherwise it uses the `srcRoute_nhop` action
   to continue source routing along the reverse path.
7. `hula_nhop` table: for data packets, reads the destination IP /24 prefix to
   get an index. It uses the index to read the `dstindex_nhop_reg` register and
   get the best next hop to the destination ToR.
8. `dmac` table: just updates the Ethernet destination address based on the
   next hop.
9. An apply block with the following logic (a sketch of the data-packet branch
   appears after this list):
   * If the packet has a HULA header:
     * On the forward path (`hdr.hula.dir == 0`):
       * Apply the `hula_fwd` table to check whether this switch is the
         destination ToR.
       * If this switch is the destination ToR (the `hula_dst` action ran and
         set `meta.index` based on the source IP address):
         * Read `srcindex_qdepth_reg` to get the queue length of the current
           best path from the source ToR.
         * If the new queue length is better, update the entry in
           `srcindex_qdepth_reg` and save the path digest in
           `srcindex_digest_reg`. Then return the HULA packet to the source ToR
           by sending it to its ingress port and setting `hula.dir = 1`
           (reverse path).
         * Else, if this HULA packet came through the current best path
           (`hula.digest` is equal to the value in `srcindex_digest_reg`),
           update its queue length in `srcindex_qdepth_reg`. In this case we
           don't need to send the HULA packet back, so drop the packet.
     * On the reverse path (`hdr.hula.dir == 1`):
       * Apply `hula_bwd` to update the HULA next hop to the destination ToR.
       * Apply the `hula_src` table to drop the packet if this switch is the
         source ToR of the HULA packet.
   * If it is a data packet:
     * Compute the hash of the flow.
     * **TODO:** Read the next-hop port from `flow_port_reg` into a temporary
       variable, say `port`.
     * **TODO:** If no entry is found (`port == 0`), find the next hop by
       applying the `hula_nhop` table. Then save the value into `flow_port_reg`
       for later packets.
     * **TODO:** If it is found, save `port` into
       `standard_metadata.egress_spec` to finish routing.
     * Apply the `dmac` table to update `ethernet.dstAddr`. This is necessary
       for the links that send packets to hosts; otherwise, the host NIC will
       drop the packets.
     * Update the TTL.
10. **TODO:** An egress control that, for HULA packets on the forward path
    (`hdr.hula.dir == 0`), compares `standard_metadata.deq_qdepth` to
    `hdr.hula.qdepth` in order to keep the maximum in `hdr.hula.qdepth` (also
    sketched after this list).
11. A deparser that selects the order in which fields are inserted into the
    outgoing packet.
12. A `package` instantiation supplied with the parser, control, checksum
    verification and recomputation, and deparser.
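To make the `TODO` items above concrete, here is a rough P4_16 (v1model) sketch
of the data-packet branch of the ingress apply block and of the egress control.
Treat it as an illustration, not the reference solution: the register size,
field widths, hash inputs, the `meta.flow_hash` field (assumed here to be a
`bit<32>` metadata field), and the control name and signature are assumptions
layered on the table and register names described above.

```p4
/* --- Inside the ingress control (sketch; the skeleton already declares
       flow_port_reg, so the width and size here are only assumptions). --- */
register<bit<16>>(65536) flow_port_reg;   /* cached next hop per flow hash */

apply {
    if (hdr.hula.isValid()) {
        /* ... HULA probe processing described in items 9 above ... */
    } else if (hdr.ipv4.isValid()) {
        /* Hash the flow's header fields into an index for flow_port_reg. */
        hash(meta.flow_hash, HashAlgorithm.crc16, (bit<16>)0,
             { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.udp.srcPort,
               hdr.udp.dstPort, hdr.ipv4.protocol },
             (bit<32>)65536);

        bit<16> port;
        flow_port_reg.read(port, meta.flow_hash);
        if (port == 0) {
            /* New flow: pick the current best next hop toward the
               destination ToR, then pin the flow to that port. */
            hula_nhop.apply();
            flow_port_reg.write(meta.flow_hash,
                                (bit<16>)standard_metadata.egress_spec);
        } else {
            /* Known flow: keep using its pinned port. */
            standard_metadata.egress_spec = (bit<9>)port;
        }

        dmac.apply();
        hdr.ipv4.ttl = hdr.ipv4.ttl - 1;
    }
}

/* --- Egress control: record the largest queue depth seen by forward-path
       probes. The cast assumes the queue depth fits the header field. --- */
control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        if (hdr.hula.isValid() && hdr.hula.dir == 0) {
            if ((bit<15>)standard_metadata.deq_qdepth > hdr.hula.qdepth) {
                hdr.hula.qdepth = (bit<15>)standard_metadata.deq_qdepth;
            }
        }
    }
}
```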
## Step 3: Run your solution

1. Run Mininet as in Step 1.
2. Open a separate terminal, go to `exercises/hula`, and run
   `sudo ./generatehula.py`. This Python script makes each ToR switch generate
   one HULA probe for each other ToR, through each separate forward path. For
   example, `s1` first probes `s2` via `s11` and then via `s22`. Then `s1`
   probes `s3`, again first via `s11` and then via `s22`. `s2` does the same
   thing to probe paths to `s1` and `s3`, and so does `s3`.
3. Now run `h1 ping h2`. The ping should work if you have completed the ingress
   control block in `hula.p4`. Note that at this point every ToR considers all
   paths equal, because there isn't any congestion in the network.

Now we are going to test a more complex scenario. We first create two iperf
sessions: one from `h1` to `h3`, and the other from `h2` to `h3`. Since both
`s1` and `s2` currently think their best paths to `s3` go through `s11`, the
two connections will use the same spine switch (`s11`). Note that we throttled
the links from the spine switches to `s3` down to 1 Mbps. Hence, each of the
two connections achieves only ~512 Kbps. Let's confirm this by taking the
following steps.

1. Open a terminal window on `h1`, `h2`, and `h3`:
   ```bash
   xterm h1 h2 h3
   ```
2. Start an iperf server at `h3`:
   ```bash
   iperf -s -u -i 1
   ```
3. Run an iperf client at `h1`:
   ```bash
   iperf -c 10.0.3.3 -t 30 -u -b 2m
   ```
4. Run an iperf client at `h2`. Try to do steps 3 and 4 simultaneously.
   ```bash
   iperf -c 10.0.3.3 -t 30 -u -b 2m
   ```

While the connections are running, watch the iperf server's output at `h3`.
Although there are two completely non-overlapping paths for `h1` and `h2` to
reach `h3`, both `h1` and `h2` end up using the same spine, and hence the
aggregate throughput of the two connections is capped at 1 Mbps. You can
confirm this by watching the performance of each connection.

Our goal is to allow the two connections to use two different spine switches
and hence achieve 1 Mbps each. We can do this by first causing congestion on
one of the spines. More specifically, we'll create congestion at the queue in
`s11` facing the link `s11-to-s3` by running a long-running connection (an
elephant flow) from `s1` to `s3` through `s11`. Once the queue builds up due to
the elephant, we'll let `s2` generate HULA probes several times so that it can
learn to avoid forwarding new flows destined to `s3` through `s11`. The
following steps achieve this.

1. Open a terminal window on `h1`, `h2`, and `h3`. (By the way, if you have
   already closed Mininet, you need to re-run the Mininet test and run
   `generatehula.py` first to set up the initial routes.)
   ```bash
   xterm h1 h2 h3
   ```
2. Start an iperf server at `h3`:
   ```bash
   iperf -s -u -i 1
   ```
3. Create a long-running, full-demand connection from `h1` to `h3` through
   `s11`. You can do this by running the following at `h1`:
   ```bash
   iperf -c 10.0.3.3 -t 3000 -u -b 2m
   ```
4. Outside Mininet (in a separate terminal), go to `exercises/hula` and run the
   following several (5 to 10) times:
   ```bash
   sudo ./generatehula.py
   ```
   This should let `s2` know that the path through `s11` to `s3` is congested
   and that the best path is now through the uncongested spine, `s22`.
5. Now run an iperf client at `h2`:
   ```bash
   iperf -c 10.0.3.3 -t 30 -u -b 2m
   ```

You will be able to confirm that both iperf sessions achieve 1 Mbps because
they go through two different spines.

### Food for thought

* How can we implement flowlet routing (as opposed to flow routing), say based
  on the timestamps of packets? (A rough sketch of one approach follows this
  list.)
* In the ingress control logic, the destination ToR always sends a HULA packet
  back on the reverse path if the queue length is better. But this is not
  necessary if it came from the best path. Can you improve the code?
* The HULA packets on the congested path may get dropped or extremely delayed,
  so the destination ToR would not be aware of the worsened condition of the
  current best path. A solution could be for the destination ToR to use a
  timeout mechanism to ignore the current best path if it doesn't receive a
  HULA packet through it for a long time. How can you implement this inside the
  data plane?
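For the first question, one well-known approach (not part of this exercise's
skeleton) is flowlet switching: treat a flow as starting a new flowlet whenever
the gap since its previous packet exceeds a threshold, and only allow the
cached next hop to change at flowlet boundaries. Below is a rough P4_16
(v1model) sketch; every register, constant, and size here is invented for
illustration.

```p4
/* Illustrative only: register sizes, the gap threshold, and all names below
   are assumptions layered on the exercise, not part of its skeleton. */
const bit<48> FLOWLET_GAP = 48w50000;        /* 50 ms (BMv2 timestamps are in microseconds) */

/* Declared in the ingress control: */
register<bit<48>>(65536) flowlet_time_reg;   /* last packet time per flow hash */
register<bit<16>>(65536) flowlet_port_reg;   /* pinned port per flow hash */

/* Inside the ingress apply block, after computing meta.flow_hash as before: */
bit<48> last_seen;
bit<16> port;
flowlet_time_reg.read(last_seen, meta.flow_hash);
flowlet_port_reg.read(port, meta.flow_hash);

if (port == 0 ||
    standard_metadata.ingress_global_timestamp - last_seen > FLOWLET_GAP) {
    /* New flowlet: re-select the current best next hop and re-pin it. */
    hula_nhop.apply();
    flowlet_port_reg.write(meta.flow_hash,
                           (bit<16>)standard_metadata.egress_spec);
} else {
    /* Same flowlet: keep the pinned port to avoid reordering. */
    standard_metadata.egress_spec = (bit<9>)port;
}

/* Refresh the last-seen timestamp on every packet of the flow. */
flowlet_time_reg.write(meta.flow_hash,
                       standard_metadata.ingress_global_timestamp);
```

Because the pinned port can only change when the inter-packet gap is large,
packets within a flowlet stay on one path (no reordering), while new flowlets
are free to follow the latest best-path information.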
### Troubleshooting

There are several ways that problems might manifest:

1. `hula.p4` fails to compile. In this case, `run.sh` will report the error
   emitted from the compiler and stop.
2. `hula.p4` compiles but does not support the control plane rules in the
   `sX-commands.txt` files that `run.sh` tries to install using the BMv2 CLI.
   In this case, `run.sh` will report these errors to `stderr`. Use these error
   messages to fix your `hula.p4` implementation.
3. `hula.p4` compiles, and the control plane rules are installed, but the
   switch does not process packets in the desired way. The per-switch log files
   under `build/logs/` contain trace messages describing how each switch
   processes each packet. The output is detailed and can help pinpoint logic
   errors in your implementation. The per-interface pcap files under `build/`
   contain the packets seen on each interface; use `tcpdump -r <pcap file> -xxx`
   to print a hexdump of the packets.

#### Cleaning up Mininet

In the latter two cases above, `run.sh` may leave a Mininet instance running in
the background. Use the following command to clean up these instances:

```bash
mn -c
```

## Next Steps

Congratulations, your implementation works!