Implementing HULA
Introduction
The objective of this exercise is to implement a simplified version of HULA. In contrast to ECMP, which selects the next hop randomly, HULA load balances flows over multiple paths to a destination ToR based on the queue occupancy of the switches in each path. Thus, it can use the whole bisection bandwidth. To keep the example simple, we implement it on top of the source routing exercise.
Here is how HULA works:
- Each ToR switch generates a HULA packet to each other ToR switch to probe the condition of every path between the source and the destination ToR. Each HULA packet is forwarded to the destination ToR (forward path), collects the maximum queue length it observes while being forwarded, and finally delivers that information to the destination ToR. Based on the congestion information collected via probes, each destination ToR can then maintain the current best path (i.e., the least congested path) from each source ToR. To share the best path information with the source ToRs so that the sources can use it for new flows, the destination ToRs notify source ToRs of the current best path by returning the HULA probe back to the source ToR (reverse path), but only if the current best path changes. The probe packets include a HULA header and a list of ports for source routing. We describe the elements of the HULA header later (a sketch of a possible header layout follows this list).
- In the forward path:
  - Each hop updates the queue length field in the HULA header if the local queue depth observed by the HULA packet is larger than the maximum queue depth recorded in the probe packet. Thus, when the packet reaches the destination ToR, the queue length field holds the maximum queue length observed on the forward path.
  - At the destination ToR:
    - find the queue length of the current best path from the source ToR.
    - if the new path is better, update the queue length and the best path, and return the HULA probe to the source ToR. This is done by setting the direction field in the HULA header and sending the packet back out of its ingress port.
    - if the probe came through the current best path, the destination ToR just updates the existing value. This is needed to know if the best path got worse, and hence to allow other paths to replace it later. It is inefficient to save the whole path ID (i.e., the sequence of switch IDs) and compare it in the data plane; note that P4 doesn't have a loop construct. Instead, we keep a 32-bit digest of the path in the HULA header. Each destination ToR only saves and compares the digest of the best path along with its queue length. The hula.digest field is set by the source ToR upon creating the HULA packet and does not change along the path.
- In the reverse path:
  - Each hop updates the "routing next hop" to the destination ToR based on the port it received the HULA packet on (as it came over the best path). It then forwards the packet to the next hop on the reverse path based on source routing.
  - When the probe reaches the source ToR, i.e., the end of the reverse path, the source ToR drops it.
- Now, for each data packet:
  - Each hop hashes the flow header fields and looks up the result in a "flow table".
  - If it doesn't find a next hop for the flow, it looks into the "routing next hop" state to find the next hop for the destination ToR. We assume each ToR serves a /24 IP prefix. The switch also updates the "flow table". The "flow table" prevents the path of a flow from changing, in order to avoid packet re-ordering and path oscillation while next hops are being updated.
  - Otherwise, the hop just uses the next hop recorded in the "flow table".
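For concreteness, here is a minimal sketch (in P4_16) of what the probe headers could look like. The field widths and names are illustrative assumptions; the skeleton hula.p4 defines the actual layout. The key point is that the HULA header carries a direction flag, the maximum queue depth seen so far, and a 32-bit path digest, while source routing is a stack of port entries with a bottom-of-stack bit.

```p4
/* Illustrative sketch only; the skeleton's hula_t/srcRoute_t may use
 * different field widths. */
header hula_t {
    bit<1>  dir;    // 0 = forward path, 1 = reverse path
    bit<15> qdepth; // maximum queue depth observed so far on the forward path
    bit<32> digest; // path digest, set by the source ToR and never modified en route
}

header srcRoute_t {
    bit<1>  bos;    // bottom-of-stack flag for the source-routing port list
    bit<15> port;   // output port to use at this hop
}
```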
Your switch will have multiple tables, which the control plane will populate with static rules. We have already defined the control plane rules, so you only need to implement the data plane logic of your P4 program.
Spoiler alert: There is a reference solution in the solution sub-directory. Feel free to compare your implementation to the reference.
Step 1: Run the (incomplete) starter code
The directory with this README also contains a skeleton P4 program, hula.p4, which initially drops all packets. Your job (in the next step) will be to extend it to properly update HULA packets and forward data packets.
Before that, let's compile the incomplete hula.p4
and bring up a
switch in Mininet to test its behavior.
- In your shell, run:
  ./run.sh
  This will:
  - compile hula.p4, and
  - start a Mininet instance with three ToR switches (s1, s2, s3) and two spine switches (s11, s22).
  - The hosts (h1, h2, h3) are assigned the IPs 10.0.1.1, 10.0.2.2, and 10.0.3.3.
- You should now see a Mininet command prompt. Just ping h2 from h1:
  mininet> h1 ping h2
  The ping doesn't work because no path has been set up yet.
- Type exit to close the Mininet command line.
The message was not received because each switch is programmed with hula.p4, which drops all data packets. Your job is to extend this file.
A note about the control plane
P4 programs define a packet-processing pipeline, but the rules governing packet processing are inserted into the pipeline by the control plane. When a rule matches a packet, its action is invoked with parameters supplied by the control plane as part of the rule.
In this exercise, the control plane logic has already been implemented. As
part of bringing up the Mininet instance, the run.sh
script will install
packet-processing rules in the tables of each switch. These are defined in the
sX-commands.txt
files, where X
corresponds to the switch number.
Important: A P4 program also defines the interface between the switch
pipeline and control plane. The sX-commands.txt
files contain lists of
commands for the BMv2 switch API. These commands refer to specific tables,
keys, and actions by name, and any changes in the P4 program that add or rename
tables, keys, or actions will need to be reflected in these command files.
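For reference, each line in these files is a BMv2 runtime CLI command of the general form table_add <table> <action> <match key(s)> => <action parameter(s)>. The entry below is a made-up illustration (the action name, key, and parameter values are hypothetical and not copied from the provided sX-commands.txt files), showing what a rule for a table such as dmac, described in Step 2, might look like:

table_add dmac set_dmac 2 => 00:00:00:00:02:02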
Step 2: Implement HULA
The hula.p4 file contains a skeleton P4 program with key pieces of logic replaced by TODO comments. These should guide your implementation: replace each TODO with logic implementing the missing piece.
A complete hula.p4 will contain the following components:
- Header type definitions for Ethernet (ethernet_t), HULA (hula_t), source routing (srcRoute_t), IPv4 (ipv4_t), and UDP (udp_t).
- Parsers for the above headers.
- Registers:
  - srcindex_qdepth_reg: at the destination ToR, saves the queue length of the best path from each source ToR
  - srcindex_digest_reg: at the destination ToR, saves the digest of the best path from each source ToR
  - dstindex_nhop_reg: at each hop, saves the next hop to reach each destination ToR
  - flow_port_reg: at each hop, saves the next hop for each flow
- hula_fwd table: looks at the destination IP of a HULA packet. If this switch is the destination ToR, it runs the hula_dst action to set the meta.index field based on the source IP (the source ToR). The index is used later to find the queue depth and digest of the current best path from that source ToR. Otherwise, this table just runs srcRoute_nhop to perform source routing.
- hula_bwd table: on the reverse path, updates the next hop to the destination ToR using the hula_set_nhop action. The action updates the dstindex_nhop_reg register.
- hula_src table: checks the source IP address of a HULA packet on the reverse path. If this switch is the source, this is the end of the reverse path, so the packet is dropped. Otherwise it uses the srcRoute_nhop action to continue source routing on the reverse path.
- hula_nhop table: for data packets, reads the destination IP /24 prefix to get an index. It uses the index to read the dstindex_nhop_reg register and get the best next hop to the destination ToR.
- dmac table: just updates the Ethernet destination address based on the next hop.
- An apply block that has the following logic:
  - If the packet has a HULA header:
    - On the forward path (hdr.hula.dir == 0):
      - Apply the hula_fwd table to check whether this switch is the destination ToR.
      - If this switch is the destination ToR (the hula_dst action ran and set meta.index based on the source IP address):
        - read srcindex_qdepth_reg to get the queue length of the current best path from the source ToR
        - if the new queue length is better, update the entry in srcindex_qdepth_reg and save the path digest in srcindex_digest_reg. Then return the HULA packet to the source ToR by sending it to its ingress port and setting hula.dir = 1 (reverse path)
        - else, if this HULA packet came through the current best path (hula.digest is equal to the value in srcindex_digest_reg), update its queue length in srcindex_qdepth_reg. In this case we don't need to send the HULA packet back, so drop the packet.
    - On the reverse path (hdr.hula.dir == 1):
      - apply hula_bwd to update the HULA next hop to the destination ToR
      - apply the hula_src table to drop the packet if this switch is the source ToR of the HULA packet
  - If it is a data packet (a sketch of this part follows this list):
    - compute the hash of the flow
    - TODO: read the next-hop port from flow_port_reg into a temporary variable, say port.
    - TODO: if no entry is found (port == 0), find the next hop by applying the hula_nhop table. Then save the value into flow_port_reg for later packets.
    - TODO: if an entry is found, save port into standard_metadata.egress_spec to finish routing.
    - apply the dmac table to update ethernet.dstAddr. This is necessary for the links that send packets to hosts; otherwise their NICs will drop packets.
    - update the TTL
- TODO: An egress control that, for HULA packets on the forward path (hdr.hula.dir == 0), compares standard_metadata.deq_qdepth to hdr.hula.qdepth in order to keep the maximum in hdr.hula.qdepth (a sketch follows this list).
- A deparser that selects the order in which fields are inserted into the outgoing packet.
- A package instantiation supplied with the parser, controls, checksum verification and recomputation, and the deparser.
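The sketch below shows one way the data-packet TODOs in the apply block could be filled in, assuming the v1model architecture, a meta.flow_hash metadata field, a flow_port_reg register of 16-bit values, and a hula_nhop action that writes the chosen port into standard_metadata.egress_spec. Names, widths, register sizes, and the choice of hashed fields are assumptions; adapt them to the skeleton's definitions.

```p4
/* Sketch only: assumes meta.flow_hash (bit<16>), flow_port_reg holding
 * bit<16> values, and that hula_nhop's action sets egress_spec. */
bit<16> port;
// Hash the flow identifier; the exact fields hashed here are illustrative.
hash(meta.flow_hash, HashAlgorithm.crc16, (bit<16>)0,
     { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.udp.srcPort, hdr.udp.dstPort },
     (bit<32>)65536);

flow_port_reg.read(port, (bit<32>)meta.flow_hash);
if (port == 0) {
    // First packet of the flow: choose the current best next hop to the
    // destination ToR and remember it for later packets of this flow.
    hula_nhop.apply();
    flow_port_reg.write((bit<32>)meta.flow_hash,
                        (bit<16>)standard_metadata.egress_spec);
} else {
    // Known flow: stick to the recorded port to avoid packet reordering.
    standard_metadata.egress_spec = (bit<9>)port;
}
dmac.apply();
hdr.ipv4.ttl = hdr.ipv4.ttl - 1;
```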
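And here is a hedged sketch of the egress control, assuming the v1model architecture (where the queue depth at dequeue time is exposed as standard_metadata.deq_qdepth, a 19-bit value) and control/struct names of MyEgress, headers, and metadata, which may differ from those in the skeleton:

```p4
control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {
        // Only probe packets on the forward path record congestion.
        if (hdr.hula.isValid() && hdr.hula.dir == 0) {
            // Keep the maximum queue depth seen along the path in the probe.
            if ((bit<32>)standard_metadata.deq_qdepth > (bit<32>)hdr.hula.qdepth) {
                hdr.hula.qdepth = (bit<15>)standard_metadata.deq_qdepth;
            }
        }
    }
}
```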
Step 3: Run your solution
- Run Mininet the same way as in Step 1.
- Open a separate terminal, go to exercises/hula, and run sudo ./generatehula.py. This Python script makes each ToR switch generate one HULA probe for each other ToR, through each separate forward path. For example, s1 first probes s2 via s11 and then via s22. Then s1 probes s3, again first via s11 and then via s22. s2 does the same thing to probe paths to s1 and s3, and so does s3.
. -
Now run
h1 ping h2
. The ping should work if you have completed the ingress control block inhula.p4
. Note at this point, every ToR considers all paths are equal because there isn't any congestion in the network.
Now we are going to test a more complex scenario. We first create two iperf sessions: one from h1 to h3, and the other from h2 to h3. Since both s1 and s2 currently think their best paths to s3 should go through s11, the two connections will use the same spine switch (s11). Note that we throttled the links from the spine switches to s3 down to 1Mbps. Hence, each of the two connections achieves only ~512Kbps. Let's confirm this by taking the following steps.
1. Open a terminal window on h1, h2 and h3:
   xterm h1 h2 h3
2. Start an iperf server at h3:
   iperf -s -u -i 1
3. Run an iperf client at h1:
   iperf -c 10.0.3.3 -t 30 -u -b 2m
4. Run an iperf client at h2. Try to do steps 3 and 4 simultaneously:
   iperf -c 10.0.3.3 -t 30 -u -b 2m
While the connections are running, watch the iperf server's output at h3. Although there are two completely non-overlapping paths for h1 and h2 to reach h3, both h1 and h2 end up using the same spine, and hence the aggregate throughput of the two connections is capped at 1Mbps. You can confirm this by watching the performance of each connection.
Our goal is to allow the two connections to use two different spine switches and hence achieve 1Mbps each. We can do this by first causing congestion on one of the spines. More specifically, we'll create congestion at the queue in s11 facing the link s11-to-s3 by running a long-running connection (an elephant flow) from s1 to s3 through s11. Once the queue builds up due to the elephant flow, we'll let s2 generate HULA probes several times so that it can learn to avoid forwarding new flows destined to s3 through s11. The following steps achieve this.
1. Open a terminal window on h1, h2 and h3. (By the way, if you have already closed Mininet, you need to re-run the Mininet test and run generatehula.py first to set up the initial routes.)
   xterm h1 h2 h3
2. Start an iperf server at h3:
   iperf -s -u -i 1
3. Create a long-running full-demand connection from h1 to h3 through s11. You can do this by running the following at h1:
   iperf -c 10.0.3.3 -t 3000 -u -b 2m
4. Outside Mininet (in a separate terminal), go to exercises/hula and run the following several (5 to 10) times:
   sudo ./generatehula.py
   This should let s2 know that the path through s11 to s3 is congested and that the best path is now through the uncongested spine, s22.
5. Now, run an iperf client at h2:
   iperf -c 10.0.3.3 -t 30 -u -b 2m
You will be able to confirm both iperf sessions achieve 1Mbps because they go through two different spines.
Food for thought
- How can we implement flowlet routing (as opposed to flow routing), say based on the timestamp of packets? (One possible direction is sketched below.)
- In the ingress control logic, the destination ToR always sends a HULA packet back on the reverse path if the queue length is better. But this is not necessary if it came through the best path. Can you improve the code?
- The HULA packets on the congested path may get dropped or extremely delayed, so the destination ToR would not be aware of the worsened condition of the current best path. A solution could be for the destination ToR to use a timeout mechanism that ignores the current best path if it doesn't receive a HULA packet through it for a long time. How can you implement this inside the data plane?
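For the first question, one possible direction (a sketch only, building on the data-packet sketch from Step 2) is to keep a per-flow last-seen timestamp in an additional register, here called flow_time_reg, and re-select the path whenever the gap between packets exceeds a flowlet threshold. The register name, the threshold value, and the use of ingress_global_timestamp (a 48-bit microsecond timestamp in v1model) are assumptions, not part of the provided skeleton:

```p4
/* Sketch only: flow_time_reg is a hypothetical register<bit<48>> indexed by
 * the same flow hash as flow_port_reg; port is the variable from the
 * data-packet sketch in Step 2. */
bit<48> last_seen;
flow_time_reg.read(last_seen, (bit<32>)meta.flow_hash);
bit<48> now = standard_metadata.ingress_global_timestamp;

if (port == 0 || now - last_seen > 48w50000) {  // ~50 ms gap, illustrative
    // New flow or new flowlet: re-select the current best path.
    hula_nhop.apply();
    flow_port_reg.write((bit<32>)meta.flow_hash,
                        (bit<16>)standard_metadata.egress_spec);
} else {
    // Same flowlet: keep using the recorded port.
    standard_metadata.egress_spec = (bit<9>)port;
}
flow_time_reg.write((bit<32>)meta.flow_hash, now);
```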
Troubleshooting
There are several ways that problems might manifest:
- hula.p4 fails to compile. In this case, run.sh will report the error emitted from the compiler and stop.
- hula.p4 compiles but does not support the control plane rules in the sX-commands.txt files that run.sh tries to install using the BMv2 CLI. In this case, run.sh will report these errors to stderr. Use these error messages to fix your hula.p4 implementation.
- hula.p4 compiles, and the control plane rules are installed, but the switch does not process packets in the desired way. The build/logs/<switch-name>.log files contain trace messages describing how each switch processes each packet. The output is detailed and can help pinpoint logic errors in your implementation. The build/<switch-name>-<interface-name>.pcap files also contain the pcap of packets on each interface. Use tcpdump -r <filename> -xxx to print the hexdump of the packets.
Cleaning up Mininet
In the latter two cases above, run.sh
may leave a Mininet instance running in
the background. Use the following command to clean up these instances:
mn -c
Next Steps
Congratulations, your implementation works!