Implementing HULA

Introduction

The objective of this exercise is to implement a simplified version of HULA. In contrast to ECMP, which selects the next hop randomly, HULA load-balances flows over multiple paths to a destination ToR based on the queue occupancy of the switches along each path. Thus, it can use the whole bisection bandwidth. To keep the example simple, we implement it on top of the source routing exercise.

Here is how HULA works:

  • Each ToR switch generates a HULA packet to every other ToR switch to probe the condition of each path between the source and the destination ToR. Each HULA packet is forwarded to the destination ToR (the forward path), collects the maximum queue length it observes while being forwarded, and finally delivers that information to the destination ToR. Based on the congestion information collected via probes, each destination ToR can then maintain the current best path (i.e., the least congested path) from each source ToR. To share the best-path information with the source ToRs, so that they can use it for new flows, a destination ToR returns the HULA probe to the source ToR (the reverse path), but only if the current best path changes. The probe packets include a HULA header and a list of ports for source routing. We describe the elements of the HULA header later (a sketch also appears after this list).
  • On the forward path:
    • Each hop updates the queue length field in the HULA header if the local queue depth observed by the HULA packet is larger than the maximum queue depth already recorded in the probe. Thus, when the packet reaches the destination ToR, the queue length field holds the maximum queue length observed on the forward path.
    • At the destination ToR:
      1. Find the queue length of the current best path from the source ToR.
      2. If the new path is better, update the stored queue length and best path, and return the HULA probe to the source ToR. This is done by setting the direction field in the HULA header and sending the packet back out its ingress port.
      3. If the probe came through the current best path, the destination ToR just updates the stored value. This is needed to detect when the best path gets worse and hence allow other paths to replace it later. It would be inefficient to save the whole path ID (i.e., the sequence of switch IDs) and compare it in the data plane; note that P4 has no loop construct. Instead, we keep a 32-bit digest of the path in the HULA header. Each destination ToR saves and compares only the digest of the best path along with its queue length. The hula.digest field is set by the source ToR when it creates the HULA packet and does not change along the path.
  • On the reverse path:
    • Each hop updates the "routing next hop" toward the destination ToR based on the port on which it received the HULA packet (as the packet arrived along the best path). It then forwards the packet to the next hop on the reverse path based on source routing.
    • The source ToR drops the packet at the end of the reverse path.
  • Now, for each data packet:
    • Each hop hashes the flow header fields and looks into a "flow table".
    • If it doesn't find a next hop for the flow, it looks into the "routing next hop" to find the next hop for the destination ToR (we assume each ToR serves a /24 IP prefix) and updates the "flow table". The "flow table" keeps the path of a flow from changing, in order to avoid packet re-ordering and path oscillation while next hops are being updated.
    • Otherwise, each hop just uses the next hop found in the "flow table".
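
For concreteness, the probe headers could look roughly as follows. This is an illustrative sketch only; confirm the actual field names and widths against the definitions in the hula.p4 skeleton.

    /* Illustrative sketch; field names and widths are assumptions.      */
    header srcRoute_t {
        bit<1>  bos;      /* bottom-of-stack marker for the port list    */
        bit<15> port;     /* egress port to take at this hop             */
    }

    header hula_t {
        bit<1>  dir;      /* 0 = forward path, 1 = reverse path          */
        bit<15> qdepth;   /* maximum queue depth observed so far         */
        bit<32> digest;   /* path digest set by the source ToR           */
                          /* (unchanged along the path)                  */
    }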

Your switch will have multiple tables, which the control plane will populate with static rules. We have already defined the control plane rules, so you only need to implement the data plane logic of your P4 program.

Spoiler alert: There is a reference solution in the solution sub-directory. Feel free to compare your implementation to the reference.

Step 1: Run the (incomplete) starter code

The directory with this README also contains a skeleton P4 program, hula.p4, which initially drops all packets. Your job (in the next step) will be to extend it to properly update HULA packets and forward data packets.

Before that, let's compile the incomplete hula.p4 and bring up a switch in Mininet to test its behavior.

  1. In your shell, run:

    ./run.sh
    

    This will:

    • compile hula.p4, and
    • start a Mininet instance with three ToR switches (s1, s2, s3) and two spine switches (s11, s22).
    • The hosts (h1, h2, h3) are assigned the IPs 10.0.1.1, 10.0.2.2, and 10.0.3.3.
  2. You should now see a Mininet command prompt. Just ping h2 from h1:

    mininet> h1 ping h2
    

The ping doesn't work because no paths have been set up yet.

  3. Type exit to close the Mininet command line.

The message was not received because each switch is programmed with hula.p4, which drops all data packets. Your job is to extend this file.

A note about the control plane

P4 programs define a packet-processing pipeline, but the rules governing packet processing are inserted into the pipeline by the control plane. When a rule matches a packet, its action is invoked with parameters supplied by the control plane as part of the rule.

In this exercise, the control plane logic has already been implemented. As part of bringing up the Mininet instance, the run.sh script will install packet-processing rules in the tables of each switch. These are defined in the sX-commands.txt files, where X corresponds to the switch number.

Important: A P4 program also defines the interface between the switch pipeline and control plane. The sX-commands.txt files contain lists of commands for the BMv2 switch API. These commands refer to specific tables, keys, and actions by name, and any changes in the P4 program that add or rename tables, keys, or actions will need to be reflected in these command files.
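
Each line in these files uses the BMv2 CLI's table_add syntax, roughly table_add <table> <action> <match key(s)> => <action parameter(s)>. A hypothetical example (the action name, key, and MAC address below are illustrative, not copied from the provided files):

    table_add dmac set_dmac 2 => 00:00:00:00:02:02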

Step 2: Implement HULA

The hula.p4 file contains a skeleton P4 program with key pieces of logic replaced by TODO comments. These should guide your implementation: replace each TODO with logic implementing the missing piece.

A complete hula.p4 will contain the following components:

  1. Header type definitions for Ethernet (ethernet_t), HULA (hula_t), source routing (srcRoute_t), IPv4 (ipv4_t), and UDP (udp_t).
  2. Parsers for the above headers.
  3. Registers:
     • srcindex_qdepth_reg: at the destination ToR, saves the queue length of the best path from each source ToR
     • srcindex_digest_reg: at the destination ToR, saves the digest of the best path from each source ToR
     • dstindex_nhop_reg: at each hop, saves the next hop to reach each destination ToR
     • flow_port_reg: at each hop, saves the next hop for each flow
  4. hula_fwd table: looks at the destination IP of a HULA packet. If this switch is the destination ToR, it runs the hula_dst action to set the meta.index field based on the source IP (the source ToR). The index is used later to find the queue depth and digest of the current best path from that source ToR. Otherwise, this table just runs srcRoute_nhop to perform source routing.
  5. hula_bwd table: on the reverse path, updates the next hop to the destination ToR using the hula_set_nhop action. The action updates the dstindex_nhop_reg register.
  6. hula_src table: checks the source IP address of a HULA packet on the reverse path. If this switch is the source, this is the end of the reverse path, so the packet is dropped. Otherwise, the srcRoute_nhop action continues source routing along the reverse path.
  7. hula_nhop table: for data packets, reads the destination IP/24 to get an index. It uses the index to read the dstindex_nhop_reg register and obtain the best next hop toward the destination ToR.
  8. dmac table: updates the Ethernet destination address based on the next hop.
  9. An apply block with the following logic:
     • If the packet has a HULA header:
       • On the forward path (hdr.hula.dir == 0):
         • Apply the hula_fwd table to check whether this switch is the destination ToR.
         • If this switch is the destination ToR (the hula_dst action ran and set meta.index based on the source IP address):
           • Read srcindex_qdepth_reg to get the queue length of the current best path from the source ToR.
           • If the new queue length is better, update the entry in srcindex_qdepth_reg and save the path digest in srcindex_digest_reg. Then return the HULA packet to the source ToR by sending it back out its ingress port and setting hula.dir = 1 (reverse path).
           • Otherwise, if this HULA packet came through the current best path (hula.digest equals the value in srcindex_digest_reg), update its queue length in srcindex_qdepth_reg. In this case we don't need to send the HULA packet back, so drop it.
       • On the reverse path (hdr.hula.dir == 1):
         • Apply hula_bwd to update the HULA next hop toward the destination ToR.
         • Apply the hula_src table to drop the packet if this switch is the source ToR of the HULA packet.
     • If it is a data packet:
       • Compute the hash of the flow.
       • TODO: read the next-hop port from flow_port_reg into a temporary variable, say port.
       • TODO: if no entry is found (port == 0), find the next hop by applying the hula_nhop table, then save the chosen port into flow_port_reg for later packets.
       • TODO: if an entry is found, save port into standard_metadata.egress_spec to finish routing. (A sketch of this data-packet logic and of the egress control appears after this list.)
       • Apply the dmac table to update ethernet.dstAddr. This is necessary on links that deliver packets to hosts; otherwise their NICs will drop the packets.
     • Update the TTL.
  10. TODO: An egress control that, for HULA packets on the forward path (hdr.hula.dir == 0), compares standard_metadata.deq_qdepth to hdr.hula.qdepth and keeps the maximum in hdr.hula.qdepth.
  11. A deparser that selects the order in which headers are inserted into the outgoing packet.
  12. A package instantiation supplied with the parser, controls, checksum verification and recomputation, and deparser.
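
To make the data-packet TODOs and the egress control more concrete, here is a hedged sketch of one way the missing pieces might look. It is not the reference solution: the metadata field meta.flow_hash, the register width and size, and the exact hash inputs are assumptions, so adapt the names to the declarations already present in the hula.p4 skeleton.

    /* Sketch only: names, widths, and sizes below are assumptions.      */
    /* Fragment for the data-packet branch of the ingress apply block.   */

    /* Hash the flow's 5-tuple into an index for flow_port_reg.          */
    hash(meta.flow_hash, HashAlgorithm.crc16, (bit<16>)0,
         { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr,
           hdr.udp.srcPort,  hdr.udp.dstPort,
           hdr.ipv4.protocol },
         (bit<32>)1024);

    /* Look up the port pinned for this flow (0 means "no entry yet").   */
    bit<16> port;
    flow_port_reg.read(port, (bit<32>)meta.flow_hash);

    if (port == 0) {
        /* New flow: pick the current best next hop toward the
           destination ToR, then pin it so later packets follow the
           same path.                                                    */
        hula_nhop.apply();
        flow_port_reg.write((bit<32>)meta.flow_hash,
                            (bit<16>)standard_metadata.egress_spec);
    } else {
        /* Existing flow: reuse the pinned port.                         */
        standard_metadata.egress_spec = (bit<9>)port;
    }

    dmac.apply();

The egress control (component 10 above) could be sketched in a similar spirit; the control and type names simply follow whatever the skeleton already declares:

    control MyEgress(inout headers hdr, inout metadata meta,
                     inout standard_metadata_t standard_metadata) {
        apply {
            /* On the forward path, fold the local queue depth into the
               running maximum carried by the probe.                     */
            if (hdr.hula.isValid() && hdr.hula.dir == 0) {
                if ((bit<15>)standard_metadata.deq_qdepth > hdr.hula.qdepth) {
                    hdr.hula.qdepth = (bit<15>)standard_metadata.deq_qdepth;
                }
            }
        }
    }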

Step 3: Run your solution

  1. Run Mininet as in Step 1 (with ./run.sh).

  2. Open a separate terminal, go to exercises/hula, and run sudo ./generatehula.py. This Python script makes each ToR switch generate one HULA probe toward every other ToR along each distinct forward path. For example, s1 first probes s2 via s11 and then via s22; it then probes s3, again first via s11 and then via s22. s2 does the same thing to probe its paths to s1 and s3, and so does s3.

  3. Now run h1 ping h2. The ping should work if you have completed the ingress control block in hula.p4. Note that at this point every ToR considers all paths equal, because there is no congestion in the network.

Now we are going to test a more complex scenario.

We first create two iperf sessions: one from h1 to h3 and the other from h2 to h3. Since both s1 and s2 currently think their best path to s3 goes through s11, the two connections will use the same spine switch (s11). Note that we have throttled the links from the spine switches to s3 down to 1Mbps, so each of the two connections achieves only ~512Kbps. Let's confirm this with the following steps.

  1. Open a terminal window on h1, h2, and h3:

     xterm h1 h2 h3

  2. Start an iperf server at h3:

     iperf -s -u -i 1

  3. Run an iperf client at h1:

     iperf -c 10.0.3.3 -t 30 -u -b 2m

  4. Run an iperf client at h2 (try to do steps 3 and 4 simultaneously):

     iperf -c 10.0.3.3 -t 30 -u -b 2m

While the connections are running, watch the iperf server's output at h3. Although there are two completely non-overlapping paths from h1 and h2 to h3, both hosts end up using the same spine, and hence the aggregate throughput of the two connections is capped at 1Mbps. You can confirm this by watching the performance of each connection.

Our goal is to allow the two connections to use two different spine switches and hence achieve 1Mbps each. We can do this by first causing congestion on one of the spines. More specifically, we'll create congestion at the queue in s11 facing the s11-to-s3 link by running a long-lived connection (an elephant flow) from s1 to s3 through s11. Once the queue builds up due to the elephant, we'll let s2 generate HULA probes several times so that it learns to avoid forwarding new flows destined to s3 through s11. The following steps achieve this.

  1. Open a terminal window on h1, h2, and h3. (If you have already closed Mininet, re-run the Mininet test and run generatehula.py first to set up the initial routes.)

     xterm h1 h2 h3

  2. Start an iperf server at h3:

     iperf -s -u -i 1

  3. Create a long-running, full-demand connection from h1 to h3 through s11 by running the following at h1:

     iperf -c 10.0.3.3 -t 3000 -u -b 2m

  4. Outside Mininet (in a separate terminal), go to exercises/hula and run the following several (5 to 10) times:

     sudo ./generatehula.py

     This should let s2 learn that the path to s3 through s11 is congested and that the best path now goes through the uncongested spine, s22.

  5. Now run an iperf client at h2:

     iperf -c 10.0.3.3 -t 30 -u -b 2m

You should be able to confirm that both iperf sessions achieve ~1Mbps because they go through two different spines.

Food for thought

  • How can we implement flowlet routing (as opposed to flow routing), say based on packet timestamps? (A sketch follows this list.)
  • In the ingress control logic, the destination ToR always sends a HULA packet back on the reverse path if the queue length is better. But this is not necessary if the probe came through the current best path. Can you improve the code?
  • HULA packets on a congested path may be dropped or extremely delayed, so the destination ToR would not learn that the condition of the current best path has worsened. One solution is for the destination ToR to use a timeout mechanism: ignore the current best path if no HULA packet has arrived through it for a long time. How can you implement this inside the data plane?
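
For the first question, a common approach is to keep a per-flow "last seen" timestamp and re-resolve the next hop whenever the gap between two packets of a flow exceeds a flowlet timeout. A rough sketch, in which the register name, sizes, and timeout value are all assumptions:

    /* Flowlet sketch: every name and constant below is an assumption.   */

    /* Declared alongside the other registers in the ingress control:    */
    register<bit<48>>(1024) flowlet_time_reg;   /* last packet time per flow */

    /* Inside the data-packet branch of the apply block, after hashing   */
    /* the flow:                                                         */
    bit<48> now = standard_metadata.ingress_global_timestamp;
    bit<48> last_seen;
    flowlet_time_reg.read(last_seen, (bit<32>)meta.flow_hash);
    flowlet_time_reg.write((bit<32>)meta.flow_hash, now);

    if (now - last_seen > 48w50000) {           /* ~50 ms gap: new flowlet */
        /* Start a new flowlet: re-select the path from the current best
           next hop and re-pin it, instead of reusing the port already
           stored in flow_port_reg.                                       */
        hula_nhop.apply();
        flow_port_reg.write((bit<32>)meta.flow_hash,
                            (bit<16>)standard_metadata.egress_spec);
    }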

Troubleshooting

There are several ways that problems might manifest:

  1. hula.p4 fails to compile. In this case, run.sh will report the error emitted from the compiler and stop.

  2. hula.p4 compiles but does not support the control plane rules in the sX-commands.txt files that run.sh tries to install using the BMv2 CLI. In this case, run.sh will report these errors to stderr. Use these error messages to fix your hula.p4 implementation.

  3. hula.p4 compiles, and the control plane rules are installed, but the switch does not process packets in the desired way. The build/logs/<switch-name>.log files contain trace messages describing how each switch processes each packet. The output is detailed and can help pinpoint logic errors in your implementation. The build/<switch-name>-<interface-name>.pcap files contain pcap traces of the packets on each interface. Use tcpdump -r <filename> -xxx to print a hexdump of the packets.

Cleaning up Mininet

In the latter two cases above, run.sh may leave a Mininet instance running in the background. Use the following command to clean up these instances:

mn -c

Next Steps

Congratulations, your implementation works!