Maximum Frustration Unit

We’ve just gotten a new link to the internet at work. After getting online I did some quick tyre-kicking. Since we self-host our git repositories, this included making an SSH connection from an appserver back to the office. I made the usual invocation, but no dice: just a lone blinking caret on my prompt. Over our previous internet connection, this worked just fine.

I immediately suspect the ISP of blocking ports, or myself of forgetting to forward port 22 to the git-hosting machine. I connect with telnet to reduce the variables.

home ≋ telnet office 22
Connected to office
Escape character is '^]'.
SSH-2.0-OpenSSH_6.0p1 Debian-3ubuntu1.2

Huh, well, that’s weird. The connectivity is fine. So the SSH client is probably getting stuck somewhere during handshake?

home ≋ ssh -v office
OpenSSH_6.6, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /home/user/.ssh/config
# snip
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5
debug1: kex: client->server aes128-ctr hmac-md5
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

… and then nothing, until a timeout. I’m confident the key itself is OK, since I can log in successfully from an in-office machine, but forwarding the SSH agent from that machine to home and trying to come back in leads to the same result. Folk online suggest configuring some ciphers in /etc/ssh/ssh_config, which I doubt is related, since this client/server pair has communicated just fine in the past. Setting the ciphers indeed does nothing. What next?

Since I can still get a shell on both machines, I use tcpdump to inspect traffic from both sides during the SSH handshake. The home machine keeps repeating ACK packets to the office machine, but the office machine doesn’t reply.
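For anyone wanting to repeat the capture, here is a minimal sketch. The helper name ssh_capture_cmd and the interface name eth0 on the home machine are my assumptions for illustration; the flags are standard tcpdump. It only prints the command, since capturing normally needs root.

```shell
# Hypothetical helper: print a tcpdump command that captures SSH traffic
# exchanged with one peer. It prints rather than runs the command,
# because capturing normally needs root.
ssh_capture_cmd() {
  iface=$1   # interface to listen on (eth0 is an assumption)
  peer=$2    # hostname or address of the other machine
  # -n skips reverse DNS lookups in the output; the filter keeps only
  # TCP port 22 traffic to or from the peer.
  echo "sudo tcpdump -n -i $iface host $peer and tcp port 22"
}

ssh_capture_cmd eth0 office   # command to run on home
ssh_capture_cmd eth0 home     # command to run on office
```

With one capture running on each side, packets that show up leaving one machine but never arriving at the other stand out quickly.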

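The unanswered ACKs suggest that the packets themselves, and presumably their size, are the problem rather than the protocol or the handshake: the short telnet banner got through, but the session stalls once larger packets are due. Let’s try lowering the MTU?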

office ≋ sudo ip link set eth0 mtu 1492

This is a shot in the dark—but I can now connect. Now to hunt down where the problem occurs.

home ≋ tracepath -p22 office
1?: [LOCALHOST]                                   pmtu 1500
# snip
15:  office                               9.095ms pmtu 1492

And from the other side:

user@office ≋ tracepath -p22 home
1?: [LOCALHOST]                                   pmtu 1500
1:  Zero                                  0.502ms
1:  Zero                                  0.489ms
2:  Zero                                  0.465ms pmtu 1492
# snip

Looks like the shenanigans are happening at Zero, which is a router, but the MTU values are being reported OK. Why aren’t packets being fragmented as they’re designed to be? Well. That’s a question for another day.
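In the meantime, the 1492 figure can be cross-checked with a don't-fragment ping. The sketch below assumes Linux iputils ping (where -M do sets the DF bit) and IPv4 without header options; icmp_payload and df_ping_cmd are names I made up. It prints the commands rather than running them.

```shell
# Largest ICMP payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload() {
  echo $(( $1 - 20 - 8 ))
}

# Print (rather than run) a don't-fragment ping for a given MTU and host.
# -M do forbids fragmentation (Linux iputils); -s sets the payload size.
df_ping_cmd() {
  echo "ping -M do -c 3 -s $(icmp_payload "$1") $2"
}

df_ping_cmd 1492 office   # should fit if the path MTU really is 1492
df_ping_cmd 1500 office   # should not fit if something on the path caps it lower
```

If the smaller ping gets replies and the larger one gets neither replies nor an error about the packet being too big, that would point at oversized packets being dropped silently somewhere along the path.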