SSL handshake failure with 1.12 storage nodes over I2P #2861

Open
opened 2016-12-30 16:44:13 +00:00 by str4d · 20 comments
str4d commented 2016-12-30 16:44:13 +00:00
Owner

Steps to recreate:

  • Create a storage node with --hide-ip --listen=i2p and connections tcp = disabled.
  • Set the introducer to an I2P grid.
  • Start the storage node.

Expected behaviour:

The storage node connects to all existing storage nodes that are online and reachable.

Actual behaviour:

The storage node connects fine to older storage nodes (running the patched 1.9.2 or 1.10.0 Tahoe-LAFS+I2P builds), but fails to connect to 1.12.0 storage nodes (including itself via loopback). Specifically, the web UI shows that it gets past "connecting" to "negotiating", and then throws:

negotiation failed: [('SSL routines', 'SSL23_READ', 'ssl handshake failure')]
tahoe-lafs added the unknown, critical, defect, 1.12.0 labels 2016-12-30 16:44:13 +00:00
tahoe-lafs added this to the 1.12.1 milestone 2016-12-30 16:44:13 +00:00
warner commented 2017-01-01 19:08:56 +00:00
Author
Owner

Huh, I don't know what that could be. My hunch is that the I2P connection is not really established, and the negotiation messages are getting dropped or corrupted somehow.

Let's try to extract more information from foolscap:

  • shut everything down
  • start your Introducer
  • start your client node
  • flogtool tail -s out.flog CLIENTNODEDIR/private/logport.furl
  • start a 1.12.0 storage node

When the storage node's announcement arrives (via the Introducer), the client will attempt to connect, and will record some of the negotiation process into out.flog. We're looking for deviations from the usual negotiation process, maybe something about an expected message not being seen.

If that doesn't yield anything immediately useful, the next step will be to modify the foolscap code on both sides and have them log everything they get over the connection. It'd be interesting to know whether they make it far enough to switch to TLS (and it's really the TLS handshake that's getting broken), or if something goes awry before that point (when they're still speaking HTTP-ish).

For background, Foolscap starts by making a plain TCP connection, then exchanges a very HTTP-like request and response, then both sides are supposed to do .start_tls(). So the connecting host will send GET /id/$tubid HTTP/1.1 and some headers (including Upgrade: TLS/1.0) and a double-newline and then the very next byte will be the TLS negotiation (maybe CLIENTHELLO?). The receiving host will send HTTP/1.1 101 Switching Protocols and some headers and a double-newline and then start on the TLS bytes.
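To make the shape of that exchange concrete, here is a minimal sketch of the connecting side — illustrative only, not Foolscap's actual negotiation code, and the tub ID and context factory are placeholders:

```python
# Minimal sketch of the connecting side described above. Assumes pyOpenSSL is
# available so the TCP transport implements ITLSTransport.startTLS().
from twisted.internet import protocol, ssl

class UpgradeThenStartTLS(protocol.Protocol):
    def __init__(self, tubid=b"EXAMPLETUBID"):
        self.tubid = tubid
        self.buf = b""
        self.upgraded = False

    def connectionMade(self):
        # HTTP-like request; after the blank line, the very next byte this
        # side sends will be TLS.
        self.transport.write(
            b"GET /id/" + self.tubid + b" HTTP/1.1\r\n"
            b"Upgrade: TLS/1.0\r\n"
            b"Connection: Upgrade\r\n"
            b"\r\n"
        )

    def dataReceived(self, data):
        if self.upgraded:
            # After startTLS(), anything arriving here is already-decrypted
            # application data; a real protocol would handle it.
            return
        self.buf += data
        if b"101 Switching Protocols" in self.buf and b"\r\n\r\n" in self.buf:
            self.upgraded = True
            # This transport was created as a client, so the default TLS
            # direction is correct here: this side sends the ClientHello.
            self.transport.startTLS(ssl.ClientContextFactory())
```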

Foolscap is expecting the connection it gets to be 8bit-clean and transparent. Can you think of any reason why the I2P proxy might be interpreting HTTP-like data inside the connection and maybe modifying the data or its behavior in response to what it sees?

str4d commented 2017-01-10 17:56:12 +00:00
Author
Owner

Attachment introducer-log.txt (15435 bytes) added

Log of 1.12 introducer receiving connection from 1.12 storage node

str4d commented 2017-01-10 18:15:06 +00:00
Author
Owner

This issue affects a 1.12 storage node connecting to a 1.12 introducer as well. See the attached logs.

str4d commented 2017-01-10 18:22:28 +00:00
Author
Owner

Attachment storage-log.txt (28300 bytes) added

Log of 1.12 storage node connecting to 1.12 introducer over I2P

str4d commented 2017-01-10 20:05:29 +00:00
Author
Owner

Attachment client-to-introducer-1.12.ssldump.txt (6482 bytes) added

ssldump of communication between Tahoe 1.12 client and 1.12 introducer

str4d commented 2017-01-10 20:10:17 +00:00
Author
Owner

Attachment client-to-introducer-1.11.ssldump.txt (3622 bytes) added

ssldump of communication between Tahoe 1.12 client and 1.11 introducer

str4d commented 2017-01-10 20:18:58 +00:00
Author
Owner

The two new files show the network behaviour for a 1.12 client connecting over I2P to a 1.11 introducer vs a 1.12 introducer. The 1.11 introducer is my custom-patched build, but the patches don't touch negotiation at all, so it is equivalent to stock 1.11 for these purposes.

The only difference between the traces pre-TLS is that the 1.12 introducer sends the client a bunch of extra cruft at the end of its HTTP/1.1 101 Switching Protocols packet. The client doesn't seem to care, though.

The client then sends a ClientHello packet to the introducer, which AFAICT is identical in both cases. However, the 1.11 introducer responds with a ServerHello, while the 1.12 introducer closes the connection.

str4d commented 2017-01-10 20:21:03 +00:00
Author
Owner

Note that the introducer-to-client cruft visible in the ssldump trace looks exactly the same (at least, the first few bytes match) as that in the additional dataReceived() calls in the earlier storage node logs.

str4d commented 2017-01-10 20:26:52 +00:00
Author
Owner

Attachment client-to-introducer-1.12-noi2p.ssldump.txt (3493 bytes) added

ssldump of communication between Tahoe 1.12 client and 1.12 introducer using localhost TCP instead of I2P

str4d commented 2017-01-10 20:27:52 +00:00
Author
Owner

Okay, it is definitely related to whatever that cruft is. I manually tweaked the tahoe.cfg files of the 1.12 client and 1.12 introducer to use localhost TCP instead of I2P. The cruft disappeared and the connection started working.

warner commented 2017-01-17 16:40:43 +00:00
Author
Owner

We're on a tight timeline for the Debian freeze... I think we need to get 1.12.1 released in about 24 hours to get it into Stretch. Any progress on this one?

warner commented 2017-01-19 01:09:13 +00:00
Author
Owner

Ticket retargeted after milestone closed

tahoe-lafs modified the milestone from 1.12.1 to 1.13.0 2017-01-19 01:09:13 +00:00
warner commented 2017-01-31 19:31:31 +00:00
Author
Owner

We did a lot of digging in today's devchat, and learned the following:

  • txi2p's "listening" sockets are actually clients: the tahoe/foolscap/txi2p side makes an outbound TCP connection to the local I2P daemon (to the "SAM" API port), writes a message that says "hi, I'd like to receive connections for i2p address BLAH", then waits for a response. Later, when someone connects to the I2P daemon using that address, the tahoe node gets a message on the socket that says "connection incoming!", then the next byte is data from the remote end
  • Twisted's startTLS() method knows whether its connection is a client-like or a server-like connection (transport._tlsClientDefault), and tells TLS to use a ClientHello or ServerHello to match
  • normal Foolscap outbound connections use IStreamClientEndpoint, and inbound Listeners use IStreamServerEndpoint, so TLS gets the right direction. The txtorcon onion-service listener uses a server-like connection, so that works too.
  • startTLS is probably getting the direction of the connection backwards for txi2p's listener, because it's using an outbound connection to the SAM endpoint, but then needs to run the server side of the TLS handshake
  • you can use transport.startTLS(ctx, normal=False) to tell it to flip the direction, which would probably help

We don't yet know a good way to tell Foolscap that it needs to pass in this argument. Some options:

  • the foolscap connection-handler could emit an additional value (so handler.hint_to_endpoint() could somehow return (endpoint, tls_is_reversed))
  • the handler could be responsible for producing an ITransport that has an extra attribute, and foolscap could check this attribute before calling startTLS: normal = getattr(self.transport, "_foolscap_tls_is_normal", True) (a rough sketch of this option follows the list)
  • the handler could produce an ITransport that overrides startTLS() to upcall with the right normal argument
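A rough sketch of the second option — purely illustrative, none of these names exist in Foolscap or txi2p today — might look like this: the connection handler marks the transport it hands back, and the negotiation code consults the mark before upgrading to TLS.

```python
# Hypothetical sketch of option (b): a marked transport wrapper plus a
# getattr() check. Not an implemented Foolscap or txi2p API.

class TLSDirectionMarkedTransport:
    """Proxy a transport and flag that the TLS direction must be reversed."""
    _foolscap_tls_is_normal = False   # underlying connection is client-like,
                                      # but this side must act as TLS server

    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, name):
        # Delegate everything else (write, loseConnection, startTLS, ...)
        return getattr(self._transport, name)

def negotiation_switch_to_tls(negotiator, ctx):
    # What foolscap's negotiation code might do before upgrading: respect the
    # mark if present, otherwise keep today's behaviour (normal=True).
    normal = getattr(negotiator.transport, "_foolscap_tls_is_normal", True)
    negotiator.transport.startTLS(ctx, normal=normal)
```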

One complication is that it isn't always obvious (to e.g. txi2p) whether the connection it was given is a client-like or server-like transport (or whether it's capable of startTLS at all). It's unfortunate that startTLS() takes normal= rather than isClient=. Foolscap knows for sure whether it wants TLS to be client-like or server-like, but when the only knob we have is normal=, we must also know whether the underlying ITransport is client-like or server-like (so we know when to reverse TLS's handling). As far as we've been able to tell, the ITransport client-vs-server flag is private, even though the normal= argument is public.
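For reference, the direction resolution we believe Twisted performs boils down to something like the following (a paraphrase of our reading, not a verbatim copy of Twisted's code):

```python
# Paraphrase of how we understand startTLS() picks the TLS side: the private
# _tlsClientDefault flag records whether the transport was created as a
# client, and normal= only says whether to follow or reverse that flag.
def resolve_tls_side(transport, normal=True):
    underlying_is_client = transport._tlsClientDefault   # private in Twisted
    if normal:
        return "client" if underlying_is_client else "server"
    return "server" if underlying_is_client else "client"

# For a txi2p listener the underlying SAM connection is outbound, so
# _tlsClientDefault is True; normal=True therefore yields "client" even though
# Foolscap needs the server side of the handshake, and normal=False flips it.
```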

Some additional things to check before diving too deep into finding a good approach:

  • look at the extra cruft in the "client-to-introducer-1.12.ssldump.txt" trace above, immediately following the Client: Upgrade message, and confirm that it really is a ClientHello. (It is supposed to be a ServerHello, but if startTLS is confused by txi2p using client-like connections, it makes sense that we'd send a ClientHello here.) We don't know why ssldump didn't parse it as such (maybe it wasn't expecting a TLS packet to appear in the middle of a protocol stream, which would imply that ssldump doesn't handle STARTTLS-like protocols very well). Either compare these bytes against a normal Wireshark trace, or look up the TLS docs and manually check the packet format. str4d astutely noticed that the cruft bytes include things like "c0 30" and "c0 2c", which were identified by ssldump (in the "-noi2p.ssldump.txt" trace) as unrecognized ciphersuite values, and that only the ClientHello contains multiple ciphersuites (the ServerHello contains only the decision). He also noticed that the Foolscap server shouldn't be sending any TLS messages at all until the client has sent the ClientHello: TLS servers make the decision, so they can't send anything without first hearing the client's hello.
  • hack something (probably foolscap/negotiate.py) to set normal=False and see if that makes the connections work

Other possibilities that we came up with:

  • txi2p's "parsely" parser might be incorrectly matching something in the TLS Hello and thinking it's a SAM message
  • txi2p's protocol handoff (where the SAM parser stops, and all further bytes are delivered to the wrapped protocol) might be dropping, duplicating, delaying, or reordering bytes, causing a TLS message to be corrupted or delivered twice. Foolscap has suffered from reentrancy and buffering problems in areas like this in the past; it's fertile breeding ground for bugs.

We identified at least two concerns about the way txi2p is working, that shouldn't affect correctness but probably affect performance:

  • txi2p.sam.stream.StreamAcceptReceiver.dataReceived: any application data that is received in the same chunk as the initial peer-destination line will be delayed. It gets stashed as self.initialData properly, but will not be delivered until the next dataReceived is called. If the peer sends an initial chunk and then waits for a response, the local application will never receive that chunk. This is not a problem for client-goes-first protocols like HTTP, but would cause a loss of progress for server-goes-first protocols like SMTP
  • inbound streams will have application data delivered one byte at a time, because the Parsley-based parser used by StreamAcceptReceiver (txi2p/grammar.py) uses an anything:data clause to match all bytes once the parser has moved into the post-SAM State_readData state, and that clause probably just matches a single wildcard byte. This is sound, but probably bad for performance (especially for foolscap), since a large chain of python methods will be executed for every byte of the input. It would be fastest if large bytestrings could be transferred in complete buffers in a single call. We should do some performance tests on this and compare the CPU usage of a tahoe server (during file upload) for a given fixed data rate, I2P vs plain TCP. Ideally the txi2p parser would be bypassed completely once a connection has been moved to State_readData, similar to twisted.protocols.basic.LineReceiver.setRawMode() (a minimal illustration follows this list), but doing that safely requires careful attention to the .dataReceived() ordering/duplication/reentrancy concerns described above.
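A minimal illustration of the setRawMode() idea mentioned in the last item — class and method names here are invented for the example, not txi2p's actual ones:

```python
# Illustrative only -- not txi2p's real classes. Twisted's LineReceiver parses
# the line-oriented preamble; once the handoff point is reached we switch to
# raw mode so later data arrives as whole buffers, not one byte at a time.
from twisted.protocols import basic

class PreambleThenRaw(basic.LineReceiver):
    delimiter = b"\n"

    def __init__(self, wrapped_protocol):
        self.wrapped = wrapped_protocol

    def lineReceived(self, line):
        # Pretend this line marks the end of the SAM-style preamble (e.g. the
        # peer-destination line); everything after it is application data.
        if self._is_end_of_preamble(line):
            self.setRawMode()

    def rawDataReceived(self, data):
        # Delivered in whole chunks, in a single call, to the wrapped protocol.
        self.wrapped.dataReceived(data)

    @staticmethod
    def _is_end_of_preamble(line):
        # Placeholder predicate: in real code this would recognise the SAM
        # peer-destination line; here any non-empty line ends the preamble.
        return bool(line)
```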
warner commented 2017-01-31 19:55:59 +00:00
Author
Owner

I've confirmed that a quick hack to call startTLS(ctx, normal=False) on an I2P-based Introducer (but not the client) was enough to allow a connection to get through. Not a solution yet, but it suggests we're headed in the right direction.

warner commented 2017-01-31 21:32:24 +00:00
Author
Owner

Hm, it might be cleanest to get a new API added to Twisted: maybe a side= argument to ITLSTransport.startTLS(). It would override normal=, and would explicitly declare the TLS stack to be "client" or "server", independent of the underlying transport. Foolscap knows exactly which TLS side it's supposed to be, so it could just set it directly.

I'll see if there are any relevant Twisted tickets on this, and if not I'll add one.

warner commented 2017-01-31 22:10:07 +00:00
Author
Owner

Twisted#3278 (http://twistedmatrix.com/trac/ticket/3278) appears to be the same issue (two AMP clients trying to use TLS). The patch on that ticket is 9 years old (!), so it will undoubtedly need some cleanup, but it's probably a good place to start.

warner commented 2017-01-31 23:12:59 +00:00
Author
Owner

and I've confirmed that the 16 03 01 00 e0 01 00 00 dc 03 03 22 71 sequence (in the "cruft") is a ClientHello: the 16 means "Handshake", 03 01 is TLS 1.0, 00 e0 is the record length, 01 is ClientHello, and 00 00 dc is the handshake length (all according to the Wikipedia TLS handshake diagram: https://en.wikipedia.org/wiki/Transport_Layer_Security#Handshake_protocol).
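As a quick sanity check of that decoding, using only the public TLS record and handshake header layout (nothing Foolscap-specific):

```python
# Decode the first bytes of the "cruft" as a TLS record + handshake header.
cruft = bytes.fromhex("16 03 01 00 e0 01 00 00 dc")

content_type = cruft[0]                               # 0x16 = Handshake record
record_version = (cruft[1], cruft[2])                 # 0x03 0x01 = TLS 1.0
record_length = int.from_bytes(cruft[3:5], "big")     # 0x00e0 = 224 bytes
handshake_type = cruft[5]                             # 0x01 = ClientHello (0x02 = ServerHello)
handshake_length = int.from_bytes(cruft[6:9], "big")  # 0x0000dc = 220 bytes

# 224 = 220 + 4-byte handshake header, so the lengths are self-consistent.
assert content_type == 0x16 and handshake_type == 0x01
print(record_length, handshake_length)                # 224 220
```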

str4d commented 2017-04-03 06:06:54 +00:00
Author
Owner

In order to make storage nodes usable in the short term, I've pushed a workaround for this bug (https://github.com/str4d/txi2p/commit/41a68363a60cbc861c9826ef43e0ab65cbcedca1) to txi2p, which I will release shortly as 0.3.2. It's not particularly robust, as it assumes that the underlying transport always behaves as a client; I agree with warner that a new API in Twisted would be a much better solution.

exarkun commented 2020-06-30 14:45:13 +00:00
Author
Owner

Moving open issues out of closed milestones.

tahoe-lafs modified the milestone from 1.13.0 to 1.15.0 2020-06-30 14:45:13 +00:00
meejah commented 2021-03-30 18:40:19 +00:00
Author
Owner

Ticket retargeted after milestone closed

tahoe-lafs modified the milestone from 1.15.0 to soon 2021-03-30 18:40:19 +00:00