NFS clients send requests to NFS servers via Remote Procedure
Calls, or RPCs. The RPC client discovers remote service
endpoints automatically, handles per-request authentication,
adjusts request parameters for different byte endianness on
client and server, and retransmits requests that may have been
lost by the network or server. RPC requests and replies flow
over a network transport.
In most cases, the mount(8) command, NFS client, and NFS server
can automatically negotiate proper transport and data transfer
size settings for a mount point. In some cases, however, it pays
to specify these settings explicitly using mount options.
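For example, a mount invocation along the following lines pins the
transport and transfer sizes explicitly (the server name and export
path are placeholders):

       # Request TCP and 32 KB read/write sizes explicitly;
       # server.example.com:/export is a hypothetical export.
       mount -t nfs -o proto=tcp,rsize=32768,wsize=32768 \
               server.example.com:/export /mnt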
Traditionally, NFS clients used the UDP transport exclusively for
transmitting requests to servers. Though its implementation is
simple, NFS over UDP has many limitations that prevent smooth
operation and good performance in some common deployment
environments. Even an insignificant packet loss rate results in
the loss of whole NFS requests; as such, retransmit timeouts are
usually in the subsecond range to allow clients to recover
quickly from dropped requests, but this can result in extraneous
network traffic and server load.
However, UDP can be quite effective in specialized settings where
the network's MTU is large relative to NFS's data transfer size
(such as network environments that enable jumbo Ethernet frames).
In such environments, trimming the rsize and wsize settings so
that each NFS read or write request fits in just a few network
frames (or even in a single frame) is advised. This reduces
the probability that the loss of a single MTU-sized network frame
results in the loss of an entire large read or write request.
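As an illustration (host and export names are placeholders), a UDP
mount trimmed for a 9000-byte MTU might look like this:

       # With jumbo frames, an 8 KB NFS request fits in a single frame
       mount -t nfs -o proto=udp,rsize=8192,wsize=8192 \
               server.example.com:/export /mnt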
TCP is the default transport protocol used for all modern NFS
implementations. It performs well in almost every conceivable
network environment and provides excellent guarantees against
data corruption caused by network unreliability. TCP is often a
requirement for mounting a server through a network firewall.
Under normal circumstances, networks drop packets much more
frequently than NFS servers drop requests. As such, an
aggressive retransmit timeout setting for NFS over TCP is
unnecessary. Typical timeout settings for NFS over TCP are
between one and ten minutes. After the client exhausts its
retransmits (the value of the retrans mount option), it assumes a
network partition has occurred, and attempts to reconnect to the
server on a fresh socket. Since TCP itself makes network data
transfer reliable, rsize and wsize can safely be allowed to
default to the largest values supported by both client and
server, independent of the network's MTU size.
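For instance, a one-minute timeout with two retransmissions could be
requested as follows (timeo is given in tenths of a second; the
names are placeholders):

       # timeo=600 means 60 seconds per major timeout; after retrans=2
       # retransmissions the client reconnects on a fresh socket.
       mount -t nfs -o proto=tcp,timeo=600,retrans=2 \
               server.example.com:/export /mnt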
Using the mountproto mount option
This section applies only to NFS version 2 and version 3 mounts,
since NFS version 4 does not use a separate protocol for mount
requests.
The Linux NFS client can use a different transport for contacting
an NFS server's rpcbind service, its mountd service, its Network
Lock Manager (NLM) service, and its NFS service. The exact
transports employed by the Linux NFS client for each mount point
depends on the settings of the transport mount options, which
include proto, mountproto, udp, and tcp.
The client sends Network Status Manager (NSM) notifications via
UDP no matter what transport options are specified, but listens
for server NSM notifications on both UDP and TCP. The NFS Access
Control List (NFSACL) protocol shares the same transport as the
main NFS service.
If no transport options are specified, the Linux NFS client uses
UDP to contact the server's mountd service, and TCP to contact
its NLM and NFS services by default.
If the server does not support these transports for these
services, the mount(8) command attempts to discover what the
server supports, and then retries the mount request once using
the discovered transports. If the server does not advertise any
transport supported by the client or is misconfigured, the mount
request fails. If the bg option is in effect, the mount command
backgrounds itself and continues to attempt the specified mount
request.
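To check which transports a server actually advertises, one can
query its rpcbind service directly (the hostname is a placeholder):

       # List the programs, versions, transports, and ports registered
       # with the server's rpcbind service
       rpcinfo -p server.example.com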
When the proto option, the udp option, or the tcp option is
specified but the mountproto option is not, the specified
transport is used to contact both the server's mountd service and
its NLM and NFS services.
If the mountproto option is specified but none of the proto, udp,
or tcp options is specified, then the specified transport is used
for the initial mountd request, but the mount command attempts to
discover what the server supports for the NFS protocol, preferring
TCP if both transports are supported.
If both the mountproto and proto (or udp or tcp) options are
specified, then the transport specified by the mountproto option
is used for the initial mountd request, and the transport
specified by the proto option (or the udp or tcp options) is used
for NFS, no matter what order these options appear. No automatic
service discovery is performed if these options are specified.
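For example, the following (placeholder) invocation contacts the
server's mountd service via UDP while using TCP for NFS itself:

       # mountproto governs only the initial mountd request;
       # proto governs NFS traffic. No service discovery is performed.
       mount -t nfs -o mountproto=udp,proto=tcp \
               server.example.com:/export /mnt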
If any of the proto, udp, tcp, or mountproto options are
specified more than once on the same mount command line, then the
value of the rightmost instance of each of these options takes
effect.
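For example, in this (placeholder) invocation the rightmost
setting wins, so TCP is used for NFS:

       # proto=udp is overridden by the later proto=tcp
       mount -t nfs -o proto=udp,proto=tcp \
               server.example.com:/export /mnt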
Using NFS over UDP on high-speed links
Using NFS over UDP on high-speed links such as Gigabit Ethernet
can cause silent data corruption.
The problem can be triggered at high loads, and is caused by
problems in IP fragment reassembly. NFS reads and writes typically
transmit UDP packets of 4 kilobytes or more, which have to be
broken up into several fragments in order to be sent over the
Ethernet link, which limits packets to 1500 bytes by default.
This process happens at the IP network layer and is called
fragmentation.
In order to identify fragments that belong together, IP assigns a
16-bit IP ID value to each packet; fragments generated from the
same UDP packet will have the same IP ID. The receiving system
will collect these fragments and combine them to form the
original UDP packet. This process is called reassembly. The
default timeout for packet reassembly is 30 seconds; if the
network stack does not receive all fragments of a given packet
within this interval, it assumes the missing fragment(s) got lost
and discards those it already received.
The problem this creates over high-speed links is that it is
possible to send more than 65536 packets within 30 seconds. In
fact, with heavy NFS traffic one can observe that the IP IDs
repeat after about 5 seconds.
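A rough back-of-the-envelope calculation (assuming 8 KB UDP
datagrams on a saturated Gigabit link) shows why:

       125 MB/s (1 Gbit/s) / 8 KB per datagram = ~15,000 datagrams/s
       65,536 IP IDs / 15,000 datagrams/s      = ~4.4 s until wraparound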
This has serious effects on reassembly: if one fragment gets
lost, another fragment from a different packet but with the same
IP ID will arrive within the 30 second timeout, and the network
stack will combine these fragments to form a new packet. Most of
the time, network layers above IP will detect this mismatched
reassembly; in the case of UDP, the UDP checksum, which is a
16-bit checksum over the entire packet payload, will usually not
match, and UDP will discard the bad packet.
However, the UDP checksum is only 16 bits, so there is a
1-in-65536 chance that it will match even if the packet payload is
completely random (which it very often is not). When the checksum
does match despite mismatched reassembly, silent data corruption
occurs.
This potential for corruption should be taken seriously, at least
on Gigabit Ethernet. Network speeds of 100 Mbit/s should be
considered less problematic, because with most traffic patterns
IP ID wraparound will take much longer than 30 seconds.
It is therefore strongly recommended to use NFS over TCP where
possible, since TCP does not perform fragmentation.
If you absolutely have to use NFS over UDP over Gigabit Ethernet,
some steps can be taken to mitigate the problem and reduce the
probability of corruption:
Jumbo frames: Many Gigabit network cards are capable of
transmitting frames bigger than the 1500 byte
limit of traditional Ethernet, typically 9000
bytes. Using jumbo frames of 9000 bytes will allow
you to run NFS over UDP at a page size of 8K
without fragmentation. Of course, this is only
feasible if all involved stations support jumbo
frames.
To enable a machine to send jumbo frames on cards
that support it, it is sufficient to configure the
interface for an MTU value of 9000 (see the first
example after this list).
Lower reassembly timeout:
By lowering this timeout below the time it takes
the IP ID counter to wrap around, incorrect
reassembly of fragments can be prevented as well.
To do so, simply write the new timeout value (in
seconds) to the file
/proc/sys/net/ipv4/ipfrag_time (see the second
example after this list).
A value of 2 seconds will greatly reduce the
probability of IP ID clashes on a single Gigabit
link, while still allowing for a reasonable
timeout when receiving fragmented traffic from
distant peers.
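For example (eth0 is a placeholder interface name), jumbo frames
can be enabled with:

       # Raise the interface MTU to 9000 to allow jumbo frames
       ip link set dev eth0 mtu 9000

And the reassembly timeout can be lowered to 2 seconds with:

       # Lower the IPv4 fragment reassembly timeout to 2 seconds
       echo 2 > /proc/sys/net/ipv4/ipfrag_time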