User interaction with Lustre file systems
Server-side timeouts
Server-side timeouts can occur as follows:
• When client nodes are connected to MDS and OST services in the HP SFS system, the client nodes
ping their server connections at intervals of one quarter of the period specified by the Lustre
timeout attribute. If a client node has not been in contact for at least 2.25 times the period specified
by the Lustre timeout attribute, the Lustre software proactively evicts the client node.
• If an RPC from the client node is an I/O request, the server must transfer data to or from the client
node. For this operation, the server allows half of the period specified by the Lustre timeout
attribute. If an error occurs during transmission, or the transfer fails to complete within the allocated
time, the server evicts the client node.
When this happens, the next RPC from the client node receives a negative acknowledgement code
from the server to indicate that the client node has been evicted. This causes the client node to
invalidate any dirty pages associated with the MDS or OST service, and this in turn can lead to
application I/O errors.
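As a worked example (the figure of 100 seconds is used here only for illustration; check the value actually
configured on your system, for instance through the /proc interface of contemporary Lustre releases, whose
exact path may differ by release):

  # Minimal check on a client node; path is an assumption for Lustre 1.x-style /proc layouts
  cat /proc/sys/lustre/timeout    # prints the configured timeout in seconds, for example 100

With a timeout of 100 seconds, client nodes ping their server connections every 25 seconds, a bulk data
transfer is allowed 50 seconds to complete, and a client node that has not been in contact for 225 seconds
is evicted.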
Timeouts associated with lock revocations
It is possible to trigger timeouts unexpectedly as a result of the way that Lustre deals with locks and lock
revocations.
The Lustre software coordinates activity using a lock manager (LDLM). Each OST is the lock server for all
data associated with a specific stripe of a file. A client node must obtain a lock to cache dirty data
associated with a file, and at any given moment, a client node holds locks for data that it has read or is
writing. For another client node to access the file, the lock must be revoked.
When a server revokes a lock from a client node, all of the dirty data must be flushed to the server before
the time period allocated to the RPC expires (that is, half of the value of the Lustre timeout attribute).
In a borderline configuration, issuing a command such as ls -l in an active directory on another client
node can be enough to trigger such a revocation, and thus an unexpected timeout.
When a lock revocation fails in this way, a message similar to the following is shown in the client node log:
2005/09/30 21:02:53 kern i s5 : LustreError:
4952:0:(ldlm_lockd.c:365:ldlm_failed_ast()) ### blocking AST failed (-110): evicting
client b9929_workspace_9803d79af3@NET_0xac160393_UUID NID 0xac160393 (172.22.3.147)
ns: filter-sfsalias-ost203_UUID lock: 40eabb80/0x37e426c2e3b1ac01 lrc: 2/0,0 mode:
PR/PR res: 79613/0 rrc: 2 type: EXT [0->18446744073709551615]
(req 0->18446744073709551615) flags: 10020 remote: 0xc40949dc40637e1f expref: 2 pid: 4940
Tuning Lustre timeout parameters
Several parameters control client operation:
• The Lustre timeout attribute value.
• Two parameters associated with client transactions to each OST service; these parameters are
important for write operations from the client node:
• The /proc/fs/lustre/osc/OSC_*/max_dirty_mb parameter is a client-side parameter
that controls how much dirty data can be created on a client node for each OST service. The
default value of this parameter is 32 (that is, 32MB).
• The /proc/fs/lustre/osc/OSC_*/max_rpcs_in_flight parameter controls the
number of simultaneous RPCs that can be outstanding to a server. The default value of this
parameter is 8.
These two parameters work together to keep the Lustre pipeline full during write operations so that
maximum bandwidth can be obtained.
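For example, the following sketch shows how the two parameters might be inspected and raised on a client
node (the values 64 and 16 are illustrative only, the OSC device names vary by installation, and settings
written through /proc do not persist across a remount):

  # Show the current per-OST limits on the client node
  cat /proc/fs/lustre/osc/OSC_*/max_dirty_mb
  cat /proc/fs/lustre/osc/OSC_*/max_rpcs_in_flight

  # Raise the limits for every OSC device
  for f in /proc/fs/lustre/osc/OSC_*/max_dirty_mb; do echo 64 > $f; done
  for f in /proc/fs/lustre/osc/OSC_*/max_rpcs_in_flight; do echo 16 > $f; done

Larger values allow more dirty data to be buffered and more RPCs to be outstanding per OST, which can
improve write bandwidth, but they also increase the amount of data that must be flushed within the RPC
deadline when a lock is revoked.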