Red Hat GLOBAL FILE SYSTEM 4.7 User Manual, Page 105

Preventing and correcting the problem
You can take action to prevent file access from hanging as described above. If you find that an
application has already hung, you can take corrective action. The preventive and corrective actions are as
follows:
Preventive action
If the ldlm_namespace_cleanup() message is seen on a client node, but the node is performing
normally without any visible hangs, reset the client node at the earliest available opportunity (when
the reset operation does not impact normal system operation). HP recommends that you reset the
node rather than performing a controlled shutdown; this is because unmount operations can cause
the LBUG error described above.
Corrective action
If you find that access to one or more files is hanging, you must reset all client nodes that have printed
an ldlm_namespace_cleanup() message since they were last rebooted. Note that you must reset
all such nodes, not just those nodes where the file is hanging or those nodes involved in the same job.
When the client nodes have been reset, wait for 10 minutes to allow the HP SFS servers enough time
to detect that the client nodes have died and to evict stale state information. After that, you will again
be able to access the file where the problem occurred.
If I/O access to the file still hangs after this delay, stop the Lustre file systems and then start them
again.
Please report such incidents to your HP Customer Support representative so that HP can analyze the
circumstances that caused the problem to occur.
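
The node-selection rule in the corrective action (reset every client that has logged the message since its last reboot, not only the clients where the hang is visible) can be sketched as follows. This is a minimal illustration, not an HP-supplied tool: the function name, the dictionary input, and the sample log text are hypothetical, and it assumes you have already collected each client's log text since its last reboot.

```python
# Function name that appears in the message, as named in this section.
# The surrounding log line layout is NOT specified here, so only the name is matched.
CLEANUP_MARKER = "ldlm_namespace_cleanup"


def nodes_to_reset(logs_since_boot):
    """Return the sorted names of client nodes that must be reset.

    logs_since_boot: dict mapping a client node name to the log text that
    node has produced since it was last rebooted (hypothetical input shape).
    """
    return sorted(
        node
        for node, text in logs_since_boot.items()
        if CLEANUP_MARKER in text
    )


if __name__ == "__main__":
    # Hypothetical log excerpts, for illustration only.
    logs = {
        "client1": "... ldlm_namespace_cleanup() ...",
        "client2": "no relevant messages",
        "client3": "... ldlm_namespace_cleanup() ...",
    }
    for node in nodes_to_reset(logs):
        print(node)
```

Every node that is returned must be reset, including nodes that show no hang and nodes that are not involved in the affected job.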
7.3.6 Troubleshooting a dual Gigabit Ethernet interconnect
When a dual Gigabit Ethernet configuration is in place and a Lustre file system has been mounted on a
client node, there are a number of commands that you can use to verify the connectivity between client
nodes and the HP SFS servers and also to ensure that the connections are performing correctly.
To ensure that the client is aware of all of the server links that it needs to be able to connect to, enter the lctl
command on the client node, as follows:
# lctl --net tcp peer_list
The output from the command varies depending on the configuration of the network. The format of the output
is as follows:
12345-server_LNET [digit]local_addr->remote_addr:remote_port connections
where:
server_LNET              Shows the lnet: specification of the server.
local_addr->remote_addr  Shows the addresses of the local and remote hosts.
remote_port              Shows the port of the acceptor daemon; the default value is 988.
connections              Shows the number of active connections to the peer. For each
                         peer, the correct number is three, because an outbound
                         connection, an inbound connection, and a control connection
                         are made for each peer.
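
As an illustration of how the fields described above could be checked automatically, the following sketch parses saved peer_list output and reports peers whose connection count is not three. It is a sketch under stated assumptions rather than a definitive parser: the regular expression follows the format line given above, the optional `#` before the connection count is an assumption about how the count is printed, and the addresses in the sample text are hypothetical.

```python
import re
from typing import NamedTuple, Optional

EXPECTED_CONNECTIONS = 3  # outbound, inbound, and control connection per peer

# Assumed line shape, taken from the format description:
#   12345-server_LNET [digit]local_addr->remote_addr:remote_port connections
# The optional '#' before the count is an assumption, not confirmed by the text.
PEER_LINE = re.compile(
    r"^12345-(?P<server_lnet>\S+)\s+"
    r"\[(?P<digit>\d+)\]"
    r"(?P<local_addr>\S+?)->(?P<remote_addr>\S+):(?P<remote_port>\d+)\s+"
    r"#?(?P<connections>\d+)\s*$"
)


class Peer(NamedTuple):
    server_lnet: str
    local_addr: str
    remote_addr: str
    remote_port: int
    connections: int


def parse_peer_line(line: str) -> Optional[Peer]:
    """Parse one line of output; return None if the line does not match."""
    m = PEER_LINE.match(line.strip())
    if m is None:
        return None
    return Peer(
        m["server_lnet"],
        m["local_addr"],
        m["remote_addr"],
        int(m["remote_port"]),
        int(m["connections"]),
    )


def peers_with_wrong_connection_count(output: str) -> list:
    """Return the parsed peers whose connection count is not three."""
    peers = (parse_peer_line(line) for line in output.splitlines())
    return [p for p in peers if p is not None and p.connections != EXPECTED_CONNECTIONS]


# Hypothetical sample text, for illustration only.
SAMPLE = """\
12345-10.0.0.2@tcp [1]10.0.0.10->10.0.0.2:988 #3
12345-10.0.1.2@tcp [1]10.0.1.10->10.0.1.2:988 #2
"""

if __name__ == "__main__":
    for peer in peers_with_wrong_connection_count(SAMPLE):
        print(peer.remote_addr, peer.connections)
```

Lines that do not match the assumed shape are skipped rather than reported, so unmatched output should be inspected manually.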