Contents
3. 2.1 Maintenance Tools Overview
   2.2 Maintenance Administration Tools
      2.2.1 Managing Consoles through Serial Connections (conman, ipmitool)
      2.2.2 Stopping/Starting the Cluster (nsclusterstop, nsclusterstart)
      2.2.3 Managing [...]
      2.2.4 Remote Hardware Management CLI (NS Commands)
      2.2.5 Managing System Logs (syslog-ng)
      2.2.6 Upgrading Emulex HBA Firmware with lptools
   2.3 Saving and Restoring the System (BSBR)
      2.3.1 Installing [...]
      2.3.2 Configuring [...]
      2.3.3 Backing up a System
      [...]
   2.4 Monitoring Maintenance [...]
      2.4.1 Checking the status of InfiniBand Networks (ibstatus [...])
      2.4.2 Diagnosing InfiniBand Fabric Problems (IBS)
      2.4.3 Monitoring Voltaire Switches (switchname)
      2.4.4 Getting Information about Storage Devices (lsiocfg)
      2.4.5 Checking [...]
4.    3.7.3 Jobs are not getting scheduled
      3.7.4 Nodes are getting set to a DOWN state
      3.7.5 Networking and Configuration Problems
      3.7.6 More Information
   3.8 FLEXlm License Manager Troubleshooting
      3.8.1 Entering License File Data
      3.8.2 Using the [...]
      3.8.3 Using INTEL_LMD_DEBUG Environment Variable
   Chapter 4 Updating the firmware for the InfiniBand switches
   4.1 Checking which Firmware Version is running
   4.2 Configuring FTP for the firmware upgrades [...]
      4.2.1 Installing the FTP Server
      4.2.2 Configuring the FTP server options for the InfiniBand [...]
   4.3 Upgrading [...]
   Chapter 5 Updating the firmware for the MegaRAID Card
   BAS5 for Xeon Maintenance Guide
   Chapter 6 Accessing, Updating and Reconfiguring the BMC Firmware on NovaScale R4xx machines
6. [Figure 2-2. Example of IBS command bandwidth action output; the screen contents are not recoverable]
9. [Command output residue: port information listing. Recoverable fields: link speed 2.5 Gbps, link state Active, PhysLinkState LinkUp, LinkDownDefState Polling, LinkSpeedEnabled 2.5 Gbps, ClientReregister 0; the remaining fields are not recoverable]
switchinfo example
An example of
10. c. If slurmctld is running but not responding (a very rare situation), kill and restart it as the root user using the following commands:
      service slurm stop
      service slurm start
d. If it hangs again, increase the verbosity of debug messages by increasing SlurmctldDebug in the slurm.conf file, and restart. Again, check the log file for an indication of why it failed.
5. If SLURM continues to fail without an indication of the failure mode, stop the service, add the controller option -c to the /etc/slurm/slurm.sh script as shown below, and restart:
      service slurm stop
      SLURM_OPTIONS_CONTROLLER=-c
      service slurm start
Note: All running jobs and other state information will be lost when using this option.
3.7.3 Jobs not getting scheduled
1. This is dependent upon the scheduler used by SLURM. Run the following command to identify the scheduler:
      scontrol show config | grep SchedulerType
   See the Bull HPC Administrator's Guide for a description of the different scheduler types.
2. For any scheduler, the priorities of jobs can be checked using the following command:
      scontrol show job
3.7.4 Nodes are getting set to a DOWN state
1. Determine why the node is down using the following command:
      scontrol show node <name>
   This shows the reason why the node was set as down and the time when this happened. If there is insufficient disk space mem
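The scheduler check described above can be wrapped in a small helper. This is a minimal sketch, not part of the Bull tooling: the function name scheduler_type is hypothetical, and it assumes that scontrol show config prints one "Name = value" pair per line.

```shell
# Hypothetical helper: read `scontrol show config` output on stdin and
# print only the value of the SchedulerType parameter.
scheduler_type() {
    awk -F'=' '/^SchedulerType[ \t]*=/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the value
        print $2
    }'
}

# Typical use on a live cluster (not executed here):
#   scontrol show config | scheduler_type
```

Matching on the full parameter name followed by "=" avoids picking up other parameters whose names merely begin with "Scheduler".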
13. Possible sources are as follows:
   unix-stream(<filename>)   Stream pipes used in Linux.
   file(<filename>)   File data (Linux kernel messages, for example).
   pipe(<filename>)   Named pipes (for interfacing with Nagios, for example).
   tcp(<ip>,<port>) and udp(<ip>,<port>)   To listen on an address and a port.
   internal()   syslog-ng internal messages.
destination Section
This section defines the destination of the logs.
Syntax:
   destination <identifier> { destination-driver(params); destination-driver(params); ... };
The possible destinations are the following:
   file(<filename>)   To send to a file.
   tcp(<ip>,<port>) and udp(<ip>,<port>)   To send the logs over the network to another machine.
   unix-stream(<filename>)   To send to stream pipes used in Linux.
   usertty(<user>)   To send to the <user> consoles, but only if this user is connected. You can use the * character to specify that the messages have to be sent to all users.
   program(<commandtorun>)   To send to a program.
Examples
You can specify several destination directives in a destination section, as in the following example:
   destination debug { file("/var/log/debug.log"); };
   destination messages { file("/var/log/messages.log"); };
   destination console { usertty("root"); };
   destination xconsole { pipe("/dev/xconsole"); };
   destination mail2admin
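As an illustration of how a destination is put to use, the sketch below prints a file destination together with a log statement that routes a source to it. The helper names file_destination and log_route are hypothetical, the source identifier src is an assumption, and the generated text is meant to be reviewed before it is copied into syslog-ng.conf.

```shell
# Hypothetical helper: print a syslog-ng file destination stanza.
#   $1: destination identifier   $2: absolute path of the log file
file_destination() {
    printf 'destination %s { file("%s"); };\n' "$1" "$2"
}

# Hypothetical helper: print a log statement connecting a source to a destination.
#   $1: source identifier   $2: destination identifier
log_route() {
    printf 'log { source(%s); destination(%s); };\n' "$1" "$2"
}

# Example: send everything read by the (assumed) source "src" to a debug file.
file_destination debug /var/log/debug.log
log_route src debug
```

A destination only receives messages once a log statement references it, which is why the two stanzas are generated together.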
14. [Figure 2-1. Example of IBS command topo action output; the screen contents are not recoverable]
Use the command below to obtain the fabric topology using the data stored in the IBS database. The hostnames and traffic counters are updated using the OFED tools.
   ibs -a topo
18. Tool                             Function
   Remote Hardware Management CLI   Managing hardware (power on, power off, reset, status, ping, checking temperature, changing BIOS, etc.)
   syslog-ng                        System log management
   lptools (lputils)                Upgrading Emulex HBA (Host Bus Adapter) firmware
   BSBR                             Backing up and restoring data (based on mkCDrec)
   ibstatus, ibstat                 Monitoring InfiniBand networks
   IBS tool                         Providing information about and configuring InfiniBand switches
   switchname                       Monitoring Voltaire switches
   lsiocfg                          Getting information about storage devices
   pingcheck                        Checking device power state
   ibdoctor, ibtracert              Identifying InfiniBand network problems
   crash, proc, kdump               Runtime debugging and dump tool
   postbootchecker                  Making verifications on nodes as they start
Table 2-1. Maintenance Tools
2.2 Maintenance Administration Tools
2.2.1 Managing Consoles through Serial Connections (conman, ipmitool)
The serial lines of the servers are the communication channel to the firmware and enable access to the low-level features of the system. This is why they play an important role in the surveillance of system initialization, and in taking control if there is a crash or a debugging operation is undertaken. The serial lines are brought together with Ethernet/Serial port concentrators so that they are available from the Management Node. ConMan can be used as a
19. 1 R423, R440 and R460 machines
6.1 The Baseboard Management Controller (BMC)
The Baseboard Management Controller (BMC) is used to monitor the hardware sensors for temperature, cooling fan speeds, power mode, etc., and to report any hardware errors by sending alerts. It is also used for basic system management operations such as starting, stopping and resetting a cluster, and it provides a remote console on the cluster nodes via Serial over LAN (SOL) access. The BMC is the intelligence in the Intelligent Platform Management Interface (IPMI) architecture: it manages the interface between system management software and platform hardware.
There are several ways to access the BMC of a machine.
Local access to the BMC
The BMC of the local machine can be accessed using the ipmitool command. See Chapter 2 in this manual, or the man page, for more information. The IPMI service must be started to access the local BMC via the IPMI driver:
   service ipmi start
Examples
1. To obtain the BMC LAN configuration on a local NovaScale R42x machine (channel 1), run the command below:
      ipmitool lan print 1
2. To obtain the BMC LAN configuration on a local NovaScale R440 or R460 machine (channel 2), run the command below:
      ipmitool lan print 2
6.1.2 Remote access to the BMC
6.1.2.1 Command Line Remote access
The
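The two examples above differ only in the LAN channel number, so the choice can be captured in a helper. This is a sketch under assumptions: the function name bmc_lan_channel is hypothetical, and the mapping (channel 1 for R42x models, channel 2 for R440 and R460) is taken from the two examples only and should be checked against the hardware documentation for any other model.

```shell
# Hypothetical helper: print the BMC LAN channel for a NovaScale model name.
# Mapping assumed from the examples: R42x -> 1, R440/R460 -> 2.
bmc_lan_channel() {
    case "$1" in
        R440|R460) echo 2 ;;
        R42*)      echo 1 ;;
        *)         echo "unknown model: $1" >&2; return 1 ;;
    esac
}

# Typical use on a local machine (not executed here):
#   ipmitool lan print "$(bmc_lan_channel R440)"
```

Returning a non-zero status for an unknown model keeps a script from silently querying the wrong channel.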
20. 3 Direct Cache Access Intel R Virtualization Technology Intel EIST support KBC Clock Input Serial port A Base I O address Serial port A Device Interrupt Serial port A Configuration Serial port B Mode Base I O address Serial port B 2F8 Interrupt Serial port IRQ 3 DMI Event Event Logging Enabled Com Port Address On board COM B Baud Rate 115 2K Console Console Type VI 100 Redirection Flow Control None Console connection Continue C R after POST Hardware Fan Speed Control Modes eo Monitor System Event Logging Enabled Clear System Event Log Disabled SYS Firmware Progress Disabled PMI BIOS POST Errors Enabled BIOS POST Watchdog Disabled OS boot Watchdog Disabled Timer for loading OS min 10 Time out action No Action Security Supervisor Password 15 Clear User Password ls Clear Managing the BIOS on NovaScale R4xx Machines 7 15 Taser on boot Disabled 7 16 2 3 4 Boot 5 6 7 BASS for Xeon Maintenance Guide USB FDC USB CDROM USB KEY USB HDD USB LS120 PepperC Virtual disc PCI BEV IBA GE Slot 0500 v1270 IDE 2 WDC 1600 5 015 1 SO 7 4 6 NovaScale R423 BIOS Settings X7DWN R423 mainboard BIOS 1 1 7DWNC4308 BIOS setup section Advanced Main Boot Features Memory Cache PCI Configuration System Time System Date Legacy diskette A IDE Channel O Master IDE Channel Slave SATA Port O SATA Port 1 SATA Port 2
21. IBS command actions
topo
The topo action for the -a option provides topology details for the switch.
   ibs -s <switch name> -a topo -NE
The output includes a description of the switches, the hostnames, the GUID and the LID for the nodes, and the physical location of the switches. The port details, including any errors, are shown in the bottom half of the screen, both for local ports and for remotely connected ports (see the screen example in Figure 2-1).
22. Structure
This guide is organized as follows:
   Chapter 1   Stopping/Restarting Procedures
               Describes procedures for stopping and restarting Bull HPC cluster components.
   Chapter 2   Day to Day Maintenance Operations
               Describes how to undertake different types of maintenance operations using the set of maintenance tools provided with Bull HPC clusters.
   Chapter 3   Troubleshooting
               This chapter aims to help the user develop a general, comprehensive methodology for identifying and solving problems on and off site.
   Chapter 4   Updating the firmware for the InfiniBand switches
               Describes how to update the Voltaire switch firmware.
   Chapter 5   Updating the firmware for the MegaRAID Card
               Describes how to update the firmware for the MegaRAID card.
   Chapter 6   Accessing, Updating and Reconfiguring the BMC Firmware on NovaScale R4xx machines
               Describes how to update the BMC firmware on NovaScale R4xx machines.
   Chapter 7   Managing the BIOS on NovaScale R4xx Machines
               Describes how to update the BIOS on NovaScale R421 and R422 machines. It also defines the recommended settings for the BIOS parameters on NovaScale R4xx machines.
   Glossary and Acronyms
               Lists the acronyms used in the manual.
Bibliography
Refer to the manuals included on the documentation CD delivered with your system, or download the latest manuals for your Bull Advanced Server (BAS) release and for your cluster hardware from http
24. 2 17 N nb cpu total 2 44 nec admin command 3 15 nec admin conf file 3 15 NovaScale RA21 BIOS settings 7 6 NovaScale R421 E1 BIOS settings 7 9 NovaScale R422 BIOS settings 7 11 NovaScale R422 E1 BIOS settings 7 14 NovaScale R423 BIOS settings 7 17 NovaScale R440 BIOS settings 7 26 NovaScale R444 BIOS settings 7 24 NovaScale R460 BIOS settings 7 28 7 30 nsclusterstart command 1 4 2 5 nsclusterstop command 1 4 2 5 nsctrl command 1 1 1 2 2 7 O openib command 3 5 P perfquery command 3 7 phpPgAdmin interface 1 3 pingcheck command 2 37 postbootchecker 2 44 power state getting information 2 37 printk code 2 43 R Remote Hardware Management CLI 2 8 restoring the system 2 16 saving the system 2 16 SINFO command 1 1 SLURM troubleshooting 3 23 smpquery command 3 5 SOL Serial Over Lan 2 4 starting Backbone switch 1 3 Ethernet switch 1 3 HPC cluster 1 4 node 1 2 stopping Backbone switch 1 3 Ethernet switch 1 3 HPC cluster 1 4 node 1 1 storage troubleshooting 3 13 storage device getting information 2 34 storageadmin conf file 3 13 storioha command 3 20 stormap command 3 21 switchname command 2 31 syslog 3 21 syslog ng 2 9 syslog ng conf file 2 9 system logs managing 2 9 trace levels storage 3 13 Index 1 3 trace log storage 3 13 U troubleshooting FDA storage system 3 15 ulimit c
25. [...]5 and nova[...], the I/O nodes are nova[...] and nova10. nova5 and nova6 have been deactivated, so their services have migrated to their pair nodes (nova9 and nova10).
   lustre_migrate nodestat
   HA paired nodes status
   node name   node status   HA node name   node status
   nova5       MIGRATED      nova9          OK
   nova6       MIGRATED      nova10         OK
Note: This table is updated by the lustre_check command.
   lustre_migrate hastat -n <node name>
This command indicates how the Lustre failover services are dispatched after CSA software has been activated. Each node has a view of the paired failover services: the failover service dedicated to the node and the failover service dedicated to its pair node. If the pair node has switched roles, the owner column of the command output shows that this node supports the two Lustre HA services.
In the following example, nova6 and nova10 are paired I/O nodes. The lustre_nova6 service is started on nova10 (owner node). This status is consistent on both the nova6 and nova10 nodes.
   lustre_migrate hastat -n nova[6,10]
   Member Status: Quorate, Group Member
   Member Name   State    ID
   nova6         Online   0x0000000000000001
   nova10        Online   0x0000000000000002
   Service Name    Owner    Last State
   lustre_nova10   nova10   started
   lustre_nova6    nova10   started
   Member Status: Quorate, Group Member
   Member Name   State    ID
   nova10        Online   0x000000000000
26. Boot Features Power Button Behaviour Instant Off Resume On Modem Ring Off Power Loss Control Stay OFF Watch Dog Disabled Summary screen Disabled Cache System BIOS area Cache Video BIOS area Cache Base 0 512k Cache Base 512k 640k Cache Extended Memory Area Discrete MTRR Allocation PCI Onboard G LAN1 OPROM Configure Configuration Onboard G LAN2 OPROM Configure Default Primary Video Adapter Emulated IRQ Solution PCl e Performance Write Protect Write Protect Write Back Write Back Write Back Disabled Enabled Disabled Onboard Disabled Payload 256B Disabled Onboard First Disabled PCI Parity Error Forwarding ROM Scan Ordering PCI Fast Delayed Transaction Reset Configuration Data No Frequency for PCIX 1 2 MASS Auto Enabled Enabled Default Enabled Enabled Default Enabled Enabled Default Enabled Option ROM Scan Enable Master Latency Timer Option ROM Scan Enable Master Latency Timer Option ROM Scan Enable Master Latency Timer Option ROM Scan Enable Master Enabled Latency Timer Default SLOTA PCI Exp x8 Option ROM Scan Enabled SLOTI PCI X 100MHz SLOT2 PCI X 100MHz ZCR SLOT2 PCI X 100MHz ZCR SLOT3 PCI Exp x8 7 6 BASS for Xeon Maintenance Guide Enable Master Enabled RR Option ROM Scan Enabled SLOT5 x8 Enable Master Enabled Latency Timer Default SERR signal condition Single bit AGB PCI Hole Granularity 256 MB Memory Branch
27. Enabled SLOT6 PCI Exp x16 Enable Master Enabled SLOT1 PCI X 100 133MHz SLOT2 PCI X 133MHz Latency Timer Default large Disk Access Mode
Advanced Chipset Control
   SERR signal condition                 Single bit
   Clock Spectrum Feature                Disabled
   Intel VT for Directed I/O             Disabled
   4GB PCI Hole Granularity              256 MB
   Memory Voltage                        Auto
   Memory Branch Mode                    Interleave
   Branch 0 Rank Interleave              4:1
   Branch 0 Rank Sparing                 Disabled
   Branch 1 Rank Interleave              4:1
   Branch 1 Rank Sparing                 Disabled
   Enhanced x8 Detection                 Enabled
   Demand Scrub                          Enabled
   High Temp DRAM OP                     Disabled
   AMB Thermal Sensor                    Disabled
   Thermal Throttle                      Disabled
   Global Activation Throttle            Disabled
   Force ITK Config Clocking             Disabled
   Snoop Filter                          Enabled
   Crystal Beach Feature                 Enabled
   HD Audio Controller                   Auto
   Route Port 80h cycles to              LPC
   Clock Spectrum Feature                Disabled
   High Precision Event Timer            No
   USB Function                          Enabled
   Legacy USB Support                    Enabled
Advanced Processor Options
   Frequency Ratio                       Default
   Core Multi-Processing                 Enabled
   Machine Checking                      Enabled
   Fast String operations                Enabled
   Thermal Management 2                  Enabled
   C1/C2 Enhanced Mode                   Disabled
   Execute Disable Bit                   Enabled
   Adjacent Cache Line Prefetch          Enabled
   Hardware Prefetcher                   Enabled
   Set Max Ext CPUID = 3                 Disabled
   Direct Cache Access                   Disabled
   Intel(R) Virtualization Technology    Disabled
   SMRR Control                          Disabled
   Intel EIST support                    Disabled
   KBC Clock Input                       12MHz
   Serial port A                         Enab
28. -NE
Use the command below to dump the fabric topology using the local map files test/NetworkMap.xml and test/portcounters.csv. The data read from these files is updated using the OFED tools.
   ibs -l -f test/NetworkMap.xml -c test/portcounters.csv -a topo -NE
bandwidth
The syntax for the bandwidth action is shown below. This action is very useful when benchmarking, in order to monitor the performance of the switch and to identify any bottlenecks.
   ibs -s <switch name> -a bandwidth -NE
Details of the packets sent and received by the switch, for both local and remote connections, are displayed as shown in Figure 2-2.
errors
The errors action produces a short report containing details of the faulty links for a switch. This is very useful for troubleshooting and helps to pinpoint any problems with the interconnects.
   ibs -s <switch name> -a errors -NE
This gives output similar to that shown in Figure 2-3. EPM indicates the error rate in the form of Errors per Million packets sent.
See FAQ ID F10040, "How to debug and clear InfiniBand fabric errors using FVM PM Counters CSV file", available from www.voltaire.com, for details of the different Port Counter error messages.
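The EPM figure can be reproduced by hand from a pair of counters. The sketch below assumes the straightforward reading of the definition above (errors divided by packets sent, scaled to one million); the function name epm is hypothetical, and the exact rounding used by the ibs tool is not known.

```shell
# Hypothetical helper: errors per million packets.
#   $1: error count   $2: number of packets sent
epm() {
    awk -v e="$1" -v p="$2" 'BEGIN {
        if (p <= 0) { print "n/a"; exit }   # no traffic: rate undefined
        printf "%.2f\n", e * 1000000 / p
    }'
}

# Example: 3 errors over 2,000,000 packets.
epm 3 2000000
```

Normalizing by the packet count makes links with very different traffic volumes comparable, which a raw error counter does not.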
29. # Wait 10s before reconnecting if the connection failed. Used when logs are centralized through the network.
   time_reap(number);        # Closes a log file that is not accessed after number seconds
   log_fifo_size(1000);      # Number of event lines stored before writing them. Enables events to be taken into account quickly and frees the process that has generated them
   long_hostnames(off);      # Usage of long names
   use_dns(no);              # Usage of DNS to find addresses
   use_fqdn(no);             # Usage of machine short name
   owner(root);              # Logs owner
   group(root);              # Logs group
   perm(644);                # Logs rights mask
   keep_hostname(yes);
   create_dirs(yes);         # Create directories for log storage
   use_time_recvd(no);       # Local time will be used instead of the time written in the logs
   gc_idle_threshold(100);   # The garbage collector is started after 100 events if syslog-ng is inactive
   gc_busy_threshold(100);   # The garbage collector is started after 3000 events if syslog-ng is active
source Section
The source section defines the log source from among the following: network, local files, peripheral, pipe, stream.
Syntax:
   source <identifier> { source-driver(params); source-driver(params); ... };
For example, the following lines are suitable for a Linux system. They enable the /dev/log stream to be read, syslog-ng internal messages to be received, and kernel starting messages to be handled:
   source src { unix-stream("/dev/log"); internal(); file("/proc/kmsg"); };
30. Xeon Administrator's Guide for more information on SLURM and security.
Check that a consistent version of SLURM exists on all of the nodes by running one of the following commands:
   sinfo -V
or
   rpm -qa | grep slurm
If the first two digits of the version number match, it should work fine. However, version 1.1 commands will not work with version 1.2 daemons, or vice versa. Errors can result unless all these conditions are true.
Each node must be synchronized to the correct time. Communication errors occur if the node clocks differ. Execute the following command to confirm that all nodes display the same time:
   pdsh -a date
To check a group of nodes, use the following command:
   pdsh -w <node list> date
A difference of a few seconds is inconsequential, but SLURM is unable to recognize the credentials of nodes that are more than 5 minutes out of synchronization. See the Bull HPC BAS5 for Xeon Installation and Configuration Guide for information on setting node times using the NTP protocol.
3.7.6 More Information
For more information on SLURM troubleshooting, see the Bull HPC BAS5 for Xeon Administrator's Guide, the Bull HPC BAS5 for Xeon User's Guide and http://www.llnl.gov/linux/slurm/slurm.html
3.8 FLEXlm License Manager Troubleshooting
3.8.1 Entering License File Data
You can edit the hostname on the server line (first argument), the port address (third argument), th
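The SLURM version and clock checks described in this excerpt can be sketched as two helpers. Both names (same_release, skew_within) are hypothetical; the sketch assumes dotted version strings such as 1.2.10 and one epoch timestamp per line, and the 300 second limit in the usage line corresponds to the 5 minute tolerance quoted above.

```shell
# Hypothetical helper: succeed when two version strings share the same
# first two components (for example 1.2.3 and 1.2.10).
same_release() {
    a=$(printf '%s' "$1" | cut -d. -f1,2)
    b=$(printf '%s' "$2" | cut -d. -f1,2)
    [ "$a" = "$b" ]
}

# Hypothetical helper: read epoch timestamps (one per line) on stdin and
# succeed when the spread between earliest and latest is at most $1 seconds.
skew_within() {
    awk -v limit="$1" '
        NR == 1  { min = $1; max = $1 }
        $1 < min { min = $1 }
        $1 > max { max = $1 }
        END      { exit (NR > 0 && max - min <= limit) ? 0 : 1 }'
}

# Typical use on a live cluster (not executed here); pdsh prefixes each
# line with "node:", so the timestamp is the second field:
#   pdsh -a date +%s | awk '{ print $2 }' | skew_within 300
```

Because pdsh does not query every node at exactly the same instant, a small apparent spread is expected even on a well synchronized cluster.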
31. dbupdate
Use the dbupdate action to update an existing IBSDB database. In the example below, the topology and traffic counter details for the iswu0c0-0 managed switch are updated from the Management Node using the OFED tools.
   ibs -s iswu0c0-0 -a dbupdate -NE
To ensure that the data is always up to date, add the following line to the cron table (using crontab):
   */10 * * * * PATH=/usr/local/ofed/bin:$PATH /usr/bin/ibs -s iswu0c0-0 -a dbupdate -vNE >> /var/log/ibs.log 2>&1
The traffic and error counters, as well as the InfiniBand equipment stored in the IBS database, will be refreshed every 10 minutes using the data supplied by the iswu0c0-0 switch.
Note: For InfiniBand clusters that include multiple managed switches, the user needs to know which switch is running the subnet manager as master. This switch should always be the one specified as the argument of the -s flag. If the data is refreshed by the cron daemon and another switch becomes the subnet manager master, the details contained in the database will be incorrect, because the cron script will still be reading data from what is now the slave switch.
Use the sminfo command as follows to find out which subnet manager is running as the master. Output in a form similar to that below will be provided:
   sminfo sm lid 1 sm guid 0x8f1040041254a activity
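To compare the current master with the switch named in the cron entry, the LID can be pulled out of the sminfo output. The sketch below is an assumption-laden illustration: the function name sm_lid is hypothetical, and it relies only on the "sm lid <number>" fragment visible in the sample output above.

```shell
# Hypothetical helper: read sminfo output on stdin and print the LID of
# the subnet manager (the number following "sm lid").
sm_lid() {
    sed -n 's/.*sm lid \([0-9][0-9]*\).*/\1/p'
}

# Typical use on a live cluster (not executed here):
#   sminfo | sm_lid
```

If the printed LID no longer matches the LID of the switch passed to the -s flag, the cron entry is reading from a switch that is no longer the master.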
32. increasing SlurmdDebug in the slurm.conf file and restart. Again, check the log file for an indication of why it failed. 8. If the node is still not responding and there is no indication of the failure mode, stop the service, add the daemon option -c to the /etc/slurm/slurm.sh script as shown below, and restart:
    service slurm stop
    SLURM_OPTIONS_DAEMONS="-c"
    service slurm start
Note: All running jobs and other state information will be lost when using this option. Networking and Configuration Problems. 1. Use the following command to examine the status of the nodes and partitions:
    sinfo --all
2. Use the following commands to confirm that the control daemons are up and running on all nodes:
    scontrol ping
    scontrol show node
3. Check the controller and/or slurmd log files (SlurmctldLog and SlurmdLog in the slurm.conf file) for an indication of why a particular node is failing. 4. Check for consistent slurm.conf and credential files on the node(s) experiencing problems. 5. If the problem is user specific, check that the user is configured on the Management Node as well as on the Compute Nodes. The user does not need to be able to log in, but the user ID must exist. User authentication must be available on every node. If not, non-root users will be unable to run jobs. Verify that the security mechanism is in place; see chapter 6 in the Bull HPC BAS5 for
33. is already running and you have tried to start it twice. Sometimes it means that another program is using this TCP port number. The number is listed on the SERVER line in the license file as the last item. You can change the number and restart lmgrd.intel, but only do this if you do not already have an lmgrd.intel running for this license file.
INTEL cannot initialise: INTEL FLEXlm version 7.2 lmgrd. Please correct problem and restart daemons.
You may be starting lmgrd.intel from the wrong directory or with relative paths. Use the following lines in the start-up, and add a full root path to INTEL to the end of the VENDOR line in the license file:
    cd <installation directory>
    `pwd`/lmgrd.intel -c `pwd`/server.lic -l `pwd`/lmgrd.intel.log
License manager cannot initialize: Cannot find license file.
You have started lmgrd.intel on a non-existent file. The recommended way to specify the file for lmgrd.intel to use is the -c <license> option:
    cd <installation directory>
    `pwd`/lmgrd.intel -c `pwd`/server.lic -l `pwd`/lmgrd.intel.log
Invalid license key (inconsistent encryption code for FEATURE).
This happens for 3 different reasons. 1. The license file has been typed in incorrectly. Cutting and pasting from email is a safe way to avoid this. Alternatively, the data may have been altered by the end user. See Entering License File Data above. 2. The license is generated incorrectly. Your vendor will have
34. -n all
    lustre_util stop -f all
    lustre_migrate hastop
    service ldap stop
    storedepha -c stop -a
To stop the ganglia service, run the following commands:
    service gmond stop
    service gmetad stop
To stop the postgresql service, run the following command:
    service postgresql stop
Do not forget to restart the stopped services after the backup is complete. 2.3.3.3 Creating the Backup. Carry out these operations on the Management Node. 1. Log on as the root user, preferably in single mode. 2. Go to the base directory (by default this is /var/opt/mkcdrec):
    cd /var/opt/mkcdrec
3. Check that the system is operational:
    make test
Warning messages are displayed if some elements are missing for the backup. If this happens, make the appropriate corrections and rerun make test until the test is successful. Note: Ignore the "/bin/mt not found" warning message issued by test2 if there is no tape drive. 4. Launch the backup operation:
    bsbr
A menu is displayed similar to that below:
    Enter your selection:
    1) Create rescue CD-ROM only (no backups)
    2) Create ISO backup images in /tmp to burn on CD-ROM or DVD
    3) Create backup on disk (mounted hard disk, NFS mount point, SMB mount point)
    4) Create backup on tape device /dev/nst0
    5) Quit
    Please choose from the above list [1-5]:
Select one of the options displayed (1 to 5) an
35. support.bull.com. The Bull BAS5 for Xeon Documentation CD-ROM (86 A2 91EW) includes the following manuals:
    • Bull HPC BAS5 for Xeon Installation and Configuration Guide (86 A2 87EW)
    • Bull HPC BAS5 for Xeon Administrator's Guide (86 A2 88EW)
    • Bull HPC BAS5 for Xeon User's Guide (86 A2 89EW)
    • Bull HPC BAS5 for Xeon Maintenance Guide (86 A2 90)
    • Bull HPC BAS5 for Xeon Application Tuning Guide (86 A2 16FA)
    • Bull HPC BAS5 for Xeon High Availability Guide (86 A2 21)
The following document is delivered separately:
    • Software Release Bulletin, SRB (86 A2 71)
The Software Release Bulletin contains the latest information for your BAS delivery. This should be read first. Contact your support representative for more information. In addition, refer to the following:
    • Voltaire Switches Documentation CD (86 2 79ET)
    • NovaScale Master documentation
For clusters which use the PBS Professional Batch Manager:
    • PBS Professional 9.2 Administrator's Guide (on the PBS Professional CD-ROM)
    • PBS Professional 9.2 User's Guide (on the PBS Professional CD-ROM)
Highlighting:
    • Commands entered by the user are in a frame in Courier font, as shown below: mkdir /var/lib/newdir
    • System messages displayed on the screen are in Courier New font between 2 dotted lines, as shown below.
    • Values to be entered in by the user are in Courier New for
36. [Illegible scanned screen output: a rotated listing of InfiniBand switch ports with their port numbers, LIDs and link states. The values are not recoverable.]
37. 3.2.3 ibnetdiscover and ibchecknet ...
    3.2.4 ibcheckwidth and ibcheckportwidth
    3.2.5 More Information
    3.3 Node Deployment ...
    3.3.1 ... deployment accounting ...
    3.3.2 Possible Deployment Problems
    3.4 Storage Troubleshooting
    3.4.1 Management Tools Troubleshooting
    3.5 Lustre Troubleshooting
    3.5.1 ...
    3.5.2 Suspected File System Bug
    3.5.3 Cannot re-install a Lustre File System if the status is ...
    3.5.4 Cannot create file from client
    3.5.5 No ...
    3.6 Lustre File System High Availability Troubleshooting
    3.6.1 On the Management Node
    3.6.2 Nodes of an ...
    3.7 ... Troubleshooting ... SLURM ...
38. Disabled Configure SAS as SW RAID Disabled Serial A Enable Enabled Address 3F8 Serial Ports IRQ 4 Configuration Serial B Enable Enabled Address 2F8 IRQ 3 USB Controller Enabled Legacy USB Support Enabled USB Port 60 64 emulation Disabled Configuration Device reset Timeout 20s Storage Emulation Auto USB 2 0 Controller Enabled PCI Memory mapped I O start addr 2 00GB Configuration Memory mapped O above 4GB Disabled Onboard video Enabled Dual Monitor Video Disabled Onboard NIC1 ROM Enabled Onboard NIC2 ROM Disabled Module NIC ROM Disabled Managing the BIOS on NovaScale R4xx Machines 7 9 accoustic amp Perf Security Server Management 7 10 Boot Options Throttling mode Administrator password User Password Front panel lockout Assert NMI on SERR Assert NMI on PERR Resume on AC Power Loss Windows hw error architecture FRB 2 Enable OS boot Watchdog BMC PLUG amp Play detection Console Redirection Boot Timeout Boot Option 1 Boot Option 2 Boot Option 3 Boot Option 4 Hard Disk Order Boot Option Retry BASS for Xeon Maintenance Guide Console Redirection Flow Control Baud Rate Terminal Type legacy OS Redirection hard disk 1 hard disk 2 hard disk 3 Network Device Order 22122 d network device 2 Enabled Closed Loop Not Installed Not Installed Disabled Enabled Enabled Stay off Enabled Enabled Disabled Disabled Ser
39. General Public License
    GUI: Graphical User Interface
    GUID: Globally Unique Identifier
    HBA: Host Bus Adapter
    HPC: High Performance Computing
    IPMI: Intelligent Platform Management Interface
    K
    KSIS: Utility for Image Building and Deployment
    L
    LAN: Local Area Network
    LDAP: Lightweight Directory Access Protocol
    LUN: Logical Unit Number
    M
    MAC: Media Access Control (address)
    MPI: Message Passing Interface
    N
    NFS: Network File System
    NIS: Network Information Service
    NS: NovaScale
    NTP: Network Time Protocol
    P
    PCI: Peripheral Component Interconnect (Intel)
    R
    RAID: Redundant Array of Independent Disks
    S
    SCSI: Small Computer System Interface
    SLURM: Simple Linux Utility for Resource Management
    SMP: Symmetric Multi Processing
    SMT: Symmetric Multi Threading
    SNMP: Simple Network Management Protocol
    SOL: Serial Over LAN
    SSH: Secure Shell
    T
    TCP: Transmission Control Protocol
    TFTP: Trivial File Transfer Protocol
    U
    UDP: User Datagram Protocol
    USB: Universal Serial Bus
    W
    WWPN: World Wide Port Name
Index
    /etc/clustmngt/nsclusterstart.conf file, 2-5
    /etc/clustmngt/nsclusterstop.conf file, 2-5
    /proc file, 2-41
    B
    BIOS update, 7-1
    BMC firmware update, 6-1
    bootable system image, Se
40. Mode Interleave Branch O Rank Interleave 4 1 Branch Rank Sparing Disabled Branch 1 Rank Interleave 4 1 Branch 1 Rank Sparing Disabled Enhanced x8 Detection Enabled Advanced High Bandwidth FSB Enabled Chipset High Temp DRAM OP Disabled Control AMB Thermal Sensor Disabled Thermal Throttle Disabled Global Activation Throttle Disabled Crystal Beach Feature Enabled Route Port 80h cycles to LPC Clock Spectrum Feature Disabled High Precision Event Timer No USB Function Enabled Legacy USB Support Enabled Frequency Ratio Default Core Multi Processing Enabled Machine Checking Enabled Thermal Management 2 Enabled Advanced C1 Enhanced Mode Disabled Processor Execute Disable Bit Enabled Options Adjacent Cache Line Prefetch Enabled Hardware Prefetcher Enabled Direct Cache Access Disabled Intel R Virtualization Technology Disabled Intel EIST support Disabled KBC Clock Input 12MHz Serial port A Enabled Base I O address Serial port 3F8 Interrupt Serial port A IRQ 4 Device Serial port B Enabled Configuration Mode Normal Base I O address Serial port B 2F8 Interrupt Serial port B IRQ 3 Floppy disk controller Enabled Base I O address Primary DMI Event Event Logging Enabled Console Com Port Address On board COM B Redirection Baud Rate 115 2K Console Type VT 100 Managing the BIOS on NovaScale R4xx Machines 7 7 Flow Control None Console connection Direct Continue C R after POST On Hardware CPU Temp
41. Reboot the machine. Note: A BIOS setup configuration file for a particular BIOS version can only be restored on a machine which uses the same BIOS version. 7.3 Updating the BIOS on NovaScale R440 or R460 platforms. The BIOS update on these platforms is done through the Bull Update BIOS CD, which allows upgrading of the BIOS, BMC firmware and FRUs. Please follow the instructions provided with the CD. 7.4 BIOS Parameter Settings for NovaScale Rxxx Nodes. The BIOS parameter settings for the NovaScale R421, R421 E1, R422, R422 E1 Compute Nodes and the R440, R460, R423 Service Nodes will normally be configured in the factory before the machines are delivered. However, if the cluster set-up is changed, the following settings can be used to reset the machines back to their original state. Notes:
    • The settings shown in the tables are the default values. The parameter values that have to be changed for HPC are indicated in bold.
    • Some of these settings, for example for the storage, will vary according to the cluster and will differ from the settings shown in the tables and screen grabs.
7.4.1 Examples. Boot Features screen. Item Specific Help: Allows the system to skip certain tests while booting. This will decrease the time needed to boot the system. Settings: QuietBoot Mode; POST Errors: Enabled; ACPI Mode: Yes; Power Button Behavior: Instant Off; Re
42. Stop the node. From the management node, enter:
    nsctrl poweroff <node name>
This command executes an Operating System (OS) command. If the OS is not responding, it is possible to use:
    nsctrl poweroff_force <node name>
Wait for the command to complete. 4. Check the node status by using:
    nsctrl status <node name>
The node can now be examined, and any problems which may exist diagnosed and repaired. 1.1.2 Restarting a Node. To restart a node, enter the following command from the management node:
    nsctrl poweron <node name>
Note: If the system detects an error (temperature or otherwise) during the boot operation, the node will be prevented from rebooting. Check the node status. Make sure that the node is functioning correctly, especially if you have restarted the node after a crash:
    • Check the status of the services that must be started during the boot. The list of these services is in the /etc/rc.d file.
    • Check the status of the processes that must be started by a cron command.
    • The mail server, syslog-ng and ClusterDB must be working.
    • Check any error messages that the mails and log files may contain.
Restart SLURM and the filesystems. If the previous checks are successful, reconfigure the node for SLURM and restart the filesystems. 1.2 Stopping/Restarting an Ethernet Switch
    • Power off the Ethernet
43. Yes, Yes, storframework.conf
    stormap: Yes, Yes
    stormodelctl: Yes, Yes, storframework.conf
    storregister: Yes, Yes, storframework.conf
    storstat: Yes, Yes, storframework.conf
    stortrapd: No, Yes, storframework.conf
    stortraps: No, Yes, storframework.conf
Table 3-1. Available troubleshooting options for storage commands.
nec_admin Command for Bull FDA Storage Systems. The nec_admin command is used to manage Bull FDA Storage Systems. This command interacts with the FDA CLI. A retry mechanism has been implemented to handle the fact that the FDA CLI may reject commands when overloaded. If, despite the default settings, the nec_admin command occasionally fails, you may change the timeout and retry values defined in the /etc/storageadmin/nec_admin.conf file:
    # Number of retries in case of iSMserver Busy (Not Mandatory)
    retry = 3
    # If retry is set, time in seconds between two retries (Not Mandatory)
    rtime = 5
    # Timeout value. When the timeout is reached, the command is considered as failed.
    # If the number of retries does not exceed the retry value, the command is launched again, otherwise it is failed.
    cmdtimeout = 300
See the BAS5 for Xeon Administrator's Guide for more details about the nec_admin command. 3.5 Lustre Troubleshooting. The following section helps you troubleshoot some of the problems affecting your Lustre file system. Because typographic errors in your configuration script o
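The meaning of the retry and rtime values described in item 43 can be illustrated with a generic retry loop. This is a minimal sketch under stated assumptions, not the nec_admin implementation: run_with_retry is a hypothetical helper, it treats any non-zero exit status as a rejected command, and it does not model the cmdtimeout setting.

```shell
#!/bin/sh
# Hypothetical illustration of retry/rtime semantics; not the nec_admin code.
# Usage: run_with_retry RETRY RTIME COMMAND [ARGS...]
# The body runs in a subshell so that its variables stay local.
run_with_retry() (
    retry=$1
    rtime=$2
    shift 2
    attempt=0
    while :; do
        if "$@"; then
            exit 0                     # command accepted: stop retrying
        fi
        if [ "$attempt" -ge "$retry" ]; then
            exit 1                     # retries exhausted: report failure
        fi
        attempt=$((attempt + 1))
        sleep "$rtime"                 # wait rtime seconds before the next try
    done
)
```

With the default values quoted in the item, a call such as run_with_retry 3 5 some_command would make the first attempt and then up to three further attempts, pausing 5 seconds between them.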
44. a line board or a fabric board is replaced, always ensure that it is using the correct firmware. 3. Check that the firmware has upgraded correctly by running the firmware verify anafa II command:
    switchname (utilities) firmware verify anafa II
Chapter 5. Updating the firmware for the MegaRAID card. The MegaRAID SAS driver for the 8408E card is included in the BAS5 for Xeon delivery. The MegaRAID card will be detected, and the driver for it installed automatically, during the installation of the BAS5 for Xeon software suite. The tool used to update the firmware for the MegaRAID card is available on the Bull support CD. The latest firmware file should be downloaded from the LSI web site. Follow the procedure described below to update the firmware. 1. Check the version of the firmware already installed by running the command:
    /opt/MegaCli -AdpAllInfo -a0
This will provide full version and manufacturing date details for the firmware, as shown in the example below:
    Adapter #0
    Versions
    Product Name: MegaRAID SAS 8408E
    Serial No: P088043006
    FW Package Build: 5.0.1-0053
    Mfg. Data
    Mfg. Date: 01/16/07
    Rework Date: 00/00/00
    Revision No: 2
    Image Versions In Flash
    Boot Block Version: R.2.3.2
    BIOS Version: MT25
    MPT Versi
45. console management tool. See 2.2.1.1 Using ConMan.
    • ipmitool allows you to use a Serial Over LAN (SOL) link. See 2.2.1.2 Using ipmi Tools.
Note: Storage Units may also provide console interfaces through serial ports, allowing configuration and diagnostics operations. Using ConMan. The ConMan command allows the administrator to manage all the consoles, including server consoles and storage subsystem consoles, on all the nodes. It maintains a connection with all the lines that it administers. It provides access to the consoles using a logical name. It supports the key sequences that provide access to debuggers or to dump captures (Crash Dump). ConMan is installed on the Management Node. The advantages of ConMan over a simple telnet connection are as follows:
    • Symbolic names are mapped per physical serial line.
    • There is a log file for each machine.
    • It is possible to join a console session or to take it over.
    • There are three modes for accessing the console: monitor (read-only), interactive (read-write), broadcast (write-only).
Syntax:
    conman [OPTIONS] <CONSOLES>
    -b        Broadcast to multiple consoles (write-only).
    -d HOST   Specify server destination. [127.0.0.1:7890]
    -e CHAR   Specify escape character. [&]
    -f        Force connection (console-stealing).
    -F FILE   Read console names from file.
    -h        Display this help file.
    -j        Join connection (console-sharing).
    -l FILE   Log
46. local R421
    7.2.3 Installing a new BIOS on a remote R421
    7.2.4 Installing the Bull HPC BIOS Setup Configuration File on a remote R421 E1 machine
    7.3 Updating the BIOS on NovaScale R440 or R460 platforms
    7.4 BIOS Parameter Settings for NovaScale Rxxx Nodes
    7.4.1 Examples
    7.4.2 NovaScale R421 BIOS Settings
    7.4.3 NovaScale R421 E1 BIOS Settings
    7.4.4 NovaScale R422 BIOS Settings
    7.4.5 NovaScale R422 E1 BIOS Settings
    7.4.6 NovaScale R423 BIOS Settings
    7.4.7 NovaScale R425 BIOS Settings
    7.4.8 NovaScale R440 SATA BIOS Settings
    7.4.9 NovaScale R440 SAS BIOS Settings
    7.4.10 NovaScale R460 BIOS Settings
    7.4.11 NovaScale R480 E1 BIOS Settings
    Glossary and Acronyms
    Index
List of Figures: Figure 2-1, Figure 2-2, Figure 2-3, Figure 3-1, Figure 7-1, Figure 7-2
47. nsm calls before waiting the period defined by the option.
    --jobs           Number of simultaneous actions. For example, with 5 you can run 5 simultaneous nsmpower processes. Default: 30.
    --only_test, -o  Display the NS Commands that would be launched according to the specified options and action. This is a testing mode; no action is performed.
    --time, -t       Time to wait after the number of nsm calls defined by the interval option.
    --verbose, -v    Verbose mode.
Parameters:
    --Type <device type>  Type of devices to be pinged: disk_array or server.
    <command>             on or off.
    <devices>             Specify the name of the devices using the basename[i,j-k] syntax.
Examples:
    • The following command verifies that all the power supplies for disk arrays 10 to 15 are in the on state, and indicates those which are not:
        pingcheck --Type disk_array on da[10-15]
    • The following command verifies that servers nova5 to 7 are in the off state, and indicates those which are not:
        pingcheck --Type server off nova[5-7]
[Fragment of trace output from the facing page: OUT bali4 HCA 1; INTO ISR9024D Voltaire; OUT ISR9024D Voltaire; INTO bali23 HCA 1; RACK2 lid 0x14 port 1 guid 0002c90200234144 state Active width rate; RACK2 lid 0x1e port 1 guid 0002c902002341b1 state Active width rate]
Debugging Maintenance Tools. Modifying the Core Dump Size. By default t
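To show how a device specification such as those in the pingcheck examples of item 47 maps to individual device names, the sketch below expands a single range. It rests on an assumption: the punctuation of the range syntax was lost in this copy of the guide, so the bracketed form basename[i-j] is assumed here. expand_devices is a hypothetical helper, not a BAS5 command, and it handles only one plain decimal range without leading zeros.

```shell
#!/bin/sh
# Hypothetical helper: expand NAME[i-j] into NAMEi ... NAMEj, one per line.
# Any argument that does not look like NAME[i-j] is printed unchanged.
expand_devices() {
    case $1 in
        *\[*-*\])
            base=${1%%\[*}            # text before the opening bracket
            range=${1#*\[}            # text after the opening bracket
            range=${range%\]}         # drop the closing bracket
            first=${range%%-*}
            last=${range#*-}
            i=$first
            while [ "$i" -le "$last" ]; do
                printf '%s%s\n' "$base" "$i"
                i=$((i + 1))
            done
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}
```

Under that assumption, expand_devices 'nova[5-7]' lists nova5, nova6 and nova7, which are the servers named in the second example.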
48. program("/usr/bin/MailToAdmin"); };
    destination full { file("/dev/tty12"); file("/var/log/full.log" log_fifo_size(2000)); };
You can add specific options, such as log_fifo_size(2000), as shown in the example above. In the following example, all the logs will be sent to the Management Node, whose address is 192.168.0.100:
    destination central_log { tcp("192.168.0.100" port(514)); };
Using Macros. It may be useful to use macros to set intelligible names for your destination files. Predefined macros exist, such as $FACILITY, $PRIORITY or $LEVEL, $DATE, $FULLDATE, $ISODATE, $YEAR, $MONTH, $DAY, $HOUR, $MIN, $SEC, $FULLHOST, $HOST. Some examples are below:
    destination full { file("/dev/tty12"); file("/var/log/full$DAY$MONTH$YEAR.log" owner(root) group(adm)); };
    destination hosts { file("/var/log/HOSTS/$HOST/$FACILITY/$YEAR/$MONTH/$DAY/$FACILITY$YEAR$MONTH$DAY" owner(root) group(adm) perm(0600) dir_perm(0700) create_dirs(yes)); };
Note: Do not forget to remove or archive older files regularly. filter Section. This section describes the filtering mechanism for events. Syntax:
    filter <identifier> { expression; };
The filters are defined by the following keywords:
    facility(<facility>[,<facility>])      To filter by type
    level(<pri>[,<pri1>..<pri2>[,<pri3>]])  To filter by priority or level
    program(<regexp>)                      To filter by the na
49. related events can be generated by both the PM (Performance Monitor) and by the SM (Subnet Manager). The PM periodically scans the error counters of all IB elements in the fabric and reports if a counter exceeds its threshold. The SM monitors the fabric, detects configuration changes, and dynamically configures the new elements and new routes in the fabric. The SM can detect fabric errors, warnings and informative events, and report them. Both the PM and the SM generate events and report them to the event notification mechanism. In addition, events may be generated in the fabric and sent to the SM by fabric elements. The SM reports those events as well. The event mechanism can do the following actions with each event:
    a. Log the event in the event log.
    b. Issue a trap to the GUI session.
    c. If the event corresponds to an alarm, it is also sent to the current alarm mechanism.
The GUI color coding is defined according to the severity of traps and events, as described below:
    Critical (Major): Critical means that the system or a system component fails to operate. Examples: Invalid link; Duplicate or conflicting ports or path.
    Yellow, Warning (Minor): Warning (minor) reflects a problem in the fabric but does not prevent its operation. A warning is asserted when an event is exceeding a predefined threshold. Examples: Broken link; Illegal connections between two sLB ports.
    Normal (Information): Examples: Complete subnet. Notification provided to the user of normal
50. the port GUIDs of mthca0, enter:
    ibstat -p mthca0
    • To list all CA names, enter:
    ibstat -l
2.4.2 Diagnosing InfiniBand Fabric Problems (IBS tool). This tool is used from the Management Node to diagnose problems in the InfiniBand fabric, using the cluster switch topology information contained in the NetworkMap.xml file and the error checking counters contained in the PortCounters.csv file. Alternatively, an IBS database (IBSDB) containing all the switch information can be created and then used as the data source to diagnose the problems. Command syntax:
    ibs -a <action> [-hvCNE] -s <switch> <networkmap> <counters>
The following options are available for the ibs command:
    -v  Verbose mode.
    -C  Disable colored text output.
    -a  Action, one of: topo, bandwidth, errors, config, group, dbpopulate, availability, dbcreate, dbdelete, dbupdate, dbupdatepc.
OFED related options. When working from the cluster Management Node, and provided this node is fitted with an InfiniBand adapter that is connected to an InfiniBand interconnect, it is recommended that the -N and -E options are used, as the OFED software view of the cluster is more reliable than that provided by data taken directly from the switch.
    -N  Query the IB subnet manager to obtain and update the hostname details.
    -E  Query the IB subnet manager to obtain and update data u
51. vendor daemon resides:
    VENDOR INTEL <root directory path>
The lmgrd.intel daemon MUST be started with the -c argument:
    cd <installation directory>
    `pwd`/lmgrd.intel -c `pwd`/server.lic -l `pwd`/lmgrd.intel.log
Application Execution Problems.
Cannot connect to license server. Usually this means the server is not running. It can also mean the server is using a different copy of the license file, with a different port number from the one indicated by the license file you are currently using. You can use the lmdiag utility to analyze this error more fully.
License Server does not support this Feature. This means the server is using a different copy of the license file from the application. They should be synchronized. This error will also report UNSUPPORTED in the debug log file.
Invalid Host. You may be attempting to run the application on a host not listed in the HOSTID field of your license. Use lmhostid to find the hostid number for the current host.
Cannot find license file (No such file or directory). Expected license file location: <path>. The application was not able to find a license file. It gives you the location(s) where it was looking for a license file. Check that the named file exists. To use a file at a different location, use the environment variable INTEL_LICENSE_FILE.
No such Feature exists. The license manager cannot find a FEATURE line in the license file.
Feat
52. [Illegible scanned screen output: a rotated listing of InfiniBand switch ports and GUIDs. The values are not recoverable.]
53. 0002 nova6 Online 0x0000000000000001
    Service Name     Owner (Last)   State
    lustre_nova10    nova10         started
    lustre_nova6     nova10         started
To return to the initial configuration, you should stop lustre_nova6, which is running on nova10, and start it on nova6 using the lustre_migrate relocate command.
lustre_util status. This command displays the current state of the Lustre file systems. Sometimes this command can simply indicate that the recovery phase has not finished; in this situation the status will be set to WARNING and the remaining time will be displayed. When an I/O node has been completely re-installed following a system crash, the Lustre configuration parameters will have been lost for the node. They need to be redeployed from the Management Node by the system administrator. This is done by copying all the configuration files from the Management Node to the I/O node in question using the scp command, as shown below:
    scp /etc/lustre/conf/<fs_name>.xml <io_node_name>:/etc/lustre/conf/<fs_name>.xml
<fs_name> is the name of each file system that was included on the I/O node before the crash.
lustre_util info. This command provides detailed information about the current distribution of the OSTs/MDTs. The services and their status are displayed, along with information about the primary, secondary and active nodes. 3.6.2 /tmp/log/lustre/lustre_HA-ddmm.log. This f
54. 0002c9020023440d state Active width 4X rate 5.0 Gbps
    ISR9024D-M Voltaire lid 0x1 port 0 guid 0008 10400411 54 state Active width 4X rate 2.5 Gbps
    lid 0x2 port 3 guid 0008 10400411 state Active width 4X rate 5.0 Gbps
    lid 0x2 port 6 guid 0008 10400411 state Active width 4X rate 5.0 Gbps
ibtracert Command. ibtracert uses Subnet Management Packets (SMPs) to trace the path from a source GID/LID to a destination GID/LID. Each hop along the path is displayed until the destination is reached or a hop does not respond. By using the -mg and/or -ml options, multicast path tracing can be performed between the source and destination nodes. Syntax:
    ibtracert [options] <src-addr> <dest-addr>
Flags:
    -n         Simple format; no additional information is displayed.
    -m <mlid>  Show the multicast trace of the specified mlid.
Examples:
    • To show the trace between lid 2 and 23, enter:
        ibtracert 2 23
    • To show the multicast trace between lid 3 and 5 for mcast lid 0xc000, enter:
        ibtracert -m 0xc000 3 5
Output. The output for a command between two points is displayed in both hexadecimal format and in human readable format, as shown in the example below for the trace between the two lids 0x22 and 0x2c. This is very useful in helping to i
55. 049885e 4
    12 UP osc OSC_nova9_ost_nova6.ddn0.9_MNT_clientelan-e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    13 UP osc OSC_nova9_ost_nova10.ddn0.15_MNT_clientelan-e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    14 UP mdc MDC_nova9_nova5.ddn0.25_MNT_clientelan-e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
The last line indicates the state of the MDC, which is the client connecting to the MDT on the MDS. The other lines indicate the state of the OSCs, which are the clients connecting to each OST on the nova6 and nova10 OSSs.
/var/log/lustre/HA-yy-mm-dd.log. This file provides a trace of the calls made by CS5 to the Lustre failover scripts. Note: In the yy-mm-dd.log file name, yy specifies the year, mm the month and dd the day on which the file was created.
/var/log/syslog. This file provides a trace of the events and activity of CS5 and Lustre.
Pair Node Consistency. In some very specific cases, it may be necessary to reset the HA system to a state which ensures consistency across the pair nodes without stopping the Lustre system. 1. Disconnect the fs1 Lustre File System from the HA system:
    lustre_ldap unactive -f fs1
2. Run clustat to view the location of the services:
    clustat
3. Perform one of the following actions: switch a node from
56. 256 MB Memory Branch Mode Interleave Branch Rank Interleave 4 1 Branch Rank Sparing Disabled Branch 1 Rank Interleave 4 1 Branch 1 Rank Sparing Disabled Enhanced x8 Detection Enabled High Bandwidth FSB Enabled High Temp DRAM OP Disabled AMB Thermal Sensor Disabled Managing the BIOS on NovaScale R4xx Machines 7 11 Thermal Throttle Global Activation Throttle Crystal Beach Feature Route Port 80h cycles to Clock Spectrum Feature High Precision Event Timer USB Function Legacy USB Support Frequency Ratio Core Multi Processing Machine Checking Thermal Management 2 C1 Enhanced Mode Execute Disable Bit Adjacent Cache Line Prefetch Hardware Prefetcher Advanced Processor Options Direct Cache Access Intel R Virtualization Technology Intel EIST support Serial port A Base I O address Serial port A Interrupt Serial port A I O Device Configuration Serial port B Mode Base I O address Serial port B Interrupt Serial port B Com Port Address Baud Rate Console Redirection Consoles Flow Control Console connection Continue C R after POST CPU Temperature Threshold Hard Monit Fan Speed Control Modes System Event Logging Clear System Event Log SYS Firmware Progress BIOS POST Errors BIOS POST Watchdog OS boot Watchdog Timer for loading OS min Time out action Supervisor Password Is User Password Is Security Password on boot Boot 2 3 4 7 12 BASS for Xeon M
57. config: This action manually creates the instruction sequence needed to configure the hostname mapping for a switch. Note: This option only applies to Voltaire switches which use 4.0 or later firmware versions.
    ibs -s <switch name> -vNE -a config
group: This action generates the group.csv file that includes the hostname mapping configuration details for all the switches; this can then be imported into a switch in order to configure it. For large clusters, this is quicker than running the config action, as detailed above, to generate and import the cluster switch configuration details into a switch. Note: This option only applies to Voltaire switches which use 4.0 or later firmware versions.
    ibs -s iswu0c0-0 -a group
While the command is being carried out, a message similar to that below will appear:
    Successfully generated configuration file group.csv
To update a managed switch, proceed as follows: log onto the switch; enter the enable mode; enter the config menu; enter the group menu; then type the following command:
    group import <home/user/path>
2.4.2.2 IBSDB Database. It is possible to create a database which includes all the hardware and InfiniBand traffic details for all the switches with the IBS tool. This database is specific to InfiniBand hardware. The following commands apply to the IBSDB Datab
58. 98e8
    Checking Ca nodeguid 0x0008f10403979910
    Checking Ca nodeguid 0 0008 104039798 4
    Checking Ca nodeguid 0x0008f10403979920
    Checking Ca nodeguid 0x0008f10403979948
    Checking Ca nodeguid 0x0008f104039798f4
    Checking Ca nodeguid 0x0008f104039798d0
    Checking Ca nodeguid 0x0008f10403977ca4
    Summary: 13 nodes checked, 0 bad nodes found
    24 ports checked, 0 bad ports found
    1 ports have errors beyond threshold
3.2.4 ibcheckwidth and ibcheckportwidth
ibcheckwidth checks all nodes, using the complete topology file created by ibnetdiscover, to validate the bandwidth of the links which are active. It also identifies ports with 1X bandwidth.
ibcheckwidth Output Example
    Summary: 40 nodes checked, 0 bad nodes found
    140 ports checked, 0 ports with 1x width in error found
ibcheckportwidth checks the connectivity and the link width for a given port (lid) and indicates the actual bandwidth being used by the port. This should be checked against the maximum which is possible. For example, if the port supports 4X bandwidth then this should be used. Similarly, if the adapter supports DDR then this should be used.
Syntax
    ibcheckportwidth [-h] [-v] <lid|guid> <port>
Example
    ibcheckportwidth -v 0x2 1
Output
    Port check lid 0x2 port 1: OK
3.2.5 More Information
Please refer to the man pages for more informati
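The summary lines printed by the check scripts lend themselves to automated monitoring. The following is a minimal Python sketch, not part of the BAS5 tools, that extracts the counters from a captured summary and flags a fabric that needs attention. The function names parse_summary and links_need_attention are hypothetical, and the input is assumed to follow the wording of the output examples on this page.

```python
import re

# Matches "<count> <label>" where the label starts with "nodes", "ports",
# "bad nodes" or "bad ports" and runs to the next comma or end of line.
COUNTER = re.compile(r"(\d+)\s+((?:bad\s+)?(?:nodes|ports)\b[^,\n]*)")


def parse_summary(text):
    """Return a dict mapping each summary label to its integer count."""
    counters = {}
    for count, label in COUNTER.findall(text):
        counters[" ".join(label.split())] = int(count)
    return counters


def links_need_attention(counters):
    """True if any counter describing bad items or errors is non-zero."""
    return any(
        value > 0
        for label, value in counters.items()
        if "bad" in label or "error" in label
    )


if __name__ == "__main__":
    sample = (
        "Summary: 40 nodes checked, 0 bad nodes found\n"
        "140 ports checked, 0 ports with 1x width in error found\n"
    )
    print(parse_summary(sample))
    print(links_need_attention(parse_summary(sample)))
```

In practice the text would come from the captured standard output of ibcheckwidth or a similar script; how it is captured is left to the administrator.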
59. Advanced Memory Configuration PCI Contiguration Peripheral Configuration 7 30 System Date date Enabled Disabled Disabled Disabled Hardware Prefetcher Enabled Adjacent Cache Line Prefetch Enabled English 05 Processor Retest Execute Disable Bit Intel SpeedStep R Technology C1 Enhanced Mode Virtualization Technology Processor Settings Memory Retest No Extended RAM Step Disabled Online Spare Memory Disabled Memory RAS Feature Normal Hot plug PCI Control Reserved memory space for PHP Disabled Onboard SAS Option ROM Scan Enabled LAN 1 Option ROM Scan Disabled LAN 2 Option ROM Scan Disabled LAN 3 Option ROM Scan Enabled LAN 4 Option ROM Scan Disabled PCI Slot 1 Option ROM Enabled PCI Slot 2 Option ROM Enabled PCI Slot 3 Option ROM Enabled PCI Slot 4 Option ROM Enabled PCI Slot 5 Option ROM Enabled PCI Slot 6 Option ROM Enabled PCI Slot 7 Option ROM Enabled Enabled 3F8 IRQ 4 Enabled 2F8 Onboard NIC Base I O address Interrupt Base I O address Serial port A Serial port B BASS for Xeon Maintenance Guide P Interrupt USB 2 0 Controller Enabled Legacy USB Support Enabled Serial ATA Enabled Native Mode Operation Auto Multimedia Timer Disabled Intel R I OAT Enabled Wake On LAN PME Enabled Wake On Ring Disabled Wake On RTC Alarm Disabled Boottime Diagnostic Screen Enabled Advanced Chipset Control Reset Configuration Data NumLock M
60. HPC BAS5 for Xeon Maintenance Guide
REFERENCE 86 A2 90EW 01
Hardware and Software
September 2008
BULL CEDOC, 357 AVENUE PATTON, B.P. 20845, 49008 ANGERS CEDEX 01, FRANCE
The following copyright notice protects this book under Copyright laws which prohibit such actions as, but not limited to, copying, distributing, modifying, and making derivative works.
Copyright © Bull SAS 2008. Printed in France.
Trademarks and Acknowledgements
We acknowledge the rights of the proprietors of the trademarks mentioned in this manual. All brand names and software and hardware product names are subject to trademark and/or patent protection. Quoting of brand and product names is for information purposes only and does not represent trademark misuse.
The information in this document is subject to change without notice. Bull will not be liable for errors contained herein, or for incidental or consequential damages in connection with the use of this material.
Preface
Intended Readers
This guide is intended for use by qualified personnel in charge of maintaining and troubleshooting the Bull HPC clusters of NovaScale R4xx nodes, based on Intel Xeon processors.
Prerequisites
Readers need a basic understanding of the hardware and software components that make up a Bull HPC cluster, and are advised to read the documentation listed in the Bibliography below.
61. The BMC of a remote node can be accessed using the ipmitool command (man ipmitool), or the higher level cluster-oriented conman or NS commands. See Chapter 2 in this manual.
Examples using the ipmitool command
1. To obtain the BMC LAN configuration for a NovaScale R42x machine (channel 1):
    ipmitool -H <BMC IP addr> -U ADMIN -P ADMIN lan print 1
2. To shut down a remote machine:
    ipmitool -H <BMC IP addr> -U ADMIN -P ADMIN power soft
3. To connect to a remote console via SOL for NovaScale R421, R422, R422 E1, R423, R440 and R460 machines:
    ipmitool -I lanplus -H <BMC IP addr> -U ADMIN -P ADMIN sol activate
   Enter ~. to terminate the connection.
4. To connect to a remote console via SOL for a NovaScale R421 E1 machine:
    ipmitool -I lanplus -H <BMC IP addr> -U ADMIN -P ADMIN -o intelplus sol activate
6.1.2.2 Tips for using ipmitool and SOL
• If the payload is already active for another session, it can be deactivated by running the ipmitool sol deactivate command.
• The escape character can be changed to & to prevent conflicts with ssh.
• Use the ESC and the number 2 keys, instead of the F2 key, to access the BIOS on NovaScale R440 and R460 machines.
• Use the ESC and the minus keys, instead of the DEL key, to access the BIOS on NovaScale R421 and R422 machines.
6.1.2.3 Web remote access
The BMC can be accessed using a web interface for NovaScale R421, R422, R422 E1 and R423
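When the ipmitool invocations above are scripted across many nodes, it helps to assemble the argument list in one place. The following Python sketch only builds the argument vector and does not contact any BMC. The helper name ipmitool_command and the address 192.0.2.10 are hypothetical; the options used (-I, -H, -U, -P, -o) are the ones shown in the examples on this page.

```python
def ipmitool_command(bmc_host, user, password, action, interface=None, oem_type=None):
    """Build an ipmitool argument list in the same order as the manual's examples.

    action is a sequence of sub-command words, for example ("lan", "print", "1").
    interface is the optional -I value (for example "lanplus" for SOL).
    oem_type is the optional -o value (for example "intelplus").
    """
    argv = ["ipmitool"]
    if interface:
        argv += ["-I", interface]
    argv += ["-H", bmc_host, "-U", user, "-P", password]
    if oem_type:
        argv += ["-o", oem_type]
    argv += list(action)
    return argv


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder documentation address, not a real BMC.
    print(ipmitool_command("192.0.2.10", "ADMIN", "ADMIN", ("lan", "print", "1")))
    print(ipmitool_command("192.0.2.10", "ADMIN", "ADMIN", ("sol", "activate"),
                           interface="lanplus", oem_type="intelplus"))
```

The resulting list can be handed to subprocess.run without a shell, which avoids quoting problems with passwords. Passing the password on the command line exposes it to other local users, so this form is best kept to the management node.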
62. Device Power State
2.5 Debugging Maintenance Tools
2.5.1 Modifying the Core
2.5.2 Identifying InfiniBand Network Problems (ibdoctor, ibtracert)
2.5.3 Using dump tools with RHEL5 (crash, proc)
2.5.4 Identifying problems in the different parts of a
2.6 Testing Maintenance
2.6.1 Checking Nodes after Boot Phase
Chapter 3
3.1 Troubleshooting Voltaire Networks
3.1.1 Voltaire's Fabric Manager
3.1.2 Diagnostics
3.1.3 Debugging
3.1.4 High Level Diagnostic Tools
3.1.5 Diagnostic
3.1.6 Event Notification Mechanisms
3.2 Troubleshooting InfiniBand Stacks
3.2.1 smpquery
63. Example of IBS command topo action output
Example of IBS command bandwidth action output
Example of IBS command errors action output
OpenIB Diagnostic Tools Software Stack
Example BIOS parameter setting screen for NovaScale R421
Example BIOS parameter setting screen for NovaScale R422
List of Tables
Table 2-1. Maintenance Tools
Table 3-1. Available troubleshooting options for storage commands
Chapter 1. Stopping/Starting Procedures
This chapter describes procedures for stopping and restarting Bull HPC cluster components, which are mainly used for maintenance purposes.
The following procedures are described:
• 1.1 Stopping/Restarting a Node
• 1.2 Stopping/Restarting an Ethernet Switch
• 1.3 Stopping/Restarting a Backbone Switch
• 1.4 Stopping/Restarting the HPC Cluster
1.1 Stopping/Restarting a Node
1.1.1 Stopping a Node
Follow these steps to stop a node:
1. Stop the customer's environment. Check that the node is not running any applications by using the SINFO command on the management node. All customer applications and connections should be stopped or closed, including shells and mount points.
2. Un-mount the filesystem.
3.
64. ####################
    GROUP <nb simultaneous poweron> <time to wait> <period to wait> <time to wait after this GROUP>
/etc/clustmngt/nsclusterstop.conf
    ####################
    # First Part is used to control the power supply of DDN and servers
    ####################
    # time to wait after poweroff for all servers being effectively down
    servers StopDelay 180
    # time to wait for ddn processing shutdown
    ddnShutdown Time 180
    # time to wait after poweroff for all powerswitches being OFF
    couplets StopDelay 30
    ####################
    # Following part is used to control the order to stop nodes groups
    GROUP nb simultaneous poweron time to wait period to wait time to wait after this GROUP
2.2.3 Managing hardware (nsctrl)
The nsctrl command carries out various tasks related to hardware. This c
65. However, it may be necessary to reconfigure the BMC to set up a new IP address, or when the firmware is updated. Follow the steps below to do this:
1. Install the update bmc fw rpm onto the machine.
2. Configure the LAN and SOL access to the BMC with the default user name (administrator) and default password (administrator).
   For the local BMC of the machine, run the command:
    bmc init param -b <BMC IP address> -m <BMC net mask>
   For a remote BMC on a machine accessible through SSH, run the command:
    bmc init param -b <BMC IP address> -m <BMC net mask> -s <remote machine IP>
Chapter 7. Managing the BIOS on NovaScale R4xx Machines
This chapter describes how to update the BIOS on NovaScale R4xx machines. It also defines the recommended settings for the BIOS parameters for these machines.
7.1 Updating the BIOS on NovaScale R421, R422, R422 E1 and R423
This section describes how to update the motherboard BIOS of a NovaScale R421, R422, R422 E1 or R423 machine.
Install the bios <platform> <bios version> rpm corresponding to your platform and to the new BIOS release. The corresponding BIOS DOS image <BIOS>.IMG is installed in /usr/local/firmware/.
WARNING
• Ensure that the BIOS version corresponding to your platform is used.
• The BIOS upgrade MUST NOT be interrupted whilst it is in course of operation.
• If the BIOS does not work, a new BI
66. I O address Serial port A 3F8 Interrupt Serial port A IRQ 4 I O Device Configuration Enabled Mode Normal Base I O address Serial port 2F8 Interrupt Serial port B IRQ 3 Parallel Port Disabled Floppy disk controller Disabled 7 Event Loggin Enabled Com Port Address On board COM B Baud Rate 115 2K Console Redirection s Flow Control None Console connection Direct Continue C R after POST On System Event Logging Enabled Clear System Event Log Disabled SYS Firmware Progress Disabled BIOS POST Errors Enabled BIOS POST Watchdog Disabled OS boot Watchdog Disabled Timer for loading OS min 10 Time out action No Action IP Address BMC IP address IP Subnet Mask Default Gateway MAC Address 0 5 Supervisor Password 15 Clear Security User Password 15 Clear Password on boot Disabled USB FDC USB KEY IDE CD USB CDROM USB LS120 PCI BEV IBA GE Slot 0700 v1270 First disk Managing BIOS on NovaScale R4xx Machines 7 19 7 4 7 NovaScale R425 BIOS Settings mainboard X7DWA N R425 BIOS 7DWA4308 ROM Rev 1 1 BIOS Setup Section System Time lt Current local time gt System Date lt Current date gt Legacy diskette A Disabled IDE Channel 0 Master Auto IDE Channel Slave Auto SATA Port 0 Auto SATA Port 1 Auto SATA Port 2 Auto SATA Port 3 Auto Parallel ATA Enabled Serial ATA Enabled SATA Controller Mode Option Enhanced SATA RAID Enable Disabled SATA AHCI Ena
67. It is usually possible to focus the debug mode on the problematic part of the kernel which has been identified after recompilation. It is also possible to insert code (e.g. printk) to help examine the problematic part.
The different compilation tasks for a machine (stopping, starting, resetting, creating a dump, bootstrapping a compiled system and debugging) may be carried out from a remote work station connected to a development machine configured as a DHCP server.
2.6 Testing Maintenance Tools
2.6.1 Checking Nodes after Boot Phase (postbootchecker)
postbootchecker detects when a Compute Node is starting and runs check operations on this node after its boot phase. The objective is to verify that the CPU and memory parameters are coherent with the values stored in the ClusterDB and, if necessary, to update the ClusterDB with the real values.
2.6.1.1 Prerequisites
• syslog-ng must be installed and configured as follows:
  Management Node: management of the logs coming from the cluster nodes.
  Compute nodes: detection of the compute nodes as they start.
• The postbootchecker service must be installed before the RMS service, to avoid jobs being disturbed.
2.6.1.2 postbootchecker Checks for the Compute Nodes
The postbootchecker service (/etc/init.d/postbootchecker) detects every time a Compute Node starts. Whilst the node is starting up, postbootchecker runs three s
68. OS chip must be ordered.
7.1.1 To install a new BIOS locally
1. Copy the <BIOS>.IMG file onto a USB key:
    dd if=/usr/local/firmware/<BIOS>.IMG of=/dev/sd<your USB device>
2. Insert the key and reboot the machine. The autoexec file contained in the DOS file automatically starts the BIOS update. Wait for the BIOS installation to finish.
3. Remove the USB key.
4. Restart the machine.
7.1.2 To install a new BIOS on a remote machine using PXE
Note: The remote machine must be configured to boot via PXE on the server. The server must be configured as a TFTP server.
1. Install the update bios rpm on the server.
2. If the remote machine is accessible using IPMI, run this command on the server:
    update bios <remote IP address> /usr/local/firmware/<BIOS>.IMG <BMC IP address>
   or, if the server can connect to the remote machine using ssh, run this command:
    update bios <remote IP address> /usr/local/firmware/<BIOS>.IMG
3. The update bios command returns after the BIOS update is completed on the remote machine.
Usage
    update bios <ipaddr> <bios image> <bmc ipaddr> <user name> <user passwd>
    <ipaddr>       network address of the remote machine to have its BIOS updated
    <bios image>   local path to the BIOS DOS image file
    <bmc ipaddr>   BMC address of the remote machine
    <user name>    BMC user name
    user pass
69. PROM Configure Disabled Option ROM Re Placement Disabled PCI Parity Error Forwarding Disabled PCI Fast Delayed Transaction Disabled Reset Configuration Data No Option ROM Scan Enabled SLOT1 PCI Exp x16 Enable Master Enabled Latency Timer Default Memory Cache PCI Configuration Advanced SERR signal condition Single bit Chipset Control Clock Spectrum Feature Disabled Intel VT for Directed I O VT d Disabled AGB PCI Hole Granularity 256 MB Memory Voltage Auto Memory Branch Mode Interleave Branch O Rank Interleave 4 1 Branch 0 Rank Sparing Disabled Branch 1 Rank Interleave 4 1 Branch 1 Rank Sparing Disabled Enhanced x8 Detection Enabled Demand Scrub Enabled High Temp DRAM OP Disabled 7 14 BAS5 for Xeon Maintenance Guide AMB Thermal Sensor Disabled Thermal Throttle Disabled Global Activation Throttle Disabled Force ITK Config Clocking Disabled Snoop Filter Enabled Crystal Beach Feature Enabled Route Port 80h cycles to LPC High Precision Event Timer No USB Function Enabled Legacy USB Support Enabled Default Enabled Enabled Enabled Enabled Disabled Enabled Enabled Enabled Disabled Disabled Disabled Disabled 12MHz Enabled 3F8 IRQ 4 Enabled Normal Frequency Ratio Core Multi Processing Machine Checking Fast String operations Thermal Management 2 Advanced C1 C2 Enhanced Mode Processor Execute Disable Bit Options Adjacent Cache Line Prefetch Hardware Prefetcher Set Max Ext CPUID
70. Populating switch chassis with boards Assigning ports to IB hosts Connecting ports Looking for program smpquery Updating hostnames using OFED smpquery Looking for program perfquery ISR9096 0 ISR9288 2012 0 total 24 No board found boards 0 chassis 0 assigned 74 total 74 assigned 37 pairs total 37 pairs using usr local ofed bin smpquery updated 24 failed 0 total 24 using usr local ofed bin perfquery Updating port counters using OFED perfquery updated 74 assigned 74 Connecting to database clusterdb on host localhost 5432 Updating equipment localisation from database clusterdb Assigning portcounters Updating equipment IP addresses from database clusterdb Updating switch IDs from database clusterdb Connecting to database ibsdb on host localhost 5432 Populating table chassis in database ibsdb Populating tables asic and chassis in database ibsdb Populating table board in database ibsdb Populating table asic in database ibsdb Populating table hca in database ibsdb Populating tables asic port and hca port in database ibsdb Populating tables asic portcounters and hca portcounters failed 0 total 74 not assigned 0 total 74 Done 24 localisations updated 24 IP addresses updated 21 switch IDs updated Done 0 chassis stored 3 ISR9024 switch stored 0 boards stored 0 ASICs stored 21 HCAs stored 74 ports stored 74 portcounters stored
71. SATA Port 3 Parallel ATA Serial ATA SATA Controller Mode Option SATA Raid enable SATA AHCI enable QuickBoot Mode QuietBoot Mode POST Errors ACPI Mode Power Button Behaviour Resume On Modem Ring EFI os boot Power Loss Control Watch Dog Summary screen Cache System BIOS area Cache Video BIOS area Cache Base 0 512k Cache Base 512k 640k Cache Extended Memory Area Discrete MTRR Allocation Onboard G LAN1 OPROM Configure Onboard G LAN2 OPROM Configure Option ROM Re Placement PCI Parity Error Forwarding PCI Fast Delayed Transaction Reset Configuration Data Frequency for PCIX 1 2 Option ROM Scan Enable Master Latency Timer Option ROM Scan Enable Master Latency Timer SLOTO PCI U X8 SLOTI PCI X 133MHz Current local time Current date Disabled Auto Auto Auto Auto Auto Auto Enabled Enabled Enhanced Disabled Enabled Enabled Disabled Disabled Yes Instant Off Off Disabled Stay Off Disabled Disabled Write Protect Write Protect Write Back Write Back Write Back Disabled Enabled Disabled Disabled Disabled Disabled No Auto Enabled Enabled Default Enabled Enabled Default SLOT2 PCI X 133MHz Option ROM Scan Enabled Managing the BIOS on NovaScale R4xx Machines 7 17 Enabled Default Enable Master ERR 22228 Option ROM Scan Enabled Enable Master Enabled Latency Timer Default Option ROM Scan Enabled SLOTA PCI Exp x4 Enable Master Enabled Latency Timer D
72. storregister_TRACE_STDOUT_LEVEL
    storregister_TRACE_LOG_FILE_LEVEL
2. Edit the storframework.conf file. Uncomment one of the two previous lines. Choose a level of trace between 1 (lowest) and 4 (highest level). For example, to add traces of debug level (4, highest level) on stdout only, the storframework.conf file must contain the following lines:
    STDOUT trace level configuration
    storregister_TRACE_STDOUT_LEVEL 4
    log file trace level configuration
    storregister_TRACE_LOG_FILE_LEVEL
3. Save the storframework.conf file.
4. Relaunch storregister. New traces will appear on the stdout.
3.4.1.3 Available Troubleshooting Options for Storage Commands
The following table sums up the available troubleshooting options for the storage commands.
Command, User option, Log, Traces, Name of the corresponding conf File
    fcswregister    Yes
    iorefmgmt       Yes
    ioshowall       Yes
    lsiocfg         Yes Yes
    lsiodev         Yes
    nec admin       Yes Yes    nec admin conf
    nec stat        Yes
    stordepha       Yes
    storcheck       Yes Yes    storframework.conf
    stordepmap      Yes Yes
    stordiskname    Yes
    storiocellctl   Yes Yes    storframework.conf
    storioha        Yes
3.4.1.4
Command, User option, Log, Traces, Name of the corresponding conf File
    storiopathctl
73. Specify a configuration file (default: /etc/clustmngt/nsclusterstart.conf or /etc/clustmngt/nsclusterstop.conf).
-h                 Display nsclusterstart/nsclusterstop help.
--only_test, -o    Display the commands that would be launched according to the specified options. This is a testing mode; no action is performed.
--verbose, -v      Verbose mode.
Configuration files
/etc/clustmngt/nsclusterstart.conf
    ####################
    # First Part is used to control the power supply of DDN and servers
    ####################
    # time to wait for all diskarrays ok before powering the powerswitches on
    disk arrays StartDelay 300
    # time to wait for all powerswitches being ON after a poweron
    couplets StartDelay 60
    # time to wait after poweron for all servers being effectively operational
    Servers StartDelay 480
    ####################
    # Following part is used to control the order to start nodes groups
    ####################
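The three start delays above can be read programmatically, for example to estimate how long a cluster start will take before nodes begin to boot. The following Python sketch is illustrative only. The exact key spelling and separator used in the real nsclusterstart.conf cannot be recovered from this copy of the guide, so the parser accepts both "name value" and "name = value" forms; the function name parse_delays is hypothetical, and treating the three waits as strictly sequential is an assumption.

```python
def parse_delays(conf_text):
    """Return {setting_name: seconds} for every non-comment line ending in an integer.

    Lines starting with '#' and GROUP lines are ignored. Words in the setting
    name are joined with '_' so that "disk arrays StartDelay 300" and
    "disk_arrays_StartDelay = 300" give the same key.
    """
    delays = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        tokens = line.replace("=", " ").split()
        if tokens[0] == "GROUP" or len(tokens) < 2 or not tokens[-1].isdigit():
            continue
        delays["_".join(tokens[:-1])] = int(tokens[-1])
    return delays


if __name__ == "__main__":
    sample = """
    # time to wait for all diskarrays ok before powering the powerswitches on
    disk arrays StartDelay 300
    # time to wait for all powerswitches being ON after a poweron
    couplets StartDelay 60
    # time to wait after poweron for all servers being effectively operational
    Servers StartDelay 480
    """
    delays = parse_delays(sample)
    # Assumption: the three waits are applied one after another.
    print(delays, sum(delays.values()))
```

With the values shown in this guide, the sum is 840 seconds, which is only a rough lower bound under the sequential assumption stated above.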
74. V 0003 SERIAL 3KROKTHM000075475NWC TRANSPORT SPI
    sda 0202970 bru SEAGATE running 286102 31 MODEL SEAGATE ST3300007LC FWREV 0003 SERIAL 3KROJT0TO0007548GUXA TRANSPORT SPI
    sdd 2 0 0 0 8 48 DDN running 10000 30 dev ldn ddn0 13 MODEL DDN S2A 8500 FWREV 5 20 SERIAL 02A820510D00 TRANSPORT FC WWPN 24 00 00 01 ff 03 02 a8 NAME unknown
    sde 2 0 0 1 8 64 DDN running 125000 30 dev ldn ddn0 14 MODEL DDN S2A 8500 FWREV 5 20 SERIAL 02A820540E00 TRANSPORT FC WWPN 24 00 00 01 ff 03 02 a8 NAME unknown
    sdf 2202072 8 80 DDN running 10000 30 dev ldn ddn0 15 MODEL DDN S2A 8500 FWREV 5 20 SERIAL 03E020570F00 TRANSPORT FC WWPN 24 00 00 01 ff 03 02 a8 NAME unknown
    sdg 2 0 0 3 8 96 DDN running 125000 30 dev ldn ddn0 16 MODEL DDN S2A 8500 FWREV 5 20 SERIAL 03E0205A1000 TRANSPORT FC WWPN 24 00 00 01 03 02 a8 NAME unknown
2.4.4.3 Disk Usage and Partition Inventories
These inventories give information about the system and logical use of the devices. Such information is mostly used for system administration needs.
2.4.5 Checking Device Power State (pingcheck)
The pingcheck command checks the power state (on or off) of the specified devices.
Usage
    pingcheck options device type command devices
Options
    --dbname name    Specify database name.
    --debug, -d      Debug mode (more than verbose).
    -h               Display pingcheck help.
    --interval, -i   Specify the number of
75. able to see current problems. In the VFM, right-click and select Alarm Data to get information to help identify where the problem is located.
5. Use the Topology Map to identify nodes with a current alarm.
6. Proactively look for increasing error counters, using the statistics feature and running the Diagnostic scripts using the CLI.
Note: See the Voltaire Switch User Manual ISR 9024, ISR 9096, and ISR 9288/2012 Switches for full details on using these tools.
3.1.5 Diagnostic Tools
3.1.5.1 zero counters script
To clear out all the errors across the fabric, use the zero counters script to traverse the fabric and clear out all the port counters on both the switches and HCAs. This script is very easy to use, and is helpful if you want to start off with a clean baseline of your fabric after many changes have occurred.
    ISR9288 utilities zero counters
    Zero All Counters
    lid 1 ports 24 ************************
    lid 5 ports 24 ************************
    lid 4 ports 24 ************************
    lid 3 ports 24 ************************
    lid 2 ports 24 ************************
    lid 11 ports 24 ************************
Note: See the Voltaire Switch User Manual ISR 9024, ISR 9096, and ISR 9288/2012 Switches for full details on the CLI commands.
3.1.5.2 width check script
Another valuab
76. aintenance Guide Event Logging Enabled DMI Event L ECC Event Logging Enabled Disabled Disabled Enabled LPC Disabled No Enabled Enabled Default Enabled Enabled Enabled Disabled Enabled Enabled Enabled Disabled Disabled Disabled Enabled 3F8 IRQ 4 Enabled Normal 2F8 IRQ 3 On board COM B 115 2K VT100 None Direct 75 C 2 3 pin Server Enabled Disabled Disabled Enabled Disabled Disabled 10 No Action Clear Clear Disabled USB FDC USB CDROM USB KEY USB LS120 PepperC Virtual disc PCI BEV IBA GE Slot 0400 v1236 IDE 4 WDC WD1600YS 01SHB1 S2 Managing the BIOS on NovaScale R4xx Machines 7 13 7 4 5 NovaScale R422 E1 BIOS Settings motherboard X7DWT R422 EI BIOS 1 06 7DWTC217 System Time lt Current local time gt System Date lt Current date gt Main Serial ATA Enabled Native Mode Operation Serial ATA SATA Controller Mode Option Compatible Advanced QuickBoot Mode Enabled QuietBoot Mode Disabled POST Errors Disabled ACPI Mode Yes Power Button Behaviour Instant Off Resume On Modem Ring Off EFI OS Boot Disabled Power Loss Control Stay Off Watch Dog Disabled Summary screen Disabled Cache System BIOS area Write Protect Cache Video BIOS area Write Protect Cache Base 0 512k Write Back Cache Base 512k 640k Write Back Cache Extended Memory Area Write Back Discrete MTRR Allocation Disabled Onboard G LAN1 OPROM Configure Enabled Onboard G LAN2 O
77. an set the trace level in the appropriate conf file under /etc/storageadmin. There are two lines in these files to set the trace. These lines look as follows, where <command_name> is the name of the command to debug:
    # <command_name>_TRACE_STDOUT_LEVEL
    # <command_name>_TRACE_LOG_FILE_LEVEL
The first line is used to activate traces on stdout; the second one is used to generate traces in a log file under /tmp whose name includes the command name and its PID. By default the two lines are commented out.
Note: It is recommended to use this trace tool only for temporary debugging, because there is no automatic cleaning of the trace log files in /tmp.
Four levels of traces are available:
• 4 => TRACE_LEVEL_DEBUG
• 3 => TRACE_LEVEL_INFO
• 2 => TRACE_LEVEL_WARNING
• 1 => TRACE_LEVEL_ERROR
Level 4 is the most verbose level; level 1 traces only error messages.
Note: It is not possible to add new commands. All the commands accepting this system of traces are listed in the corresponding conf file. See the BAS5 for Xeon Administrator's Guide to identify the right configuration file.
Example
The following example explains how to obtain log file and/or stdout traces on the storregister command:
1. Find the right /etc/storageadmin conf file to modify. In the case of the storregister command, it is storframework.conf, because of the presence of these two lines:
    RACE
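The trace settings described above follow a regular naming pattern, which can be sketched in a few lines of Python. This is an illustration rather than a Bull tool. It assumes the level mapping implied by the text (1 is errors only, 4 is debug and the most verbose), assumes that a given level also shows every less verbose level, and assumes "=" as the separator between the setting name and its value, which this copy of the guide does not preserve. The names trace_setting and levels_shown are hypothetical.

```python
TRACE_LEVELS = {
    1: "TRACE_LEVEL_ERROR",
    2: "TRACE_LEVEL_WARNING",
    3: "TRACE_LEVEL_INFO",
    4: "TRACE_LEVEL_DEBUG",
}


def trace_setting(command_name, target, level):
    """Return one configuration line, e.g. storregister_TRACE_STDOUT_LEVEL = 4.

    target is "STDOUT" or "LOG_FILE"; the "=" separator is an assumption.
    """
    if target not in ("STDOUT", "LOG_FILE"):
        raise ValueError("target must be STDOUT or LOG_FILE")
    if level not in TRACE_LEVELS:
        raise ValueError("level must be between 1 and 4")
    return f"{command_name}_TRACE_{target}_LEVEL = {level}"


def levels_shown(level):
    """List the trace categories emitted at a given level (assumed cumulative)."""
    return [name for number, name in sorted(TRACE_LEVELS.items()) if number <= level]


if __name__ == "__main__":
    print(trace_setting("storregister", "STDOUT", 4))
    print(levels_shown(2))
```

Only commands already listed in the relevant conf file accept these settings, so the sketch is useful for generating a line to uncomment or adjust, not for adding new commands.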
78. and checks all the possible routes between the adapters ibdoctor T BAS5 for Xeon Maintenance Guide The output looks as follows 28 lids found ISR9024D Voltaire ISR9024D Voltaire ISR9024D Voltaire ISR9024D Voltaire ISR9024D Voltaire ISR9024D Voltaire bali6 HCA 1 ISR9024D Voltaire ISR9024D Voltaire bali7 HCA 1 ISR9024D Voltaire ISR9024D Voltaire ISR9024D M Voltaire 11 0 1 port 0 guid 0008 10400411 54 state Active width 4X rate 2 5 lid 0x2 port 15 guid 0008f10400411d6a state Active width 4X rate 5 0 ISR9024D M Voltaire 11 0 1 port 0 guid 0008 10400411 54 state Active width 4X rate 2 5 lid 0 11 port 13 guid 0008f10400411da2 state Active width 4X rate 5 0 lid 0 11 port 18 guid 0008f10400411da2 state Active width 4X rate 5 0 lid 0x3 port 6 guid 0008f10400411d70 state Active width 4X rate 5 0 ISR9024D M Voltaire lid 0 1 port 0 guid 0008 10400411 54 state Active width 4X rate 2 5 lid 0x2 port 15 guid 0008 1040041146 state Active width 4X rate 5 0 lid 0x2 port 4 guid 0008 1040041146 state Active width 4X rate 5 0 RACK1 D lid 0x4 port 1 guid 0002c90200234405 state Active width 4X rate 5 0 ISR9024D M Voltaire 11 0 1 port 0 guid 0008 10400411 54 state Active width 4X rate 2 5 lid 0x2 port 16 guid 0008 10400411 state Active width 4X rate 5 0 lid 0x2 port 5 guid 0008 10400411 state Active width 4X rate 5 0 RACK1 E lid 0x5 port 1 guid
79. ase
dbcreate
To create an empty new IBS database (ibsdb), use the dbcreate command. Only the postgres user is allowed to create an empty database.
    postgres@admin$ ibs -a dbcreate
While the command is being carried out, a message similar to that below will appear:
    Looking for program createdb using /usr/bin/createdb
    Looking for program psql using /usr/bin/psql
    Creating database ibsdb Done
    Loading table definitions into database ibsdb Done
dbdelete
To delete an IBS database (ibsdb), use the dbdelete command. Only the postgres user is allowed to delete an empty database.
    postgres@admin$ ibs -a dbdelete
While the command is being carried out, a message similar to that below will appear:
    Looking for program dropdb using /usr/bin/dropdb
    Deleting database ibsdb Done
dbpopulate
Use the dbpopulate action to populate a new database. In the example below, data is supplied from the iswu0c0-0 managed switch, from the Management Node, and the hostnames and traffic counters are populated using the OFED tools.
    ibs -s iswu0c0-0 -a dbpopulate -vNE
While the command is being carried out, a message similar to that below will appear:
    Connecting to switch iswu0c0-0
    Sending request for file NetworkMap.xml Done
    Getting response header from switch iswu0c0-0 Done
    Downloading NetworkMap.xml Done
    Creating IB hosts 21
    ASICS 0
    ISR9024 3
    Populating boards
80. at test is a directory. tesi will exclude all items in test; also, test will NOT exist upon restore. test will exclude all items in the directory test, but test will be created upon restore.
It is essential to exclude as many directories as possible, in order to reduce the number of DVDs used for the backup. For Bull HPC Clusters the following directories can be excluded:
• /release
• /tmp (excluded by default)
• (excluded by default)
• /test, if it exists
If you do not need to save the KSIS images, you can also exclude:
• /var/lib/systemimager/images
• /var/lib/systemimager/scripts
• /var/lib/systemimager/overrides
Note: The RPMs that are installed on the Management Node are in the /release directory. It is not necessary to save it, because these RPMs can be retrieved from the installation CDs.
2.3.3 Backing up a system
2.3.3.1 Un-mount the Mounted Drives
It is recommended to un-mount the mounted drives, assuming the mounted data does not need to be saved.
2.3.3.2 Stop Services
All activity on the Management Node must be stopped when creating the backup. The ClusterDB must not be used during the backup operation.
The following services should all be stopped before running BSBR:
• lustre
• ganglia
• postgresql
To stop the Lustre service, run the following commands:
    lustre_util umount -f all
81. be done manually by using the command:
    echo 1 > /proc/sys/kernel/unknown_nmi_panic
An NMI dump may be launched using IPMI via the command:
    ipmitool -H <bmc address> -U <user name> -P <pwd> chassis power diag
or by using the nsctrl command.
See http://kbase.redhat.com/faq/FAQ_105_9036.shtm for more information.
Notes
• If the watchdog is still active after the kernel.unknown_nmi_panic=1 option is set, the machine will no longer boot.
• For this release of BAS5 for Xeon, the IPMI power diag command will launch a dump for NovaScale R423, NovaScale R440 and NovaScale R460 series machines.
• There is also a dump button on the back of the NovaScale R460 series machines that will launch an NMI dump for these machines.
Further information can be found in the kdump man pages.
It is essential to use non-stripped binary code within the kernel. Non-stripped binary code is included in the debuginfo RPM, available from http://people.redhat.com/duffy/debuginfo/index-js.html
This package installs the kernel binary in the folder /usr/lib/debug/lib/modules/<kernel version>/
2.5.4 Identifying problems in the different parts of a kernel
Various configuration parameters enable traces or additional checks to be used on different kernel operations, for example locks, memory allocation and so on.
82. bios R421E1 <bios version> zip package is installed in the /usr/local/firmware directory.
   Run the command:
    ofupdate -b /usr/local/firmware/bios R421E1 <bios version> zip
4. Reboot the machine so that the new BIOS is active.
7.2.2 Installing the Bull HPC BIOS setup on a local R421 E1 machine
1. Run the command:
    ofupdate -c /usr/local/firmware/bios <bios version> scf
2. Reboot the machine.
Note: The BIOS setup configuration file for a particular BIOS version can only be restored on a machine which uses the same BIOS version.
7.2.3 Installing a new BIOS on a remote R421 E1 platform
1. Install the ofu RPM package, which contains the OFU Linux tools, on the remote R421 E1 platform. This can be done using SSH.
2. Install the bios R421E1 <bios version> RPM, which contains the BIOS package and the BIOS setup configuration file, on the local machine.
3. The corresponding bios R421E1 <bios version> zip package is installed in /usr/local/firmware.
4. Run the command:
    ofupdate -b /usr/local/firmware/bios R421E1 <bios version> zip -s <remote R421E1 system IP address>
5. Reboot the remote machine.
7.2.4 Installing the Bull BIOS Setup Configuration File on a remote R421 E1 machine
1. Run the command:
    ofupdate -c /usr/local/firmware/bios <bios version> scf -s <remote R421E1 system IP address>
2.
83. ble Enabled Advanced QuickBoot Mode Enabled QuietBoot Mode Disabled POST Errors Disabled ACPI Mode Yes ACPI Sleep Mode 51 Power Button Behavior Instant Off Resume On Modem Ring Off EFI os boot Disabled Keyboard On Now Function Disabled Boot Features Power Loss Control Stay Off Watch Dog Disabled Summary screen Disabled Cache System BIOS area Write Protect Cache Video BIOS area Write Protect Cache Base 0 512k Write Back Memory Cache Cache Base 512k 640k Write Back Cache Extended Memory Area Write Back Discrete MTRR Allocation Disabled PCI Configuration Onboard 1 OPROM Configure Enabled Onboard G LAN2 OPROM Configure Disabled Default Primary Video Adapter Other Option ROM Re Placement Disabled ROM Scan Ordering Onboard First 7 20 BAS5 for Xeon Maintenance Guide PCI Parity Error Forwarding Disabled PCI Fast Delayed Transaction Disabled Reset Configuration Data No Frequency for PCIX 1 2 Auto Option ROM Scan Enabled SLOTO PCI U X8 Enable Master Enabled Latency Timer Default Option ROM Scan Enabled Enable Master Enabled Latency Timer Default Option ROM Scan Enabled Enable Master Enabled Latency Timer Default Option ROM Scan Enabled SLOT3 PCI 33MHz Enable Master Enabled Latency Timer Default Option ROM Scan Enabled SLOT4 x16 Enable Master Enabled Latency Timer Default Option ROM Scan Enabled SLOTS PCI 33MHz Enable Master Enabled Latency Timer Default Option ROM Scan
84. can be disabled or reset using the port-manage command as below:
    iswu0c0-0(utilities)# port-manage
Description:
port-manage.sh is used to trigger a physical state change for the port specified. This is useful when the active width/speed of a specific port must be changed without the cable being reconnected.
Syntax:
    port-manage.sh [-f] <-d|-e|-r> <LID> <PORT>
Options:
    -v            Increase output verbosity level
    -f            Force disabling or resetting a port, even when the port is located on the Access Path (pathway to the specific port)
    -d lid port   Disable the port
    -e lid port   Enable the port (set port state machine to polling state)
    -r lid port   Reset the port
    -S lid port   Reset the port and set Enabled Speed to SDR
    -D lid port   Reset the port and set Enabled Speed to SDR/DDR
    -h            Show this help
Example:
    port-manage.sh -r 17 21     (reset LID 17, PORT 21)

2.4.4 Getting Information about Storage Devices (lsiocfg)
lsiocfg is a tool used for reporting information about storage devices. It is mainly dedicated to external storage systems (DDN and FDA disk arrays) and their dedicated Host Board Adapters (Emulex FC adapters), but it can also be used with internal system storage (system disks and their Host Board Adapters).
Reported information is related to several inventories:
- Host Board Adapters (-c flag)
- Disks (-d flag)
85. command.
stormap
This command checks the state of the virtual links.
Note: This command is included in the global checking performed by the ioshowall command.
lctl dl
This command checks the current status of the OST/MDT services on the node. For example:
    1 UP lov 51 lov e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    2 UP osc OSC nova9 ost nova6 ddn0 11 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    3 UP osc OSC nova9 ost nova10 ddn0 5 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    4 UP osc OSC nova9 ost nova6 ddn0 3 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    5 UP osc OSC nova9 ost nova10 ddn0 21 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    6 UP osc OSC nova9 ost nova6 ddn0 19 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    7 UP osc OSC nova9 ost nova10 ddn0 7 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    8 UP osc OSC nova9 ost nova6 ddn0 1 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    9 UP osc OSC nova9 ost nova10 ddn0 23 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    10 UP osc OSC nova9 ost nova6 ddn0 17 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313049885e 4
    11 UP osc OSC nova9 ost nova10 ddn0 13 MNT clientelan e0000047fcfff680 b02a458d-544e-974f-8c92-23313
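A long lctl dl listing such as the one above is easier to check when only the devices that are not UP are shown. The following is a minimal sketch, assuming the usual lctl dl column layout in which the second whitespace-separated field is the device state; the function name not_up_devices is hypothetical.

```shell
#!/bin/sh
# Sketch: read an "lctl dl" listing on stdin, print every line whose
# state field (second column) is not UP, and return status 1 if any
# such line was found. not_up_devices is a hypothetical helper name.
not_up_devices() {
    awk '$2 != "UP" { print; bad = 1 } END { exit bad + 0 }'
}
```

Typical use on an I/O node would be: lctl dl | not_up_devices. No output and a zero exit status mean that every listed service reports UP.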
86. connection output to file
    -L   Display license information
    -m   Monitor connection (read-only)
    -q   Query server about specified console(s)
    -Q   Be quiet and suppress informational messages
    -r   Match console names via regex instead of globbing
    -v   Be verbose
    -V   Display version information
Once a connection is established, enter &. to close the session, or &? to display a list of currently available escape sequences.
See the conman man page for more information.
Examples:
- To connect to the serial port of NovaScale bull47, run the command:
    conman bull47
Configuration File:
The /etc/conman.conf file is the conman configuration file. It lists the consoles managed by conman and the configuration parameters. The /etc/conman.conf file is automatically generated from the ClusterDB information. To change some parameters, the administrator should only modify the /etc/conman-tpl.conf template file, which is used by the system to generate /etc/conman.conf. It is also possible to use the dbmConfig command. See the Cluster Data Base Management chapter for more details.
See the conman.conf man page for more information.
Note: The timestamp parameter, which specifies the watchdog frequency, is set to 1 minute by default. This value is suitable for debugging and tracking purposes, but generates a lot of messages in the /var/log/conman file. To disable this function, comment the line SERVER timestamp=1m in the /etc/conman
87. count 544113 priority 3 state 3 SMINFO MASTER
The guid that is identified can then be used to find the corresponding switch name in the ibsdb chassis table.
dbupdatepc
Use the dbupdatepc action to update the port counters for an existing IBSDB database. Use the command below:
    ibs -a dbupdatepc -vNE
availability
Use the availability action to see which ports and links are available for the InfiniBand interconnects. This action will not work unless the IBSDB database has been created and populated.
    ibs -s iswu0c0-0 -a availability
This will give results in a similar format to that below:
    Active ports 74
    Active uplinks 16
    Active downlinks 21
Return Values:
IBS returns 0 for success. Any other value indicates a failure.

2.4.3 Monitoring Voltaire Switches (switchname)
Different options exist for monitoring and maintaining the performance of Voltaire switches. To begin with, enter the utilities menu as follows:
    user@host$ ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname# utilities
    switchname(utilities)#
2.4.3.1 Resetting the counters
The counters (volume and errors) can be reset through the zero-counters command as follows:
    switchname(utilities)# zero-counters
    Zero All Counters
    Zero lid 8 port 255 mask 0xffff
2.4.3.2 Finding bad ports
The find-bad-ports command can be use
88. cripts to retrieve information about processors and memory. These scripts are the following:
    Script name     Description
    procTest.pl     Retrieves the number of CPUs available for the node
    memTest.pl      Retrieves the size of memory available for the node
    modelTest.pl    Retrieves model information for the CPUs available on the node
Then postbootchecker returns this information to the Management Node using syslog-ng.
postbootchecker checks for the Management Node
On the Management Node, the postbootchecker server gets the information returned from the Compute Nodes and compares it with the information stored in the ClusterDB:
- The number of CPUs available for a node is compared with the nb_cpu_total value in the ClusterDB.
- The size of memory available for a node is compared with the memory_size value in the ClusterDB.
- The CPU model type for a node is compared with the cpu_model value in the ClusterDB.
If discrepancies are found, the ClusterDB is updated with the new values. In addition, the Nagios status of the postbootchecker service is updated as follows:
- If the discrepancies concern the number of CPUs or the memory size, the service is set to CRITICAL.
- If the discrepancies concern the model of the CPUs, the service is set to WARNING.
If no discrepancies were found, the service is OK.

Chapter 3 Troubleshooting
Troubleshooting deals with the unexpected and is an importa
89. d follow the instructions that are displayed on the screen.
5. The ISO images to be burnt to DVD will be created in the directory specified in the configuration file (ISO_DIR parameter). By default this is /tmp.
Note: Check the mkcdrec.log file in case of problems.
6. It is recommended to burn the ISO image to DVD immediately. To do this, please refer to the Bull System Backup Restore User's Guide.

2.3.4 Restoring a System
To restore a system, boot on the first DVD-ROM. An automatic procedure is started which is suitable in most cases. For more control over the restoration procedure, you can stop the automatic restoration by pressing Enter when the following message is displayed:
    Automatic Disaster Recovery (AUTODR) Mode is active.
    Press Enter key to interrupt AUTODR mode within 20 seconds.
Then launch the restore manually using the following commands:
    cd /etc/recovery
    ./start-restore.sh
When the restore is completed, reboot the machine using the reboot command.

2.4 Monitoring Maintenance Tools
2.4.1 Checking the status of InfiniBand Networks (ibstatus, ibstat)
2.4.1.1 ibstatus Command
ibstatus displays basic information obtained from each InfiniBand driver for the local adapter included in an InfiniBand network. Normal output includes LID, Subnet Manager LID, port state (UP or DOWN), port physical state and the link width in term
90. d to detect faulty ports:
    switchname(utilities)# find-bad-ports
    [find-bad-ports output listing: an ISR9024D Voltaire node and Port 4 are reported; the remaining fields are not legible in the source]
2.4.3.3 Verifying the ports
The whole InfiniBand fabric can be checked using the port-verify command as follows:
    switchname(utilities)# port-verify
    Topology file generated on Thu Oct 4 20:19:24 2007
    devid=0x5a31
    switchguids=0x8f1040041254a
    Switch 24 S-0008f1040041254a ISR9024D-M Voltaire smalid 8
    [1] S-0008f10400411946 [13] width 4X speed 5.0 Gbs
    [2] S-0008f10400411946 [14] width 4X speed 5.0 Gbs
    [3] S-0008f10400411946 [15] width 4X speed 5.0 Gbs
    ...
    devid=0x6282
    hcaguids=0x2c9020024b940
    2 H-0002c9020024b940 zeus8 HCA-1
    [1] S-0008f1040041281e [1] lid 72 3 width speed 5.0 Gbs
    SUMMARY: NO PROBLEMS DETECTED
2.4.3.4 Checking the port width
To ensure the best performance, check that the ports are running in 4x mode as follows:
    switchname(utilities)# width-check
    Verify: every error found will be printed
    lid 8 guid 0008f1040041254a ports 24
    lid 160 guid 0008f1040041281e ports 24
    lid 152 guid 0008f10400411946 ports 24
2.4.3.5 Dealing with a faulty port
When a faulty port is diagnosed it
91. de2,node3 bd210a7.all /tmp
    pdsh -w node1,node2,node3 lpflash -m lp11000 -f /tmp/bd210a7.all

2.3 Saving and Restoring the System (BSBR)
To save and restore the Management Node system, use BSBR (Bull System Backup Restore). BSBR is based on the mkCDrec (make recovery CD-ROM) Open Source tool, used to create a bootable Linux system image to restore the system after a problem has occurred, such as a disk crash or a system intrusion.
BSBR is available on the Bull Extension Pack CD delivered with the Red Hat media.
This section applies to the version of BSBR based on the mkcdrec-0.8.7-2.b.5.8.4.Bull RPM.
Note: This section highlights the information you must be aware of when you use BSBR in an HPC environment. For more information about installing and using the product, please refer to the Bull System Backup Restore User's Guide (86 A2 73EV), available on the Bull Extension Pack CD.
The system backups are saved on DVD-ROM, or on an NFS mounted disk, or on tape.
Note: BSBR is designed to back up the operating system in place on a node. BSBR should NOT be used for data backup. Use a different method to do this.
A typical example of usage is to run BSBR every night for a system and store the ISO images on another system via NFS, or to burn the images onto DVD-ROMs which can then be used to restore the system.
The Management Node system files should be backed up regular
92. dentify any port/switch problems in the InfiniBand Fabric:
    ibtracert 0x22 0x2
    >From ca 0008 10403979958 portnum 1 lid 0x22-0x22 lynx13 HCA-1
    [1] -> switch port 0008 104004118 2 [8] lid 0x4-0x4 ISR9024D Voltaire
    [13] -> switch port 0008 104004118 8 [16] lid 0x3-0x3 ISR9024D-M Voltaire
    [21] -> switch port 0008 104004118 4 [13] lid 0x1-0x1 ISR9024D Voltaire
    [4] -> ca port 0008 10403979985 [1] lid 0x2-0x2 lynx19 HCA-1
    To ca 0008 10403979984 portnum 1 lid 0x2-0x2 lynx19 HCA-1
In short:
    OUT lynx13 (lid 0x22) port 1
    > INTO node switch (lid 0x4) port 8
    > OUT node switch (lid 0x4) port 13
    > INTO top switch (lid 0x3) port 16
    > OUT top switch (lid 0x3) port 21
    > INTO node switch (lid 1) port 13
    > OUT node switch (lid 1) port 4
    > INTO lynx19 (lid 0x2) port 1

2.5.3 Using dump tools with RHEL5 (crash, proc, kdump)
2.5.3.1
Various tools allow problems to be analysed whilst the system is in operation:
- crash portrays system data symbolically, using the possibilities provided by the GDB debugger. The commands which it offers are system oriented, for example the list of tasks, tracing function calls for a task which is waiting, etc. See the crash man page for more information.
- The system file /proc may be used to view and, if necessary, modify system informat
93. dress where platform equals either r421, r422 (for NovaScale R422 and R422 E1 machines) or R423.
Usage:
    updatefw x86_64 -f <Firmware File>
    updatefw x86_64 -i <IP Address> -u <Usr> -p <Pwd> -f <Firmware File>
    sdrload <SDR file> <bmc ipaddr> <user name> <user passwd>
        SDR file      SDR file provided by the sdredit command
        bmc ipaddr    The BMC address of the remote machine. If no address is provided, the local SDR repository is updated.
        user name     BMC user name
        user passwd   BMC user password
To update the BMC firmware using the Web interface:
See the Bull NovaScale R42x AOC-SIMSO/SIMSO+ Installation and User's Guide for more information.

Updating the BMC Firmware on NovaScale R421 E1 machines
The BIOS and BMC firmware for NovaScale R421 E1 machines are updated together, in the same operation, using the Intel One-boot Flash Utility (OFU). See section 7.2 in this manual for details regarding the use of this utility.

6.4 Updating the BMC firmware on NovaScale R440 and R460 machines
The BMC update for these platforms is carried out using the Bull Update BIOS CD, which is also used to upgrade the BIOS and FRUs and is available from the Bull support site. Follow the instructions provided with the CD.

6.5 Reconfiguring the BMC on R4xx machines
The BMCs are configured in the factory before the machines are delivered.
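Returning to the sdrload usage given earlier in this fragment, the choice between a local and a remote SDR update can be sketched as follows. This is only an illustration under stated assumptions: the helper name build_sdrload_cmd is hypothetical, the arguments are assumed to be positional in the order of the usage line, and the function only prints the command line so that it can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: print an sdrload command line that follows the usage text.
# build_sdrload_cmd is a hypothetical helper; it runs nothing itself.
build_sdrload_cmd() {
    sdr_file="$1"; bmc_ip="$2"; user="$3"; passwd="$4"
    if [ -z "$sdr_file" ]; then
        echo "usage: build_sdrload_cmd <SDR file> [<bmc ipaddr> <user name> <user passwd>]" >&2
        return 2
    fi
    if [ -z "$bmc_ip" ]; then
        # No BMC address given: the local SDR repository is the target.
        printf 'sdrload %s\n' "$sdr_file"
    else
        # BMC address given: the remote machine is the target.
        printf 'sdrload %s %s %s %s\n' "$sdr_file" "$bmc_ip" "$user" "$passwd"
    fi
}
```

For example, build_sdrload_cmd board.sdr prints the local form, while passing an address, a user name and a password prints the remote form. The file name board.sdr is a placeholder.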
94. dule has to be loaded when using lptools (check with lsmod).
Firmware updates are available from the Emulex Web site.
On a node, you can get the current firmware level from all the Emulex HBAs using the lsiocfg tool (getting information about storage devices).
WARNING: Be sure that FC devices are not being used when upgrading the Emulex HBA firmware.
lputil
This low-level tool should not be used in standalone mode. Please refer to the on-line help when using this tool.
lpflash
lpflash flashes Emulex HBAs with the specified firmware file. lpflash may be used to upgrade all the HBAs on a server in one shot.
Syntax:
    lpflash -m <Model> -f <path to firmware> <-v> <-h> <-V>
Flags:
    -m model   Emulex HBA model to flash (case insensitive)
    -f file    firmware file
    -v         verbose mode
    -h         displays help
    -V         displays version
Example:
    lpflash -m lp11000 -f /tmp/bd210a7.all
This command will upgrade all LP11000 HBAs to 2.10A7 firmware.

2.2.6.3 Upgrade Emulex Firmware on Multiple Nodes
Running the pdcp/pdsh commands, Emulex firmware can be upgraded in one shot on a set of nodes:
- use pdcp to copy the new firmware file on all the nodes
- use pdsh to run lpflash on these nodes
Example:
The following commands copy the Emulex firmware file on to nodes node1, node2 and node3, and then upgrade all Emulex LP11000 HBAs on these nodes with firmware 2.10A7:
    pdcp -w node1,no
95. e BSBR BSBR Bull System Backup restore 2 16 C CLI Remote Hardware Management 2 8 clustat command 3 20 ClusterDB CPU and memory values 2 44 Commands clustat 3 20 conman 2 2 crash 2 4 dbmConfig 1 3 e2fsck 3 16 ibchecknet 3 9 ibcheckportwidth 3 10 ibcheckwidth 3 10 ibdoctor 2 38 ibnetdiscover 3 9 ibstat 2 20 ibstatus 2 20 ibtracert 2 39 ioshowall 3 20 ipmitool 2 4 Ictl 3 21 Imdiag 3 27 Ipflash 2 14 Iputil 2 14 Isiocfg 2 34 lustre check 3 18 lustre migrate hastat 3 18 lustre migrate nodestat 3 18 lustre util 3 19 nec admin 3 15 nsctrl 1 1 openib 3 5 perfquery 3 7 postbootchecker 2 44 SINFO 1 1 smpquery 3 5 storioha 3 20 stormap 3 2 switchname 2 31 ulimit 2 38 ConMan using 2 2 conman conf file 2 3 Core Dump Size modifying 2 38 cpu model 2 44 crash 2 41 D dbmConfig command 1 3 Debugging tools 2 38 Dump Size modifying 2 38 E e2fsck command 3 16 Emulex FC adapter 2 34 Emulex HBA firmware upgrading 2 14 F FDA troubleshooting 3 15 files conman conf 2 3 mkcdrec Config sh 2 17 syslog ng conf 2 9 firmware update BMC 6 1 Index 1 1 InfiniBand switch 4 1 MegaRAID card 5 1 Voltaire switch 4 1 FLEXIm License Manager troubleshooting 3 27 H HA consistent state 3 22 HA Lustre troubleshooting 3 18 Hardware Management CLI 2 8 ibchecknet command 3 9 ibcheckportwidth command 3 10 ibcheckwidt
96. e Disk partitions flag e Disk usages Syntax According to needed information lsiocfg can be used with options related to each inventory e lsiocfg v c HBAs IDs Gives information about all SCSI controllers If HBAs IDs are specified only applies to this list of HBAs e lsiocfg d 4 devices names Gives information about SCSI devices u has to be used to display non disk devices If devices are specified only applies to this list of devices e lsiocfg p Displays partitions siocfg a Dsplays all cdp e siocfg user n remote node P c d a Gives information from remote node about controllers disks e siocfg M devices names Gives information about SCSI devices usage e siocfg 4 L lt wwpn gt Reports WWPN owner The l flag uses etc wwn file and the L flag uses cluster manager database e siocfg lt w W gt Displays all WWPN owners The flag uses etc wwn file and the W flag uses cluster manager database 2 34 BAS5 for Xeon Maintenance Guide General flags P No headers before a c d commands v Verbose before a c d commands WWPN verbose information is extracted from etc wwn file h Help message Exclusive with other options V Display the version Exclusive with other options Online help and a man page give information about lsiocfg usage 2 4 4 1 HBA Inventory Using the lsiocfg HBA inventory op
97. e ieee 1 RUE 1 ee les E Channel Adapter 2 SystemnOuE d dd a 0 0008 10403977 7 EC 0 0008 10403977 4 E 0 0008 10403977 a noe ane eee are 64 Dedi eer ere hee era se tear 0x5a04 REV ISU we ne Uv SC TRY E 0x000000a1 omo up 2 Vendor 0x0008f1 portinfo example An example of use of this command including the Local ID and the port number is below smpquery portinfo 45 1 The resulting information output will be similar to that displayed below PC 0x0000000000000000 ewe E 0 80000000000000 cT pH 0x002d CE 0x0003 LE 0x500a68 IsTrapSupported TsAutomaticMigrationSupported IsSLMappingSupported IsLedInfoSupported IsSystemImageGUIDsupported IsVendorClassSupported IsCapabilityMaskNoticeSupported prag COGS ie us west 0x0000 0 QR S 2 LinkWidthEnabled 1X or 4X LinkWidthSupported 1X or 4X 5 4X LinkSpeedSupported
98. e path to the vendor daemon on the VENDOR line, if present, or any right half of a string (b) of the form a=b where a is all lower case. Any other changes will invalidate the license.
Be cautious when transferring data received by Mailers. Many Mailers add characters at the end of line that may confuse the reader about the real license data.

3.8.2 Using the lmdiag utility
The lmdiag command analyzes a license file with respect to the SERVER, the FEATUREs, license counts and dates. It may help you to understand problems that may occur. lmdiag attempts to check out all FEATUREs and explains failures. You may run extended diagnostics, attempting to connect to the license manager on each port on the host.

3.8.3 Using the INTEL_LMD_DEBUG Environment Variable
Setting this environment variable will cause the application to produce product diagnostic information at every checkout.
Daemon Startup Problems
- Cannot find license file: Most products have a default location in their directory hierarchy, or use /opt/intel/licenses/server.lic. The environment variable INTEL_LICENSE_FILE names this directory. Startup may fail if these variables are set wrongly or if the default location for the license is missing.
- No such Feature exists: The most common reason for this is that the wrong license file, or an outdated copy of the file, is being used.
- Retrying Socket Bind: This means the TCP port number is already in use. Almost always, this means an lmgrd intel
99. ection Main Memory Configuration PCI Contiguration Advanced Peripheral Configuration Advanced Chipset Control Security 7 26 System Time lt Current local time gt lt Current date gt Disabled Processor Retest No System Date Hard Disk Pre Delay P Sui Execute Disable Bit Disabled rocessor Settings ro E Intel R Virtualization Tech Disabled Enhanced Intel SpeedStep R Tech Disabled English 05 Memory Retest No Extended RAM Step Disabled Memory RAS Feature Interleave Sparing Disabled VGA Controller Enabled Onboard VGA Option ROM Auto Scan LAN Controller Option ROM Scan LAN2 Option ROM Scan Onboard Video Controller Enabled Enabled Enabled Enabled Enabled Enabled 3F8 IRQ 4 Enabled 2F8 IRQ 3 Enabled Enabled Enabled Compatible Enabled Enabled Enabled Disabled Disabled Enabled Onboard LAN PCI Slot 1B Option ROM PCI Slot 1C Option ROM Base I O address Interrupt Serial port A Base I O address Interrupt Serial port B USB 2 0 Controller Parallel ATA Serial ATA SATA Controller Mode Option Multimedia Timer Intel R I OAT Wake On LAN PME Wake On Ring Wake On RTC Alarm BooHime Diagnostic Screen Reset Configuration Data No NumLock On Memory Processor Error Clear Clear Disabled Supervisor Password 15 User Password 15 Password on boot Fixed disk boot sector BAS5 for Xeon Maintenance Guide Disabled rr Sui iibi Serve
100. ed in Most of the time the information in the excluded node list allows the source of the problem to be identified without the need for further analysis Possible Deployment Problems There are 2 areas where deployment problems may occur Pre check problems Before the image is deployed node states are verified in the ClusterDB Database and through the use of nsm commands If there are any problems the nodes in question will be excluded for the deployment The error will be displayed once the deployment has finished and will also be logged in the tmp ksisServer ksis exclude nodes list file Troubleshooting 3 11 3 3 2 2 3 12 Image transfer problems Problems may occur during the phase when the image is being transferred onto the target nodes These problems are logged and centralised by Ksis on the Management Node The errors will be displayed once the deployment has finished and will also be logged in the tmp ksisServer ksis exclude nodes list file ksis image server logs ksis server logs are saved on the Management Node in var lib systemimager overrides ka d server log and Ksis server traces are saved on the Management Node in var lib systemimager overrides server log Note Traces are only possible for the ksis server and for client nodes if the deploy command is executed using g option ksis image client logs ksis client logs on the Management Node in var lib systemimager overrides imagi
101. efault Option ROM Scan Enabled SLOT5 x8 Enable Master Enabled Latency Timer Default Option ROM Scan Enabled SLOT6 PCI Exp x8 Enable Master Enabled Latency Timer Default Dos SERR signal condition Single bit Clock Spectrum Feature Disabled Intel VT for Directed O Disabled AGB PCI Hole Granularity 256 MB Memory Voltage Auto SLOTS PCI Exp 8 Memory Branch Mode Interleave Branch O Rank Interleave 4 1 Branch Rank Sparing Disabled Branch 1 Rank Interleave 4 1 Branch 1 Rank Sparing Disabled Enhanced x8 Detection Enabled Demand Scrub Enabled High Temp DRAM OP Disabled AMB Thermal Sensor Disabled Thermal Throttle Disabled Global Activation Throttle Disabled Force ITK Config Clocking Disabled Snoop Filter Enabled Crystal Beach Feature Enabled Route Port 80h cycles to LPC High Precision Event Timer No USB Function Enabled Legacy USB Support Enabled Advanced Chipset Control Frequency Ratio Default Core Multi Processing Enabled Machine Checking Disabled Fast String operations Enabled Thermal Management 2 Enabled C1 C2 Enhanced Mode Disabled Advanced Processor Options Execute Disable Bit Enabled Adjacent Cache Line Prefetch Enabled Hardware Prefetcher Enabled Set Max Ext CPUID 3 Disabled Direct Cache Access Disabled Intel R Virtualization Technology Disabled Intel EIST support Disabled 7 18 BASS for Xeon Maintenance Guide KBC Clock Input 12MHz Serial port A Enabled Base
102. emory Processor Error User Password ls Clear Supervisor Password 15 Clear Securily Password on boot Disabled Fixed disk boot sector Normal Power Switch Inhibit Disabled Disable USB Ports Disabled Server BIOS Redirection Port Serial Port B Baud Rate 115 2K Flow Control CTS RTS Console Redirection Terminal Type 100 Continue Redirection after POST Enabled Remote Console Reset Disabled Shared BMC LAN IP Address Subnet Mask BMC ip address Default Gateway DHCP Disabled HTTP Enabled Sai HTTP Port Number 80 HTTPS Enabled HTTPS Port Number 443 Telnet Enabled Telnet Port Number 23 SSH Disabled SSH Port Number 22 Assert INMI on PERR Enabled Assert NMI on SERR Enabled FRB 2 Policy Disable BSP Boot Monitoring Disabled Managing the BIOS on NovaScale R4xx Machines 7 31 BMC IRQ IRQ 11 ACAINK Stay Off USB FDC USB KEY 5 PCI BEV IBA GE Slot 1900 v1260 7 32 5 for Xeon Maintenance Guide Glossary and Acronyms Administration Configuration Tool B BAS Bull Advanced Server BIOS Basic Input Output System BMC Baseboard Management Controller BSBR Bull System Backup Restore C CL Command Line Interface D DDN Data Direct Networks DHCP Dynamic Host Configuration Protocol E ECT Embedded Configuration Tool F FDA Fibre Disk Array FRU Field Replaceable Unit FTP File Transfer Protocol G GCC GNU C Compiler GNU GNU s Not Unix GPL
103. erature Threshold 75 C Monitor Fan Speed Control Modes 1 Disable Full spe System Event Logging Enabled Clear System Event Log Disabled SYS Firmware Progress Disabled BIOS POST Errors Enabled BIOS POST Watchdog Disabled OS boot Watchdog Disabled Timer for loading OS min 10 Time out action Action Supervisor Password 15 Clear Security User Password 15 Clear Password on boot Disabled USB FDC 2 USB CDROM 3 USB KEY 4 PCI BEV IBA GE Slot 0400 v1236 Boot 5 IDE 4 WDC WD1600YS 01SHB1 52 6 7 8 7 8 5 for Xeon Maintenance Guide 7 4 3 NovaScale R421 E1 BIOS Settings motherboard 554005 R421 EI BIOS 55400 868 06 00 0023 Quiet Boot Disabled Post Error Pause Disabled Main System Date Current date System Time Current local time Serial ATA Enabled Advanced Enhanced Intel Speedstep Enabled Core Multi Processing Enabled Intel R Virtualization Technology Disabled Intel VT for Directed I O Disabled Processor Simulated MSI support Disabled Configuration Execute Disable Bit Disabled Hardware Prefetcher Enabled Adjacent Cache Line Prefetch Enabled IOAT2 enable Enabled Processor Retest Disabled Memory RAS amp performance Memory RAS configuration RAS Disabled Memory Snoop Filter Enabled Configuration FSB High Bandwith Enabled Optimisation Onboard PATA Controller Enabled Onboard SATA Controller Enabled ATA SATA Mode Enhanced Configuration AHCI Mode Disabled Configure SATA as RAID
104. example COMI BAS5 for Xeon Maintenance Guide e Commands files directories and other items whose names are predefined by the system are in Bold as shown below The etc sysconfig dump file e use of Italics identifies publications chapters sections figures and tables that are referenced e lt gt identifies parameters to be supplied by the user for example node name WARNING A Warning notice indicates an action that could cause damage to a program device system or data CAUTION A Caution notice indicates the presence of a hazard that has the potential of causing moderate or minor personal injury Preface iii iv BAS5 for Xeon Maintenance Guide Table of Contents dri Fe i Chapter 1 Stopping Starting Procedures 1 1 1 1 Stopping Restarting a eme deri tend Funes dL Un uid ners 1 1 tll Stopping a Node Um 1 1 Restarting d e E 1 2 1 2 Stopping Restarting an Ethernet 1 3 1 3 Stopping Restarting Backbone Switch 1 3 1 4 Stopping Restarting the HPC Cluster EY Pri etre 1 4 38EE 0000 1 9197 c 1 4 1 4 2 Starting the HPC a arsenic 1 4 Chapter 2 Day to Day Maintenance Operations
105. > sel list
- To display the MAC address of the BMC:
    ipmitool -I lan -H <ip addr> raw 0x06 0x52 0xa0 0x06 0x08 0xef
To know more about the ipmitool command, enter:
    ipmitool -h

2.2.2 Stopping/Starting the Cluster (nsclusterstop, nsclusterstart)
The nsclusterstop and nsclusterstart scripts are used to stop or start the whole HPC cluster. These scripts launch the various stages in sequence, making it possible to stop or start the cluster in full safety. For example, the stop process includes the following main steps:
- checking the various equipment
- stopping the file systems (Lustre for example)
- stopping the storage devices
- stopping the nodes, except the Management Node
nsclusterstop and nsclusterstart use two configuration files, /etc/clustmngt/nsclusterstart.conf and /etc/clustmngt/nsclusterstop.conf, whose values can be changed. The --file option allows you to specify another configuration file.
These files define:
- the delay parameters between the different stages required to stop or start the cluster
- the sequence in which the groups of nodes should be stopped or started. You can run dbmGroup show to display the configured groups.
Usage:
    /usr/sbin/nsclusterstop [-h] [-f, --file <filename>]
    /usr/sbin/nsclusterstart [-h] [-f, --file <filename>]
Options:
    --file <filename> f
106. h command 3 10 ibdoctor 2 38 ibnetdiscover command 3 9 IBS command 2 22 availability action 2 30 bandwidth action 2 25 config action 2 28 dbcreate action 2 28 dbdelete action 2 29 dbpopulate action 2 29 dbupdate action 2 29 dbupdatepc action 2 30 E option 2 22 errors action 2 25 group action 2 28 group csv file 2 28 N option 2 22 topo action 2 23 IBS tool 2 22 NetworkMap xml 2 22 portcounters sav 2 22 IBSDB Database 2 22 2 28 ibstat command 2 20 ibstatus command 2 20 ibtracert 2 39 Infiniband status 2 20 2 BASS for Xeon Maintenance Guide InfiniBand switch firmware update 4 1 INTEL_LMD_DEBUG environment variable 3 27 ioshowall command 3 20 ipmitool using 2 4 K Kernel problems 2 43 L command 3 21 licenses 3 27 Imdiag command 3 27 Ipflash command 2 14 Iptools 2 14 Iputil command 2 14 lsiocfg command 2 34 Ismod command 2 14 Lustre HA troubleshooting 3 18 Lustre failover service 3 18 lustre_check command 3 18 lustre HA DBDaemon log file 3 20 lustre HA ddmm log 3 21 lustre HA ddmm log file 3 20 lustre 3 22 lustre migrate 3 22 lustre migrate hastat command 3 18 lustre migrate nodestat command 3 18 lustre util command 3 19 M macros use in file names 2 11 maintenance tools 2 1 MegaCLl tool 5 1 MegaRAID card firmware update 5 1 Mellanox card 4 1 memory size 2 44 mkCDrec 2 16 mkcdrec Config sh file
107. he maximum size for core dump files for Bull HPC systems is set to 0, which means that no resources are available and core dumps cannot be done. In order that core dumps can be done, the values for the ulimit command have to be changed. For more information, refer to the options for the ulimit command in the bash man page.

Identifying InfiniBand Network Problems (ibdoctor, ibtracert)
ibdoctor is a Bull tool which calls on the ibtracert, ibnetdiscover and smpquery diagnostic tools, whilst at the same time interfacing with the ClusterDB database, so that any problems in the InfiniBand network can be identified easily.
ibdoctor Command
ibdoctor may be used to:
- identify where any problem adapters or nodes are located
- display communication paths, including bandwidth between ports, in a human readable format
Options:
    -s src lid   Use specified source lid
    -d dst lid   Use specified destination lid
    -t           Trace route between src lid and dst lid
    -T           Report the fabric state over all known routes
    -h           Help
Example:
To display status data for the path between two InfiniBand adapters with the local identifiers 0x14 and 0x1e, enter:
    ibdoctor -t -s 0x14 -d 0x1e
The output looks as follows:
    11 0x11 port 2 guid 0008f10400411da2 state Active width rate
    11 0x11 port 12 guid 0008f10400411da2 state Active width rate
- The -T option completes an exhaustive scan of the network and traces
108. ial B None 115 2k VT100 Disabled 0 PATA DVD if present IBA GE Slot 600 v1240 SATA 0 EFI shell SATA 0 SATA 1 SATA 2 IBA GE Slot 600 v1240 Disabled

7.4.4 NovaScale R422 BIOS Settings (motherboard X7DBT / X7DGT)

R422 BIOS 1.3c

BIOS setup section / Parameter: Value

Main
    System Time: <Current local time>
    System Date: <Current date>
    Serial ATA: Enabled
    Native Mode Operation: Serial ATA
    SATA Controller Mode Option: Compatible

Advanced / Boot Features
    QuickBoot Mode: Enabled
    QuietBoot Mode: Disabled
    POST Errors: Disabled
    ACPI Mode: Yes
    Power Button Behaviour: Instant Off
    Resume On Modem Ring: Off
    Power Loss Control: Stay Off
    Watch Dog: Disabled
    Summary screen: Disabled

Advanced / Memory Cache
    Cache System BIOS area: Write Protect
    Cache Video BIOS area: Write Protect
    Cache Base 0-512k: Write Back
    Cache Base 512k-640k: Write Back
    Cache Extended Memory Area: Write Back
    Discrete MTRR Allocation: Disabled

Advanced / PCI Configuration
    Onboard G-LAN1 OPROM Configure: Enabled
    Onboard G-LAN2 OPROM Configure: Disabled
    Default Primary Video Adapter: Onboard
    Emulated IRQ Solution: Disabled
    PCI-e Performance: Payload 256B
    PCI Parity Error Forwarding: Disabled
    ROM Scan Ordering: Onboard First
    Reset Configuration Data: No
    SLOT1 PCI-Exp x8
        Option ROM Scan: Enabled
        Enable Master: Enabled
        Latency Timer: Default
    Large Disk Access Mode

Advanced / Advanced Chipset Control
    SERR signal condition: Single bit
    4GB PCI Hole Granularity
109. This file provides a trace of the commands issued by the nodes to update the LDAP and ClusterDB databases. This information should be compared with the actions performed by CSS.

Note: In lustre_HA-ddmm.log, dd specifies the day and mm the month of the creation of the file.

/var/log/lustre/HA-DBDaemon=yy-mm-dd.log
This file provides a trace of any ClusterDB updates that result from the replication of LDAP. This could be useful if Lustre debug is activated at the same time.

On the Nodes of an I/O Pair

The following tools must be run from the I/O nodes.

ioshowall
This command allows the configuration to be checked. Look at the /etc/cluster/cluster.conf file for any problems if the following error is displayed:

    cannot connect to <PAP address> or HWMANAGER

If the following error appears, check whether the node is an inactive pair node; otherwise, start the node again:

    service lustre_ha inactif

clustat
Displays a global status for Cluster Suite 4 from the HA cluster point of view.

Note: If there is a problem, the two pair nodes may not have the same view of the HA cluster state.

storioha -c status
This command checks that all the Cluster Suite 4 processes are running properly (running state).

Notes:
• This command is equivalent to the following one on the Management Node:
      stordepha -c status -i <node>
• This command is included in the global checking performed by the ioshowall
110. ing Disabled

Advanced / PCI Configuration
    Onboard Video Controller
        VGA Controller: Enabled
        Onboard VGA Option ROM Scan: Auto
    Onboard LAN
        LAN Controller: Enabled
        LAN1 Option ROM Scan: Enabled
        LAN2 Option ROM Scan: Enabled
    PCI Slot 1B Option ROM: Enabled
    PCI Slot 1C Option ROM: Enabled

Advanced / Peripheral Configuration
    Serial port A: Enabled
        Base I/O address: 3F8
        Interrupt: IRQ 4
    Serial port B: Enabled
        Base I/O address: 2F8
        Interrupt: IRQ 3
    USB 2.0 Controller: Enabled
    Parallel ATA: Enabled
    Serial ATA: Enabled
    SATA Controller Mode Option: Compatible

Advanced / Advanced Chipset Control
    Multimedia Timer: Enabled
    Intel(R) I/OAT: Enabled
    Wake On LAN/PME: Enabled
    Wake On Ring: Disabled
    Wake On RTC Alarm: Disabled

Advanced
    Boot-time Diagnostic Screen: Enabled
    Reset Configuration Data: No
    NumLock: On
    Memory/Processor Error

Security
    Supervisor Password Is: Clear
    Password on boot: Disabled
    Power Switch Inhibit: Disabled

Server
    ACPI Redirection Port: Disabled
    Console Redirection, Flow Control
    Remote Console Reset: Enabled
    Assert NMI on SERR: Enabled
    Boot Monitoring: Disabled
    Thermal Sensor: Enabled
    Post Error Pause: Enabled
    Power On 76

Boot
    Slot v1236

7.4.9 NovaScale R440 SAS BIOS Settings

System part number: N8100-1243E
R440 SAS BIOS: 5546
Motherboard jumper settings: JSASRAID2 1-2 (RAID disable)

BIOS setup s
111. ion. In particular, it can be used to examine system information for different tasks, the state of the memory allocation, etc. See the proc man page for more information.

• In the event of a system crash, memory will be written to the configured disk location using kdump. Upon subsequent reboot, the data will be copied from the old memory, formatted into a vmcore file and stored in the /var/crash subdirectory. The end result can then be analysed using the crash utility. An example command is shown below:

    crash /usr/lib/debug/lib/modules/<kernel_version>/vmlinux vmcore

Configuring systems to take dumps from the Management Network

In addition to forcing a dump for a kernel crash, it is possible to force a dump using the ipmitool command from the Management Node. This is done as follows.

Add nmi_watchdog=0 to the kernel boot options in the /boot/grub/menu.lst file, in order to deactivate the NMI watchdog used by RHEL so that the other NMIs can be put into effect. An example of the menu.lst file is shown below:

    kernel /vmlinuz-2.6.18-53.d5.ELsmp ro root=LABEL=/ nmi_watchdog=0 console=tty0 console=ttyS1,115200n8 console=ttyS0,115200n8 rhgb quiet

Once the system has been restarted, the kernel has to be reconfigured so that a panic is launched when an unknown NMI is received. This can be set to happen automatically by configuring the kernel.unknown_nmi_panic = 1 option in the /etc/sysctl.conf file. Alternatively this can
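The sysctl setting described above can be verified with a small script. The following Python sketch checks whether a sysctl.conf-style text enables kernel.unknown_nmi_panic; the function name and the idea of checking the file from a script are assumptions for illustration, not something the guide prescribes.

```python
def unknown_nmi_panic_enabled(sysctl_text: str) -> bool:
    """Return True if the text sets kernel.unknown_nmi_panic to 1.

    Comment lines (starting with # or ;) are ignored. When the key is
    assigned several times, the last assignment wins, matching a file
    that is applied from top to bottom.
    """
    enabled = False
    for raw in sysctl_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "kernel.unknown_nmi_panic":
            enabled = value.strip() == "1"
    return enabled


if __name__ == "__main__":
    with open("/etc/sysctl.conf") as handle:
        print(unknown_nmi_panic_enabled(handle.read()))
```

The check only covers the configuration file; it does not tell whether the running kernel has already picked up the value.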
112. ion ROM: Enabled
    PCI Slot 3C Option ROM: Enabled

Advanced / Peripheral Configuration
    Serial port A: Enabled
        Base I/O address: 3F8
        Interrupt: IRQ 4
    Serial port B: Enabled
        Base I/O address: 2F8
        Interrupt: IRQ 3
    USB 2.0 Controller: Enabled
    Parallel ATA: Enabled
    Serial ATA: Enabled
    SATA Controller Mode Option: Compatible

Advanced / Advanced Chipset Control
    Multimedia Timer: Enabled
    Intel(R) I/OAT: Enabled
    Wake On LAN/PME: Enabled
    Wake On Ring: Disabled
    Wake On RTC Alarm: Disabled

Advanced
    Boot-time Diagnostic Screen: Enabled
    Reset Configuration Data: No

Security
    Supervisor Password Is: Clear
    User Password Is: Clear
    Password on boot: Disabled

Server
    BIOS Redirection Port: Serial Port B
    ACPI Redirection Port: Disabled
    Baud Rate: 115.2K
    Flow Control: None
    Terminal Type: VT100
    Remote Console Reset: Enabled
    Assert NMI on PERR: Enabled
    Assert NMI on SERR: Enabled
    FRB-2 Policy: Retry 3 Times
    Boot Monitoring: Disabled
    Boot Monitoring Policy: Retry 3 Times
    Thermal Sensor: Enabled
    BMC IRQ: IRQ 11
    Post Error Pause: Enabled
    AC-LINK: Stay Off
    Power On Delay Time: 20
    Platform Event Filtering: Enabled

Boot
    USB FDC
    USB CDROM
    USB KEY
    IDE CD
    PCI BEV: IBA GE Slot 0C00 v1236
    PCI SCSI

7.4.11 NovaScale R480 E1 BIOS Settings

R480 System BIOS: QSFX74400925, Version 1.0.2015, release 1.15, Date 10/31/2007

BIOS setup section
113. irmware and the tool needed to carry out the upgrade are included in the following RPM:

    update-bmc-fw-<BMC firmware version>.Bull.x86_64.rpm

The BMC firmware of the SIMSO board can be updated under Linux using the updatefw_x86_64 command.

To update the BMC firmware on the local machine, do the following:

1. Install the update-bmc-fw-<fw version> RPM onto the machine.
2. Start the IPMI service, if it has not already been started:
       service ipmi start
3. Run the command below:
       updatefw_x86_64 -f /usr/local/firmware/<firmware>.bin
   where <firmware> is:
       ubsim<BMC FW version> for a SIMSO board
       ugsim<BMC FW version> for a SIMSO+ (with KVM) board
4. To initialize the Sensor Data Repository (SDR) on the local machine:
       sdrload /usr/local/firmware/<platform>.sdr.dat
   where <platform> equals either r421, r422 (for NovaScale R422 and R422 1 machines) or R423.

To update the BMC firmware on a remote machine, do the following:

1. Install the update-bmc-fw-<fw version> RPM onto the local machine.
2. Run the command below:
       updatefw_x86_64 -i <IP Address> -u ADMIN -p ADMIN /usr/local/firmware/<firmware>.bin
   where <firmware> is:
       ubsim<BMC FW version> for a SIMSO board
       ugsim<BMC FW version> for a SIMSO+ (with KVM) board
3. To initialize the SDR on the remote machine:
       sdrload /usr/local/firmware/<platform>.sdr.dat BMC IP Ad
114. le script is the width check script, which allows you to easily check the fabric for 1X connections (links). While the fabric will work over a 1X connection, it will create a bottleneck and hurt performance within the fabric. All links should report no 1X connections when the script is run. Nothing other than the LID and GUID is reported for a full 4X link.

    ISR9288 utilities width check
    Verify every error found will be printed
    lid 1 guid 0008f104004004d7 ports 24
    lid 5 guid 0008f104003f0723 ports 24
    lid 4 guid 0008f104003f0722 ports 24
    lid 3 guid 0008 104003 071 ports 24
    lid 2 guid 0008 104003 071 ports 24
    lid 11 guid 0008 104003 0747 ports 24
    lid 10 guid 0008 104003 0746 ports 24
    lid 7 guid 0008f104003f073b ports 24

error find script

The easiest way to look for errors on all ports in the fabric is to run the error find script. It will report any non-zero port counters found throughout the fabric, on both switches and HCAs.

    ISR9288 utilities error find
    Show All Counter Errors every error found will be printed
    lid 1 guid 0008 104004004d7 ports 24
    lid 5 guid 0008 104003 0723 ports 24
        port 22 xmitdiscards 4
        port 20 linkdowned 1
        port 13
    lid 4 guid 0008 104003 0722 ports 24
        port 14

Event Notification Mechanism

Fabric
115. led
    Base I/O address (Serial port A): 3F8
    Interrupt (Serial port A): IRQ 4
    Serial port B: Enabled
    Mode: Normal
    Base I/O address (Serial port B): 2F8
    Interrupt (Serial port B): IRQ 3
    Parallel Port: Disabled
    Floppy disk controller: Disabled

    ECC Event Logging: Enabled

Console Redirection
    Com Port Address: On-board COM B
    Baud Rate: 115.2K
    Console Type: VT100
    Flow Control: None
    Console connection: Direct
    Continue C.R. after POST: On

    Fan Speed Control Modes: 1) Disable (Full speed)
    System Event Logging: Enabled
    Clear System Event Log: Disabled
    BIOS POST Errors: Enabled
    OS boot Watchdog: Disabled
    Subnet Mask

Security
    Supervisor Password Is: Disabled

Boot
    USB FDC

7.4.8 NovaScale R440 SATA BIOS Settings

System part number: N8100-1241E
R440 SATA BIOS: 5536
Motherboard jumper settings: JSASRAID2 1-2 (RAID disable)

BIOS setup section / Parameter: Value

    System Time: Current local time
    System Date: Current date
    Hard Disk Pre-Delay: Disabled
    Primary IDE Master Type: Auto
    32 Bit I/O: Enabled

Processor Settings
    Processor Retest: No
    Execute Disable Bit: Disabled
    Intel(R) Virtualization Tech: Disabled
    Enhanced Intel SpeedStep(R) Tech: Disabled

    Language: English (US)

Memory Configuration
    Memory Retest: No
    Extended RAM Step: Disabled
    Memory RAS Feature: Interleave
    Spar
116. lso checks /etc/vsftpd.ftpusers for users that are denied (root, bin, ...).

3. Start the vsftpd server as follows:

    root@host# service vsftpd start
    Starting vsftpd for vsftpd: [ OK ]

4. Check that FTP is working correctly:

    root@host# ftp host
    Connected to host.
    220 (vsFTPd 2.0.1)
    530 Please login with USER and PASS.
    530 Please login with USER and PASS.
    KERBEROS_V4 rejected as an authentication type
    Name (host:root): root
    331 Please specify the password.
    Password:
    230 Login successful.
    Remote system type is UNIX.
    Using binary mode to transfer files.
    ftp> quit
    221 Goodbye.

4.2.2 Configuring the FTP server options for the InfiniBand switch

Enter the FTP configuration menu as follows:

    ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    connecting
    switchname# config
    switchname(config)# ftp
    switchname(config-ftp)#

The following settings define the node 172.20.0.102 as the FTP server. The switch logs onto this server using Joe's account with the password yummy.

    switchname(config-ftp)# server 172.20.0.102
    switchname(config-ftp)# username joe
    switchname(config-ftp)# password yummy

Once FTP is set up on the switch, make sure the FTP server is running on the Management Node:

    ftp host

If ftp fails to connect to the host, as in the example above, it probably mea
117. only once Bull Advanced Server has been installed on the Management Node and the different node image files have been deployed.

2.3.1 Installing BSBR

BSBR uses the RPM included on the Bull Extension Pack CD. To install BSBR, insert the Bull Extension Pack CD in the drive and type the following commands:

    cd <mntdir>/tools/mkcdrec
    ./install.sh

where <mntdir> is the mount directory for the DVD/CD (see /etc/fstab).

Note: Ignore the warning message related to Webmin.

2.3.2 Configuring BSBR

The /var/opt/mkcdrec/Config.sh file contains the configuration parameters for BSBR. All parameters have a default value. By default, BSBR is configured to save on DVD or NFS.

The following values should be checked, either to verify that they fit your needs or to define values specific to your cluster.

ISOFS_DIR
    The temporary directory where all the files to be backed up are stored. The default is /tmp/backup.
    WARNING: The content of ISOFS_DIR is wiped out when the make clean command is used.

CDREC_ISO_DIR
    The location where the CDrec.iso (ISO9660) images will be made. The default is /tmp.

KERNEL_APPEND
    Add the nmi_watchdog=0 parameter to the variable, as in the following example:
    KERNEL_APPEND="nmi_watchdog=0"

EXCLUDE_LIST
    List of directories which should be excluded during backup. This also means they cannot be restored. The default is /tmp /proc /mnt.

IMPORTANT: Assuming th
118. machines. See the Bull NovaScale R42x AOC-SIMSO/SIMSO+ Installation and User's Guide for more information.

The Web interface provides access to the SOL console or the KVM console (SIMSO+), and also the means to access virtual devices for maintenance purposes.

To access the BMC of a remote machine through the Web interface:

1. The following RPMs, found in the BONUS directory on the Bull XHPC DVD, must be installed on the Management Node:

    XHPC/BONUS/jre-<version>-linux-i586.rpm
    XHPC/BONUS/firefox-<version>-Bull.0.i386.rpm

   These are installed by running the commands below:

    cd /release/XBAS5V1.1/XHPC/BONUS
    rpm -i jre-<version>-linux-i586.rpm firefox-<version>-Bull.0.i386.rpm

2. The Java plug-in should be configured for Firefox:

    /usr/java/jre<version>/plugin/i386/ns7/libjavaplugin_oji.so /usr/local/firefox/plugin

3. The remote BMC is accessed using the command below:

    /usr/local/firefox/firefox

4. In the navigation bar, enter the URL http://<BMC IP addr>

6.2 Updating the BMC Firmware on NovaScale R421, R422, R422 and R423 machines

These platforms use the BMC SIMSO or SIMSO+ add-on boards for platform management. Both boards provide IPMI 2.0 functions. The SIMSO+ board provides additional KVM-over-LAN functionality.

The BMC f
119. name of the program that has generated the message.

host(regexp)
    To filter by the regular expression of the name of the host that has sent the message.
match(regexp)
    To filter by a regular expression.
filter(filtername)
    To use another filter.

All keywords may be used several times. The expressions can contain the AND, OR and NOT operators.

Examples:

    filter f_iptables { match("IN ... OUT ... MAC"); };
    filter f_snort { match("snort"); };
    filter f_full { not filter(f_snort) and not filter(f_iptables); };
    filter f_messages { level(info..warn) and not facility(auth, authpriv, mail, news); };

log Section

In this section you define how the messages will be processed, using the source, destination and filter commands defined in the previous sections.

Syntax:

    log { source(s1); source(s2); ... filter(f1); filter(f2); ... destination(d1); destination(d2); ... flags(flag1[, flag2...]); };

Examples:

    log { source(src); filter(f_news); filter(f_notice); destination(newsnotice); };
    log { source(src); destination(full); };

2.2.6 Upgrading Emulex HBA Firmware with lptools

lptools is a set of two utilities for upgrading Emulex HBA firmware. These two utilities are:
• lputil: low-level tool used to interact with Emulex HBA
• lpflash: high-level script used to upgrade firmware of a set of Emulex

Emulex driver lpfc mo
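The log statement syntax described earlier in this fragment can be illustrated with a short generator. The following Python sketch assembles a log statement from lists of source, filter and destination names; the function render_log_statement is hypothetical, and the comma-separated form used for several flags is an assumption based on the syntax line above rather than on verified syslog-ng behaviour.

```python
def render_log_statement(sources, filters, destinations, flags=()):
    """Build a syslog-ng style log statement from already-defined names.

    Each name must refer to a source, filter or destination declared
    elsewhere in the configuration; nothing is validated here.
    """
    parts = [f"source({name});" for name in sources]
    parts += [f"filter({name});" for name in filters]
    parts += [f"destination({name});" for name in destinations]
    if flags:
        parts.append("flags(" + ", ".join(flags) + ");")
    return "log { " + " ".join(parts) + " };"


if __name__ == "__main__":
    print(render_log_statement(["src"], ["f_news", "f_notice"], ["newsnotice"]))
    print(render_log_statement(["src"], [], ["full"]))
```

Run as a script, it reproduces the two example statements from the log section, which makes it easy to see how filters narrow the messages sent to a destination.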
120. modules loaded?

This is confirmed by the following lines in the system logs of the machine from which the problem is coming:

    LustreError: 11602:0:(o2iblnd.c:1569:kiblnd_startup()) Can't query IPoIB interface ib0: it's down
    LustreError: 105-4: Error -100 starting up LNI o2ib

Please pay particular attention to the fact that the IPoIB interface has to be fully functional in order to start and run Lustre. Although Lustre data is not transmitted on the IPoIB interface, IPoIB is used by Lustre to create and manage InfiniBand connections.

3.6 Lustre File System High Availability Troubleshooting

Before using a Lustre file system configured with the High Availability (HA) feature, or in the event of abnormal operation of HA services, it is important to perform a check-up of the Lustre HA file system. This section describes the tools that allow you to make the required checks.

3.6.1 On the Management Node

The following tools must be run from the Management Node.

lustre_check
    This command updates the lustre_io_nodes table in the ClusterDB. The lustre_io_nodes table provides information about the availability and the state of the I/O nodes and metadata nodes.

lustre_migrate nodestat
    This command provides information about the node migrations carried out. It indicates which nodes are supposed to support the OST/MDT services. In the following example the MDS
121. imaging_complete_<nodeIP> or /var/lib/systemimager/overrides/patching_complete_<nodeIP> or /var/lib/systemimager/overrides/unpatching_complete_<nodeIP>

and

• ksis client traces on the Management Node in /var/lib/systemimager/overrides/imaging_complete_error_<nodeIP>. These traces will only be logged if the deployment error occurs on the client side.
• Patch deployment client traces on the Management Node in /var/lib/systemimager/overrides/patching_complete_error_<nodeIP> or /var/lib/systemimager/overrides/unpatching_complete_error_<nodeIP>

The client log files are used during the post-check phase. Ksis client and image server errors are compared in order to identify the source of any problems which may occur. The trace files are kept for support operations.

3.4 Storage Troubleshooting

This section provides some tips to help the administrator troubleshoot a storage configuration.

3.4.1 Management Tools Troubleshooting

3.4.1.1 Verbose Mode (-v Option)

Some of the storage commands have a verbose option which provides more output information during the processing of the command. See the BAS5 for Xeon Administrator's Guide for an inventory of storage commands supporting the -v option.

3.4.1.2 Log/Trace System

Principle

If the verbose mode is not enough, a system of traces can also be configured to obtain more information on some commands. To activate these traces you c
122. ns that the FTP server has not been installed on the host.

    ftp: connect: Connection refused
    ftp> quit

4.3 Upgrading the firmware

In the following example it is assumed that the end user has stored the firmware in the existing <path_to_firmware> directory.

1. Extract the firmware archive to the <path_to_firmware> directory as follows:

    cd <path_to_firmware>
    tar xvf Ver 10 06 fw 1 0 0 tar
    voltaire fw images tar
    voltaire fw ini tar
    howto upgrade voltaire switch txt

2. Once the firmware has been extracted, log on to the switch and proceed with the upgrade.

a. Upgrading the firmware for the whole switch:

    user@host$ ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname# update firmware chassis <path_to_firmware>

b. Upgrading the firmware for a specific line board (line board 4 in the example below):

    user@host$ ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    connecting
    switchname# update firmware line 4 <path_to_firmware>

c. Upgrading a fabric board (fabric board number 2 in the example below):

    user@host$ ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname# update firmware spine 2 <path_to_firmware>

Note: Whenever
123. nstalled, do remove first.

2. Run the following command to remove the fs1 file system:

    lustre_util remove -f fs1

The command may fail with a message similar to: file system not loaded, try to give the full path.

If it is not possible either to re-install or to remove the file system (even with the force option), the lustre_fs_dba command can be used to remove the file system information from the cluster management database. For example, to remove the fs1 file system description from the cluster management database, enter the following command:

    lustre_fs_dba del -f fs1

After this command, the file system can be re-installed using the lustre_util install command.

3.5.4 Cannot create file from client

If you get the following error message when you try to create a new file from a Lustre client, it simply means that the user UID you use to create the file is not recognized by the Lustre file system:

    touch: cannot touch `/mnt/lustre/myfile': Identifier removed

To avoid such problems, all user UIDs that exist on the Lustre client nodes must also exist on the MDS server.

3.5.5 No such device

If the start of the Lustre file system fails with the following message, most of the time this is because InfiniBand is not properly configured on the Lustre nodes:

    mount.lustre: mount /dev/ldn.lustrefda2500.4 at /mnt/srv_lustre/scratch/scratch-OST0003 failed: No such device
    Are the lustre
124. nt contribution towards maintaining a cluster in a stable and reliable condition. This chapter is aimed at helping you to develop a general, comprehensive methodology for identifying and solving problems on- and off-site.

The following topics are described:
• 3.1 Troubleshooting Voltaire Networks
• 3.2 Troubleshooting InfiniBand Stacks
• 3.3 Node Deployment Troubleshooting
• 3.4 Storage Troubleshooting
• 3.5 Lustre Troubleshooting
• 3.6 Lustre File System High Availability Troubleshooting
• 3.7 SLURM Troubleshooting
• 3.8 FLEXlm License Manager Troubleshooting

3.1 Troubleshooting Voltaire Networks

3.1.1 Voltaire's Fabric Manager

Voltaire's Fabric Manager enables InfiniBand fabric connectivity debugging using the built-in Performance Manager (PM). It has two major capabilities:

Port Counters Monitoring and Report
    The PM generates a periodic port counters report file in CSV format that can be loaded into Excel and further analyzed by the user. It also monitors port counter errors and reports every port that passes its error threshold limit, as configured by the user.

Event Logging
    This creates an event log file for both IB traps and SubNet internal events. The user may filter the events using a GUI and/or a CLI. The filtering policy determines whether an event is logged and whether a trap is generated.

It is essential to identify any problem ports and node connectivity problems prior to running applications, as well as d
125. BULL CEDOC, 357 AVENUE PATTON, B.P. 20845, 49008 ANGERS CEDEX 01, FRANCE. REFERENCE 86 A2 90EW 01
126. ommand must be run from the Management Node. The tasks can be performed on any type of node (Compute Node, I/O Node, etc.) except the Management Node.

Usage:

    /usr/sbin/nsctrl [options] action <nodes>

General Options:

    --debug          Debug mode (more than verbose).
    --dbname name    Specify database name.
    --force, -f      Do not ask for confirmation or state checking.
    --group, -g      Specify a group of nodes. You can use the dbmGroup show command to display the defined groups.
    --help, -h       Display nsctrl help.
    --interval, -i   Specify the number of nsm calls before waiting the period defined by the time option.
    --jobs           Number of simultaneous actions; for example, with 5 you can run 5 simultaneous nsmpower processes. Default: 30.
    --test           Display the NS Commands that would be launched according to the specified options and action. This is a testing mode; no action is performed.
    --time, -t       Time to wait after the number of nsm calls defined by the interval option.
    --verbose, -v    Verbose mode.

Specifying nodes:

The nodes are specified as follows: basename[i,j-k]. If no nodes are explicitly specified, nsctrl uses the nodes defined by the -g or --group option.

Actions:

    poweron, poweroff, poweroff_force, reset, status, ping

Examples

Note: In the following examples, the -o option (only test) is used to display which NS Commands would be launched for
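The node notation described above lists single indexes and index ranges after a common base name. As an illustration only, the following Python sketch expands such a specification into individual node names. It assumes the bracketed form basename[i,j-k] with plain decimal indexes; the function expand_nodes is hypothetical and is not part of nsctrl.

```python
import re


def expand_nodes(spec: str) -> list:
    """Expand 'base[i,j-k]' into ['basei', 'basej', ..., 'basek'].

    A specification without brackets is returned unchanged as a
    single-element list.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", spec)
    if match is None:
        return [spec]
    base, body = match.groups()
    names = []
    for item in body.split(","):
        low, sep, high = item.partition("-")
        if sep:
            # Inclusive range j-k.
            names.extend(f"{base}{n}" for n in range(int(low), int(high) + 1))
        else:
            names.append(f"{base}{int(low)}")
    return names


if __name__ == "__main__":
    print(expand_nodes("ns[1,3-5]"))
```

Expanding the list beforehand is a convenient way to confirm which nodes an action would touch before running it without the test option.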
127. on : MPTFW-01.15.20.00-IT
    FW Version : 1.02.00-0119
    WebBIOS Version : 1.01-24
    Ctrl-R Version : 1.02-007
    Pending Images In Flash

Note: The following MegaRAID card details are also provided when the AdpAllInfo command runs: PCI slot info, Hardware Configuration, Settings and Capabilities for the card, Status, Limitations, Devices present, Virtual Drive and Physical Drive Operations supported by the card, Error Counters and Default Card Settings.

2. Decompress and extract the firmware by running the command below:

    unzip lsi-5.1.1-0054_SAS_FW_Image_1.03.60-0255.zip
    Archive: /root/lsi-5.1.1-0054_SAS_FW_Image_1.03.60-0255.zip
      inflating: sasfw.rom
      inflating: 5.1.1-0054_SAS_FW_Image_1.03.60-0255.txt
      extracting: DOS_MegaCLI_1.01.24.zip

3. Update the firmware with the MegaCLI tool, using the command below:

    /opt/MegaCli -adpfwflash -f sasfw.rom -a0
    Adapter 0: MegaRAID SAS 8408E
    Vendor ID: 0x1000, Device ID: 0x0411
    FW version on the controller: 1.02.00-0119
    FW version of the image file: 1.03.60-0255
    Flashing image to adapter
    Adapter 0: Flash Completed

4. Reboot the server so that the new firmware is activated for the card.

Chapter 6. Accessing, Updating and Reconfiguring the BMC Firmware on NovaScale R4xx machines

This chapter describes how to update the BMC firmware on NovaScale R421, R422, R422
128. on on all the tools described in this section, and also on the other OpenIB tools which are available.

3.3 Node Deployment Troubleshooting

ksis is the deployment tool used to deploy node images on Bull HPC systems. This section describes how deployment problems are logged by ksis for different parts of the deployment procedure.

3.3.1 ksis deployment accounting

Following each deployment, ksis takes stock of the nodes and identifies those that have had the image successfully deployed onto them and those that have not. This information is listed in the files below and remains available until the next image deployment:
• list of nodes successfully deployed to: /tmp/ksisServer/ksis_nodes_list
• list of nodes not deployed to: /tmp/ksisServer/ksis_exclude_nodes_list

When the image has failed to be deployed to a particular node, ksis adds a line to the ksis_exclude_nodes_list file to indicate:

a. The name of the node, between square brackets.
b. The consequences of the problem for the node. Three states are possible:
    not touched    The node was excluded from the deployment, with no impact for the node.
    restored       The configuration of the node was modified, but its initial configuration was able to be restored.
    corrupt        The node was corrupted by the operation.
c. The circumstance which led to the deployment problem.

Example:

    [node2] not touched: node is configur
129. operating state or a reconfiguration Create Delete normal system event Multicast group Applied routing scheme Port State Change

3.2 Troubleshooting InfiniBand Stacks

A suite of InfiniBand diagnostic tools is provided with the Bull Advanced Server. There is a hierarchical dependency between these tools, as shown in the diagram below. For example, ibchecknet is dependent on ibnetdiscover, ibchecknode, ibcheckport and ibcheckerrs.

Figure 3-1. OpenIB Diagnostic Tools Software Stack (READ ONLY PROGRAMS, READ/WRITE PROGRAMS)

Use the following command to launch the diagnostic tools:

    openib diags

ibstatus, ibtracert and ibdoctor (a tool developed by Bull) are described in Chapter 2, Day to Day Maintenance Operations. Some of the more useful troubleshooting tools are described below.

3.2.1 smpquery - Subnet Manager Query

smpquery includes a subset of standard SMP query options which may be used to bring up information, in a human-readable format, for different parts of the network, including nodes, ports and switches.

The basic syntax for the command is as follows:

    smpquery [options] <op> <dest_addr> [op_params]

nodeinfo example

An example of the use of this command, including the Local ID and the port number, is below:

    smpquery nodeinfo 45 1

The resulting information output will be similar to that displayed below:
130. ory space, etc.) compared to the parameters specified in the slurm.conf file, then either fix the node or change slurm.conf. For example, if the temporary disk space specification is TmpDisk=4096, but the available temporary disk space falls below 4 GB on the system, SLURM marks it as down.

2. If the reason is "Not responding", then check the communication between the Management Node and the DOWN node by using the following command:

    ping <address>

   Check that the <address> specified matches the NodeAddr values in the slurm.conf file. If ping fails, then fix the network or the address in the slurm.conf file.

3. Log in to the node that SLURM considers to be in a DOWN state and check whether the slurmd daemon is running, using the following command:

    ps -ef | grep slurmd

4. If slurmd is not running, restart it as the root user using the following command:

    service slurm start

5. Check the SlurmdLogFile file named in the slurm.conf file for an indication of why it failed.

   a. If slurmd is running but not responding (a very rare situation), then kill and restart it as the root user using the following commands:

    service slurm stop
    service slurm start

6. If the node is still not responding, there may be a Network or Configuration problem; see section 3.7.5 Networking and Configuration Problems.

7. If the node is still not responding, increase the verbosity of debug messages by
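The TmpDisk example in step 1 is a comparison between a configured figure in megabytes and the space actually present on the node. The following Python sketch shows that arithmetic. The function names are hypothetical, and the sketch follows the wording of the text by measuring free space; whether a given SLURM release compares free space or total file system size is not something this fragment settles.

```python
import shutil


def tmp_disk_sufficient(space_bytes: int, tmp_disk_mb: int) -> bool:
    """Return True if space_bytes covers a TmpDisk value given in MB."""
    return space_bytes >= tmp_disk_mb * 1024 * 1024


def check_tmp(path: str = "/tmp", tmp_disk_mb: int = 4096) -> bool:
    """Compare the free space under path with the configured TmpDisk.

    Swap .free for .total if the file system size is the figure that
    matters on your installation.
    """
    return tmp_disk_sufficient(shutil.disk_usage(path).free, tmp_disk_mb)


if __name__ == "__main__":
    print("TmpDisk=4096 satisfied:", check_tmp())
```

With TmpDisk=4096 the threshold is 4096 x 1024 x 1024 = 4,294,967,296 bytes, so a node with one byte less than that would fail the check.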
131. ounters from lid 32, port 1, enter:
      perfquery 32 1
• To read node aggregated performance counters, enter:
      perfquery -a 32
• To read performance counters and reset, enter:
      perfquery -r 32 1
• To reset performance counters of port 1 only, enter:
      perfquery -R 32 1
• To reset performance counters of all ports, enter:
      perfquery -R -a 32
• To reset only non-error counters of port 2, enter:
      perfquery -R 32 2 0xf000

Example output

The resulting information output will be similar to that displayed below:

    # Port counters: Lid 45 port 2
    PortSelect:......................2
    CounterSelect:...................0x0000
    SymbolErrors:....................0
    LinkRecovers:....................0
    RcvRemotePhysErrors:.............0
    XmtDiscards:.....................0
    RcvConstraintErrors:.............0
    LinkIntegrityErrors:.............0
    ExcBufOverrunErrors:.............0
    XmtData:.........................458424
    RcvData:.........................1908363
    XmtPkts:.........................6367
    RcvPkts:.........................41748

ibnetdiscover and ibchecknet

ibnetdiscover is used to scan the topology of the s
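The perfquery output shown in this fragment is a list of "Name:....value" lines, which makes it easy to screen for non-zero error counters. The following Python sketch parses such text and returns only the counters that are not plain traffic or selector fields. The function names and the set of fields treated as non-errors are assumptions for illustration; the sketch works on captured text and does not call perfquery itself.

```python
import re

# One counter per line: a name, a colon, filler dots, then the value.
_LINE = re.compile(r"^(\w+):\.*(0x[0-9a-fA-F]+|\d+)$")

# Fields that are selectors or traffic volumes rather than error counts.
_NOT_ERRORS = {"PortSelect", "CounterSelect",
               "XmtData", "RcvData", "XmtPkts", "RcvPkts"}


def parse_counters(text: str) -> dict:
    """Return {counter name: integer value} for every counter line."""
    counters = {}
    for raw in text.splitlines():
        match = _LINE.match(raw.strip())
        if match:
            name, value = match.groups()
            base = 16 if value.lower().startswith("0x") else 10
            counters[name] = int(value, base)
    return counters


def nonzero_errors(counters: dict) -> dict:
    """Keep only the error counters whose value is not zero."""
    return {name: value for name, value in counters.items()
            if name not in _NOT_ERRORS and value != 0}
```

A typical use is to save the output of a perfquery run to a file, pass its content to parse_counters, and then report whatever nonzero_errors returns for each port of interest.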
132. primary state to pair node state, run:

    lustre_migrate export -n <node_name>

   Or, to reset the switched node back to its primary state, run:

    lustre_migrate relocate -n <node_name>

4. Re-connect the Lustre File System to the Lustre HA system:

    lustre_ldap active -f fs1

3.7 SLURM Troubleshooting

3.7.1 SLURM does not start

Check that all the RPMs have been installed on the Management Node by running the command below:

    rpm -qa | grep slurm

The following RPMs should be listed:

    slurm-x.x.xx-x.Bull
    slurm-auth-none-x.x.xx-x.Bull
    pam_slurm-x.x-x.x.xx-x.Bull
    slurm-auth-munge-x.x.xx-x.Bull

Note: The version numbers depend on the release and are indicated by the letter x above.

3.7.2 SLURM is not responding

1. Run the command scontrol ping to determine if the primary and backup controllers are responding.

2. If they respond, then there may be a Network or Configuration problem; see section 3.7.5 Networking and Configuration Problems.

3. If there is no response, log on to the machines to rule out any network problems.

4. Check to see if the slurmctld daemon is active by running the following command:

    ps -ef | grep slurmctld

   a. If slurmctld is not active, restart it as the root user using the following command:

    service slurm start

   b. Check the SlurmctldLogFile file named in the slurm.conf file for an indication of why it failed.
133. r Console Redirection

    BIOS Redirection Port        Serial Port B
    ACPI Redirection Port        Disabled
    Baud Rate                    115.2K
    Flow Control                 None
    Terminal Type                VT100
    Remote Console Reset         Enabled
    Assert NMI on PERR           Enabled
    Assert NMI on SERR           Enabled
    FRB-2 Policy                 Retry 3 Times
    Boot Monitoring              Disabled
    Boot Monitoring Policy       Retry 3 Times
    Thermal Sensor               Enabled
    BMC IRQ                      IRQ 11
    Post Error Pause             Enabled
    AC-LINK                      Stay Off
    Power On Delay Time          20
    Platform Event Filtering     Enabled

    Boot order: USB FDC, USB CDROM, USB KEY, IDE CD, PCI BEV: IBA GE Slot 0C00 v1236, PCI SCSI

7.4.10 NovaScale R460 BIOS Settings

    System part number           N8100 1247E (R460)
    BIOS                         5S46
    Motherboard jumper settings  JSASRAID2 1-2 (RAID disable)

    BIOS setup section / Parameter               Value
    System Time                                  <Current local time>
    System Date                                  <Current date>
    Hard Disk Pre-Delay                          Disabled
    Advanced, Processor Settings
        Processor Retest                         No
        Execute Disable Bit                      Disabled
        Intel(R) Virtualization Tech             Disabled
        Enhanced Intel SpeedStep(R) Tech         Disabled
    Advanced, Memory Configuration
        Memory Retest                            No
        Extended RAM Step                        Disabled
        Memory RAS Feature                       Interleave
        Sparing                                  Disabled
    Onboard VGA
        VGA Controller                           Enabled
        Option ROM Scan                          Auto
    Onboard LAN
        LAN Controller                           Enabled
        LAN1 Option ROM Scan                     Enabled
        LAN2 Option ROM Scan                     Enabled
    PCI Slot 1B Option ROM                       Enabled
    PCI Slot 1C Option ROM                       Enabled
    PCI Slot 2B Option ROM                       Enabled
    PCI Slot 2C Option ROM                       Enabled
    PCI Slot 3B Opt
134. r your shell script can cause many kinds of errors; check these files first when something goes wrong. First, be sure that your file system is mounted and that you have the mandatory user rights.

Hung Nodes

There is no way to clear a hung node except by rebooting. If possible, unmount the clients, shut down the MDS and OSTs, and shut down the system.

Suspected File System Bug

If you have rebooted the system repeatedly without following complete shutdown procedures, and Lustre appears to be entering recovery mode when you do not expect it, take the following actions to cleanly shut down your system:

1. Stop the login nodes and all other Lustre client nodes. Include the -F option with the lustre_util command to unmount the file system:

    lustre_util umount -F -f <file system> -n <node name>

2. Shut down the rest of the system.
3. Run the e2fsck command.

Cannot re-install a Lustre File System if the status is CRITICAL

If the status of a file system is CRITICAL according to the lustre_util status command, and if the file system needs to be re-installed (for instance, if some nodes of the cluster have been deployed and reconfigured), it is possible that the file system description needs to be removed from the cluster management database, as shown below:

1. Run the following command to install the fs1 file system:

    lustre_util install -f /etc/lustre/models/fs1.lmf

   The command may issue an output similar to: file system already i
135. s switchname utilities

Once in the utilities menu, check which firmware version is installed:

    switchname utilities firmware verify anafa II
    Scan Fabric
    Default fw version is 00.08.06

4.2 Configuring FTP for the firmware upgrade

If the switch firmware requires an upgrade, the FTP options for the switch will need to be set. These may already be in place following the initial installation and configuration of the cluster. If not, they are put into place as follows.

4.2.1 Installing the FTP Server

To install the FTP server (vsftpd), proceed as follows:

    rpm -ivh <path to>/vsftpd-<version>.<arch>.rpm

By default, the vsftpd daemon will not allow root access to the FTP server. For security reasons, it is advised to create a dedicated user for this purpose. However, if you wish to enable root access to the FTP server, vsftpd can be configured to allow this as follows:

1. Edit the /etc/vsftpd/ftpusers file and comment out the line that starts with root, as shown below:

    # Users that are not allowed to login via ftp
    #root
    bin

2. Edit /etc/vsftpd/user_list and comment out the line that starts with root, as shown below:

    # vsftpd userlist
    # If userlist_deny=NO, only allow users in this file
    # If userlist_deny=YES (default), never allow users in this file, and
    # do not even prompt for a password.
    # Note that the default vsftpd pam config a
136. s of transfer rate
-v  enable verbose mode, which includes all sysfs supported parameters for the port interface and port

Syntax

    ibstatus [-h] [devname[:port]]

Examples
- display status of all IB ports, enter: ibstatus
- display status of mthca1 ports, enter: ibstatus mthca1
- show status of specified ports, enter: ibstatus mthca1:1 mthca0:2

Output example for a mthca dual port HCA:

    Infiniband device 'mthca0' port 1 status:
        default gid:   fe80:0000:0000:0000:0008:f104:0397:7ca5
        base lid:      0x0
        sm lid:        0x0
        state:         1: DOWN
        phys state:    2: Polling
        rate:          2.5 Gb/sec (1X)

    Infiniband device 'mthca0' port 2 status:
        default gid:   fe80:0000:0000:0000:0008:f104:0397:7ca6
        base lid:      0x2d
        sm lid:        0x3
        state:         4: ACTIVE
        phys state:    5: LinkUp
        rate:          10 Gb/sec (4X)

2.4.1.2 ibstat Command

ibstat works in a similar fashion to the ibstatus utility, but it is implemented as a binary rather than a script, and it is more useful than ibstatus because it provides more detailed information. It includes options to list Channel Adapters and/or Ports.

Syntax

    ibstat [-d(ebug)] [-l(ist_of_cas)] [-p(orts_list)] [-s(hort)] <ca_name> [portnum]

ibstat command examples
- display status of all IB ports, enter: ibstat
- display status of mthca1 ports, enter: ibstat mthca1
- show status of specified ports, enter: ibstat mthca1 2
- To list
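When many nodes have to be checked, the ibstatus listing can be parsed and filtered for ports that are not ACTIVE. The following is a hedged sketch, not a supported tool: it assumes the `Infiniband device '<name>' port <n> status:` header followed by `key: value` lines as in the output example above, and `parse_ibstatus`, `inactive_ports` and `SAMPLE` are hypothetical names.

```python
import re

# Shortened sample in the assumed ibstatus layout.
SAMPLE = """\
Infiniband device 'mthca0' port 1 status:
    state:       1: DOWN
    phys state:  2: Polling
    rate:        2.5 Gb/sec (1X)

Infiniband device 'mthca0' port 2 status:
    state:       4: ACTIVE
    phys state:  5: LinkUp
    rate:        10 Gb/sec (4X)
"""

HEADER_RE = re.compile(r"Infiniband device '?([\w-]+)'? port (\d+) status")


def parse_ibstatus(text):
    """Return one dict per port, holding the device, port and status fields."""
    ports = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = {"device": header.group(1), "port": int(header.group(2))}
            ports.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only: values such as "4: ACTIVE"
            # contain a colon themselves.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return ports


def inactive_ports(ports):
    """List (device, port) pairs whose logical state is not ACTIVE."""
    return [(p["device"], p["port"]) for p in ports
            if "ACTIVE" not in p.get("state", "")]
```

For the sample text, only port 1 of mthca0 is reported, which matches the DOWN and Polling states shown for that port.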
137. sing the error and traffic counters.

Data related options

By default, IBS analyses the data contained in the IBSDB database, unless the -s or -l flags are used. This default mode is known as database mode.

-s switch
    Connected mode. Connect to the switch specified by its hostname or IP address, and then retrieve the NetworkMap.xml and PortCounters.csv files for this switch.

-l
    Local mode. Use the NetworkMap.xml and PortCounters.csv files that are available locally, or that are specified by the -f and -c flags, for the analysis. These files can then be analysed separately on a machine which is not part of the cluster. However, as stated above, it is better to work within the OFED stack using the -N and -E options to obtain the latest data.

-f filename
    Specify the file to be used when loading or saving the network file (NetworkMap.xml). When used in conjunction with the -s switch option, the file downloaded from the switch will be saved to the file filename. When used in conjunction with the -l flag, the specified file will be used as the input file.

-c filename
    Specify the file to be used when loading or saving the port counters file (PortCounters.csv). When used in conjunction with the -s switch option, the file downloaded from the switch will be saved to the file filename. When used in conjunction with the -l flag, the specified file will be used as the input file.
138. ssary to launch different stages in sequence. The nsclusterstop script includes all the required stages.

1. From the Management Node, run: nsclusterstop
2. Stop the Management Node.

1.4.2 Starting the HPC Cluster

To start the whole cluster in complete safety, it is necessary to launch different stages in sequence. The nsclusterstart script includes all the required stages.

1. Start the Management Node.
2. From the Management Node, run: nsclusterstart

See Chapter 2 for details about the nsclusterstop and nsclusterstart commands and their associated configuration files.

BAS5 for Xeon Maintenance Guide

Chapter 2. Day to Day Maintenance Operations

2.1 Maintenance Tools Overview

This chapter describes a set of maintenance tools provided with a Bull HPC cluster. These tools are mainly Open Source software applications that have been optimized, in terms of CPU consumption and data exchange overhead, to increase their effectiveness on Bull HPC clusters, which may include hundreds of nodes. The tools are usually available through a browser interface or through a remote command mode. Access requires specific user rights and is based on secured shells and connections.

    Function: Administration, Backup / Restore, Monitoring, Debugging, Testing

    Tool                              Purpose
    ConMan, ipmitool                  Managing Consoles through Serial Connection
    nsclusterstop / nsclusterstart    Stopping / Starting the cluster
    nsctrl
139. Figure 7-1. Example BIOS parameter setting screen for NovaScale R421

Figure 7-2. Example BIOS parameter setting screen for NovaScale R422

7.4.2 NovaScale R421 BIOS Settings

    Mainboard, BIOS: X7DBR 8 X7DBR 1 3 R421

    BIOS setup section / Parameter    Value
    System Time                       <Current local time>
    System Date                       <Current date>
    Legacy diskette A                 Disabled
    Serial ATA                        Enabled
    Native Mode Operation             Serial ATA
    SATA Controller Mode Option       Compatible
    Advanced
        QuickBoot Mode                Enabled
        QuietBoot Mode                Disabled
        POST Errors                   Disabled
        ACPI Mode                     Yes
140. switch to stop it.
- Power on the Ethernet switch to start it.
- If an Ethernet switch must be replaced, the MAC address of the new switch must be set in the ClusterDB. This is done as follows:

1. Obtain the MAC address for the switch (generally written on the switch, or found by looking at the DHCP logs).
2. Use the phpPgAdmin Web interface of the database to update the switch MAC address: http://IPadressofthemanagementnode/phpPgAdmin/ (user clusterdb and password clusterdb).
3. In the eth_switch table, look for the admin_macaddr row in the line corresponding to the name of your switch. Edit and update this MAC address. Save your changes.
4. Run the dbmConfig command from the Management Node:

    dbmConfig configure --service sysdhcpd --force --nodeps

5. Power off the Ethernet switch.
6. Power on the Ethernet switch.

The switch issues a DHCP request and loads its configuration from the Management Node.

See the Bull HPC BAS5 for Xeon Administrator's Guide for information about how to perform changes for the management of the ClusterDB.

1.3 Stopping/Restarting a Backbone Switch

The backbone switches enable communication between the cluster and the external world. They are not listed in the ClusterDB. It is not possible to use ACT for their reconfiguration.

1.4 Stopping/Restarting the HPC Cluster

1.4.1 Stopping the HPC Cluster

To stop the whole cluster in complete safety, it is nece
141. the specified action.
- To power off node ns1, enter:

    nsctrl o poweroff force 51
    nsl /usr/NSMasterHW/bin/nsmpower.sh a off force m ipmilan H 1 u user2

- To ping node ns1, enter:

    nsctrl o ping

2.2.4 Remote Hardware Management CLI (NS Commands)

The Remote Hardware Management CLI (Command Line Interface) is a set of commands that perform hardware tasks on Bull HPC clusters; these are also known as NS Commands. These commands provide the administrator with an easy way to automate scripts to power on, power off, and get hardware information about the nodes.

2.2.5 Managing System Logs (syslog-ng)

For security and tracking purposes, and also to decrease the amount of administration work resulting from the size of the cluster, all the system logs are centralized on the Management Node. There are two ways to send system log information to the Management Node:

- Logs are collected on each node using standard mechanisms for archival and log file rotation. Various utilities ensure compression, transfer and archival of these log files on the Management Node in asynchronous mode. A centralized operation is performed on the Management Node in order to extract and search events according to the criterion required, for example date, type, severity and so on. This asynchronous process facilitates corrective actions for the incidents that have occurred on the cl
142. tion, you can get basic information about Host Bus Adapters (model, link up or down). When getting the HBA inventory in verbose mode, more details are available: firmware levels, serial number, WWNN and WWPN for fibre channel HBAs.

Example:

    lsiocfg -cv

    HOST CHANNEL INVENTORY
    Host   Driver       Unique id  Cmd/Lun  HostQ  State    Model
    host0  mptbase      0 7
    host1  mptbase      1 7
    host2  1 0 30 LINK_UP LP11000
           DRV 8 0 30_p1 FW 2 10A7 B2D2 10A7 Bus Number 26 SN VM53824841
           Host WWNN 20:00:00:00:c9:4b:e7:02 Host WWPN 10:00:00:00:c9:4b:e7:02
           FN 20:00:00:00:c9:4b:e7:02 speed 2 Gbit
    host3  usb storage  0 1

2.4.4.2 Disks Inventory

Using the lsiocfg disk inventory option, you can get basic information about the available disks: system location, vendor, state, disk size. When getting the disk inventory in verbose mode, more details are shown: model, serial number, firmware revision, WWPN (fibre channel devices).

    lsiocfg -dv

    DISK INVENTORY
    Dev  Location  Maj:Min  Vendor   State    Size(MB)  QueueDepth  Lname
    (location = Host:Channel:Id:LUN)
    sdb  0 0 10 0  8:16     SEAGATE  running  286102    31
         MODEL SEAGATE ST3300007LC FWREV 0003 SERIAL 3KROKTPH00007547TROP TRANSPORT SPI
    sdc  8 22 running 286102 31
         MODEL SEAGATE ST3300007LC FWRE
143. to generate a new license if this is the case.
3. The license vendor has changed encryption seeds (rare).

MULTIPLE <vendor daemon name> servers running

There are 2 lmgrd and vendor daemons running for this license file. Only one process per vendor daemon per node is allowed to run. Sometimes this can happen because lmgrd was killed with a -9 signal, which should not be done. lmgrd was then not able to bring the vendor daemon process down, so it is still running, although not able to serve licenses. If lmgrd is killed with a -9 signal, the vendor daemons must also be killed with a -9 signal. In general, lmdown should be used.

Vendor daemon cannot talk to lmgrd

This means a pre-version 3.0 lmgrd is being used with a 3.0+ vendor daemon. Simply use the latest version of lmgrd (it MUST be a version equal to or greater than the vendor daemon version). This can also happen if TCP networking does not function on the node where you are trying to run lmgrd (rare).

No licenses to serve

The license file has only uncounted licenses, and these do not require a server. Uncounted licenses have 0 or uncounted in the number of licenses field on the FEATURE line.

Other

Starting lmgrd.intel from a remote directory may lead to unknown results. If lmgrd.intel is started from a remote directory, the license file line VENDOR INTEL should be modified to include the root directory where the INTEL
144. tpl cfg file

2.2.1.2 Using ipmi Tools

The ipmitool command provides a simple command line interface to the BMC (Baseboard Management Controller). To use the SOL (Serial Over LAN) interface, run the following command:

    ipmitool -I lanplus -C 0 -U <BMC user name> -P <BMC password> -H <BMC IP Address> sol activate

BMC user name, BMC password and BMC IP Address are values defined during the configuration of the BMC and are taken from those in the ClusterDB. The standard values for user name/password are administrator/administrator.

ipmitool Command Useful Options

Note: If -H is not specified, the command will address the BMC of the local machine.

- To start a remote SOL session (to access the console):
    ipmitool -I lanplus -C 0 -H <ip addr> sol activate
- To reset the BMC and return to the BMC shell prompt:
    ipmitool -I lanplus -C 0 -H <ip addr> bmc reset cold
- To display the FRU of the machine:
    ipmitool -H <ip addr> fru print
- To display the network configuration:
    ipmitool -I lan -H <ip addr> lan print 1
- To trigger a dump (signal INIT):
    ipmitool -H <ip addr> power diag
- To power down the machine:
    ipmitool -H <ip addr> power off
- To perform a hard reset:
    ipmitool -H <ip addr> power reset
- To display the events recorded in the System Event Log (SEL):
    ipmitool -H <ip addr>
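When these ipmitool calls are scripted, building the argument list in one place helps keep the option order consistent. The sketch below only assembles the arguments and does not run anything; `ipmitool_argv` is a hypothetical helper, and the address in the usage note is an invented example. It uses only the options shown above (-I, -C, -U, -P, -H).

```python
def ipmitool_argv(host, *subcommand, user=None, password=None,
                  interface="lanplus", cipher_suite=None):
    """Build an ipmitool argument list for a remote BMC.

    host is the BMC IP address, subcommand the words that follow the
    options (for example "sol", "activate"). Nothing is executed here.
    """
    argv = ["ipmitool", "-I", interface]
    if cipher_suite is not None:
        argv += ["-C", str(cipher_suite)]
    if user is not None:
        argv += ["-U", user]
    if password is not None:
        argv += ["-P", password]
    argv += ["-H", host]
    argv += list(subcommand)
    return argv
```

For example, `ipmitool_argv("10.0.0.5", "sol", "activate", user="administrator", password="administrator", cipher_suite=0)` yields the same words as the SOL command at the top of this section; the list can then be handed to `subprocess.run` if execution is wanted. Passing the password as an argument exposes it in the process list, so a production script may prefer another mechanism.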
146. ubnet and converts the output into a human readable form. Global IDs, node types, port numbers, port Local IDs and NodeDescriptions are displayed. The full topology is displayed, including all nodes and links, with the option of highlighting those which are currently connected. The output may be printed to a topology file.

Syntax

    ibnetdiscover [options] [<topology filename>]

Non standard flags:
    -l  List of connected nodes
    -H  List of connected HCAs
    -S  List of connected switches

ibchecknet uses a topology file which has been created by ibnetdiscover to scan the network, validating the connectivity and reporting errors detected by the port counters. The command runs as follows:

    ibchecknet

A sample output is displayed below:

    #warn: counter SymbolErrors = 65535 (threshold 10)
    #warn: counter LinkRecovers = 26 (threshold 10)
    #warn: counter LinkDowned = 16 (threshold 10)
    #warn: counter RcvErrors = 21 (threshold 10)
    #warn: counter RcvSwRelayErrors = 54810 (threshold 100)
    #warn: counter XmtDiscards = 65535 (threshold 100)
    Error check on lid 2 port all: FAILED
    #warn: counter RcvSwRelayErrors = 3995 (threshold 100)
    Error check on lid 2 port 4: FAILED
    # Checked Switch: nodeguid 0x0008f104004118d8 with failure
    # Checking Ca: nodeguid 0x0008f10403979970
    # Checking Ca: nodeguid 0x0008f10403979860
    # Checking Ca: nodeguid 0x0008f104039798
    # Checking Ca: nodeguid 0x0008f1040397996
    # Checking Ca: nodeguid 0x0008f1040397
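The warning lines printed by ibchecknet can be ranked to see which counters exceed their thresholds by the largest factor. This is an illustrative sketch only: the regular expression tolerates both a punctuated `counter X = N (threshold T)` form and a bare `counter X N threshold T` form, since the exact punctuation is an assumption, and `parse_warnings` and `worst_offenders` are hypothetical names.

```python
import re

# Matches "counter <name> = <value> (threshold <limit>)" with the "=" and
# the parentheses optional.
WARN_RE = re.compile(
    r"counter\s+(\w+)\s*=?\s*(\d+)\s*\(?threshold\s+(\d+)\)?"
)

SAMPLE = (
    "#warn: counter SymbolErrors = 65535 \t(threshold 10)\n"
    "warn counter LinkDowned 16 threshold 10\n"
    "#warn: counter RcvSwRelayErrors = 54810 (threshold 100)\n"
    "Error check on lid 2 port all: FAILED\n"
)


def parse_warnings(text):
    """Return a list of (counter, value, threshold) tuples, in input order."""
    warnings = []
    for line in text.splitlines():
        match = WARN_RE.search(line)
        if match:
            warnings.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return warnings


def worst_offenders(warnings):
    """Sort warnings by how many times the value exceeds its threshold."""
    return sorted(warnings, key=lambda w: w[1] / max(w[2], 1), reverse=True)
```

Lines that carry no counter, such as the `Error check ... FAILED` summary, are skipped; they can be matched separately if the failing lid and port are needed.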
147. ure has expired

Your license has expired. The system time may be set incorrectly. Run the date command to make sure the date is not later than the Expiration Date listed in the license file.

<FEATURE name>: Invalid (inconsistent) license key

The license key and data for the feature do not match. This usually happens when a license file has been altered. See Entering License File Data above.

System Bootup Problems

For reasons unknown, some bootup files (/etc/rc, /sbin/rc2.d, etc.) refuse to run lmgrd with the simple commands indicated above. Here are two workarounds:

1. Use:

    nohup su username -c 'umask 022; lmgrd -c ...'

   It is not recommended to run lmgrd as root; the su username is used to run lmgrd as a non-privileged user.
2. Add sleep 2 after the lmgrd command.

Chapter 4. Updating the firmware for the InfiniBand switches

Voltaire switches should be properly configured to ensure maximum performance. For example, Voltaire switch firmware version 00.08.06 (ASIC) does not use Double Data Rate transfer for links which include Mellanox cards, and should be upgraded. The Voltaire switch firmware upgrade procedure is described below.

4.1 Checking which Firmware Version is running

Go to the utilities menu as follows:

    ssh enable@switchname
    enable@switchname's password: voltaire
    Welcome to Voltaire Switch switchname
    Connecting
    switchname utilitie
148. uring standard operation.

Note: See the Voltaire Switch User Manual (ISR 9024, ISR 9096 and ISR 9288/2012 Switches) for details on how to configure and use Port Counters and the Performance Manager. This manual also includes a description of all the PortCounter fields and counter values.

3.1.2 Fabric Diagnostics

Diagnostics are recommended in the following cases:
- During Fabric installation and during startup.
- Before running an application.
- For performance problems, to locate discarded packets and link integrity problems.
- For job run problems, to locate malfunctioning nodes and get the overall fabric structure.
- For additional problems related to fabric stability, blocking or other.

3.1.3 Debugging Tools

Tools available to perform diagnostics:
- Use the Topology to see current problems
- Error Log
- Bad Ports Log
- Current Alarms Table
- Fabric Statistics (portcounters.csv file)

3.1.4 High Level Diagnostic Tools

1. Enable the SM Fabric Inspect preferences for debugging Fabric Failure.
2. Use the VFM/VDM Port Counters Information and Graph window to check the health of a specific port counter.
3. Use the Event Log to discover that there is a problem in the fabric. In the VFM, right-click and select View Event to get information that helps identify where the problem is located. Alternatively, you can show the Event Log from the CLI.
4. Use the Current Alarms T
149. 6.1 The Baseboard Management Controller
    6.1.1 Local access to the
    6.1.2 Remote access to the
6.2 Updating the BMC Firmware on NovaScale R421, R422, R422 E1 and R423 machines
6.3 Updating the BMC Firmware on NovaScale R421 E1
6.4 Updating the BMC firmware on NovaScale R440 and R460
6.5 Reconfiguring the BMC on R4xx machines

Chapter 7. Managing the BIOS on NovaScale R4xx Machines
7.1 Updating the BIOS on NovaScale R421, R422, R422 E1 and R423
    7.1.1 BIOS locally
    7.1.2 To install a new BIOS on remote machine using
    7.1.3 To install a new BIOS on a remote machine using the Web interface (R421, R422, R422 E1 and R423)
7.2 Updating the BIOS on NovaScale R421 E1 machines
    7.2.1 Installing a new BIOS on a local R421 E1 machine
    7.2.2 Installing the Bull HPC BIOS setup on
150. use of this command, including the Local ID, is below:

    smpquery switchinfo 4

The resulting information output will be similar to that displayed below:

    LinearFdbCap:....................49152
    RandomFdbCap:....................0
    McastFdbCap:.....................1024
    LinearFdbTop:....................46
    DefPort:.........................0
    DefMcastPrimPort:................0
    DefMcastNotPrimPort:.............0
    LifeTime:........................15
    StateChange:.....................0
    LidsPerPort:.....................0
    PartEnforceCap:..................32
    InboundPartEnf:..................1
    OutboundPartEnf:.................1
    FilterRawInbound:................1
    FilterRawOutbound:...............1
    EnhancedPort0:...................0

3.2.2 perfquery

perfquery uses Performance Management General Services Management Packets to obtain the PortCounters (basic performance and error counters) from the Performance Management Attributes at the node specified. The command syntax is shown below:

    perfquery [options] [<lid|guid> [[port] [reset_mask]]]

Non standard flags:
    -a  Show aggregated counters for all ports of the destination lid
    -r  Reset counters after read
    -R  Only reset counters

Examples
- read local port's performance counters, enter: perfquery
- read performance c
151. uster.
- Some events are immediately reported to the Management Node. Filters are used to specify the type and severity level of the events that have to be transferred immediately. This synchronous process instantaneously gives the administrator a global view of system events.

syslog-ng (Syslog New Generation) is the system log manager used on Bull HPC clusters to manage cluster system logs, and includes the following features:
- The ability to filter messages based on content, using regular expressions
- Encoding and authentication of the network traffic
- Forwarding logs using TCP and UDP protocols
- 09 compression

2.2.5.1 Configuring syslog-ng

syslog-ng is installed on the cluster using the default configuration. The scripts used to transfer log files are also installed. The administrators can modify the default configuration according to their needs. The /etc/syslog-ng/syslog-ng.conf file contains the configuration parameters for syslog-ng. This file is divided into five sections:

    options section       General options
    source section        Source events
    destination section   Log destinations
    filter section        Filter definitions
    log section           Actions to be performed on messages

options Section

Any general parameters may be configured in the options section. An example is below:

    # Start of options area
    options {
    sync (0);            # Number of events before writing in the logs
    time_reopen (10);
152. wd BMC user password

To install a new BIOS on a remote machine using the Web interface (R421, R422, R422 E1 and R423)

On the R421, R422, R422 E1 and R423 platforms, it is possible to access the BMC through the Web interface (see Chapter 6). From the administration node:

1. Start the Firefox navigator:

    /usr/local/firefox/firefox

2. In the navigation bar, type the URL of the remote BMC, http://<BMC IP addr>, and log in to the BMC.
3. Select the Virtual Media button and upload the BIOS image /usr/local/firmware/<BIOS>.IMG corresponding to the machine.
4. Select the Console button to access the console of the remote system.
5. Restart the remote system. The BIOS DOS image will boot and flash the new BIOS. The progress can be followed in the console window.
6. When the BIOS update has ended, the DOS prompt appears in the console window.
7. Select the Virtual Media button and discard the BIOS DOS image.
8. Reset the machine using the Remote Control button.

7.2 Updating the BIOS on NovaScale R421 E1 machines

The BIOS and BMC firmware for NovaScale R421 E1 machines are updated using the Intel One Boot Flash Utility (OFU).

7.2.1 Installing a new BIOS on a local R421 E1 machine

1. Install the OFU RPM package that contains the OFU Linux tools.
2. Install the bios R421E1 <bios version> RPM that contains the BIOS package and BIOS setup configuration file. The corresponding