Mellanox InfiniBand Sep 2013 InfiniBand debug
1. Switch : 0x0008f1040041082a ports 24 "ISR9024 Voltaire" enhanced port 0 lid 1 lmc 0
  Switch : 0x0008f104003 1090 ports 24 "ISR9288/ISR9096 Voltaire sLB-24D" base port 0 lid 2 lmc 0
  Switch : 0x0008f10400403349 ports 24 "ISR9096 Voltaire sFB-4D" enhanced port 0 lid 4 lmc 0
  Switch : 0x0008f104003 1091 ports 24 "ISR9288/ISR9096 Voltaire sLB-24D" base port 0 lid 3 lmc 0
• Lists all switches in cluster

ibhosts
  Ca : 0x0008f1040396a490 ports 2 "ws201 HCA-1"
  Ca : 0x0008f1040396b740 ports 2 "ws200 HCA-1"
  Ca : 0x0008f1040396e6cc ports 2 "ws203 HCA-1"

ibtracert
  [root@ws203 sbin]# ibtracert 9 7
  From ca {0x0008f1040396e6cc} portnum 2 lid 9-9 "ws203 HCA-1"
  [2] -> switch port {0x0008f1040041082a}[8] lid 1-1 "ISR9024 Voltaire"
  [14] -> switch port {0x0008f104003 1090}[15] lid 2-2 "ISR9288/ISR9096 Voltaire sLB-24D"
  [10] -> switch port {0x0008f10400403349}[13] lid 4-4 "ISR9096 Voltaire sFB-4D"
  [17] -> switch port {0x0008f104003 1091}[11] lid 3-3 "ISR9288/ISR9096 Voltaire sLB-24D"
  [16] -> ca port {0x0008f1040396b741}[1] lid 7-7 "ws200 HCA-1"
  To ca {0x0008f1040396b740} portnum 1 lid 7-7 "ws200 HCA-1"
• Shows path between 2 lids

2013 Mellanox Technologies, Mellanox Confidential

Cluster utilities: iblinkinfo
Reports link info for each port in an IB fabric, node by node. iblinkinfo can be used in a hybrid fabric to identify sub-optimal links.
Use: iblinkinfo | grep Could
  570 22 ==( 4X 10.0 Gbps (FDR10) Active/LinkUp )==> 617 9 "MFO it ib1 SXX536 L36 U1" (Could be 14.0625 Gbps)
  570 28 ==( 4X 10.0 Gbps (FDR10) Active/LinkUp )==> 617 3 "MFO it ib1 SXX536 L36 U1" (Could be 14.0625 Gbps)
  58 8 ==( 4X 10.0 Gbps Active/LinkUp )==> 487 1 "hydraio12 HCA-1" (Could be 14.0625 Gbps)
  58 12 ==( 4X 10.0 Gbps (FDR10) Active/LinkUp )==> 150 1 "hydraio24 HCA-1" (Could be 14.0625 Gbps)

smpquery
Allows a basic subset of standard SMP queries, including the following: node info, node description, switch info, port info.
• Common ops:
  • NodeInfo (NI) <addr>
  • NodeDesc (ND) <addr>
  • PortInfo (PI) <addr> [<portnum>]
  • SwitchInfo (SI) <addr>
  • PKeyTable (PKeys) <addr> [<portnum>]

Reading OpenSM log
Find the SM:
  sminfo
  sminfo: sm lid 573 sm guid 0x2c90300fe2ed1, activity count 26181972 priority 15 state 3 SMINFO_MASTER
Query node description:
  smpquery nd 573

Thank You
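The `iblinkinfo | grep Could` filter above can also be done in a script when the current and possible rates are needed as numbers. The following is a minimal sketch, not part of the original slides. It assumes the iblinkinfo output is already available as text lines, that each rate is written as a number followed by `Gbps`, and that the hint uses the words `Could be`. The function name `suboptimal_links` and the sample lines are hypothetical.

```python
import re

# A rate is a number followed by "Gbps". The exact spacing of real
# iblinkinfo output may differ; this pattern is an assumption.
RATE = re.compile(r"(\d+(?:\.\d+)?)\s*Gbps")


def suboptimal_links(lines):
    """Return (line, current_gbps, possible_gbps) for lines flagged 'Could be'."""
    found = []
    for line in lines:
        if "Could be" not in line:
            continue
        # Text before the hint holds the current rate, text after it the possible rate.
        current_part, _, hint_part = line.partition("Could be")
        current = RATE.search(current_part)
        possible = RATE.search(hint_part)
        if current and possible:
            found.append(
                (line.strip(), float(current.group(1)), float(possible.group(1)))
            )
    return found


# Hand-written sample lines modelled on the listing above.
sample = [
    "570 22 4X 10.0 Gbps (FDR10) Active/LinkUp ==> 617 9 (Could be 14.0625 Gbps)",
    "58 3 4X 14.0625 Gbps Active/LinkUp ==> 490 1",
]
for _, current_gbps, possible_gbps in suboptimal_links(sample):
    print(f"running at {current_gbps} Gbps, could be {possible_gbps} Gbps")
```

Only the first sample line is reported, since the second carries no hint.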
3. Replace the cable(s) with known good cable(s) of the same type.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
  If the issue relates to port(s) functionality: connect the port(s) to other known working port(s)/destinations.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
  If the issue relates to port(s) functionality: perform a loopback test with other working port(s) on the same switch.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
    Check No if the port(s) remained faulty while connected to other port(s) on the same switch.

Basic Tier 1 debug: 1U Switch
  Switch front status LED indicators state:
  Switch rear status LED indicators state:
  PSU module status LED indicators state:
  FAN module status LED indicators state:
  Active SM location (switch / OpenSM / UFM):
  Relevant port(s) status LED indicators state:

Basic Tier 1 debug: 1U Switch, SX60xx series
  Verify the software version currently installed on the switch: show version
  Check Yes if the installed software version equals the latest.
  If a software upgrade is needed, use the following link to download the latest software version available and documentation:
  http://support.mellanox.com/Sup
4. 06 InfiniBand debug
Mellanox InfiniBand, Sep 2013

IB bring up
• Topology matching
• Fabric clean up
• SM optimization
Fabric debug

IB Bring Up Procedure Overview
Make sure all nodes:
• Are responding
• Have the same OFED version
• Have all HCAs attached to the driver
• Have HCAs with the expected FW version
Run the SM.
Make sure all ports are in active state.
Loop while fabric cleaning in idle mode.
Match topology.
Loop while fabric cleaning under stress.
Optimize the SM.

Fabric clean up
Fabric cleaning is used to filter out bad HW, including bad cables, bad ports, or bad connectivity.
Fabric cleanup algorithm:
1. Zero all counters (ibdiagnet -pc).
2. Wait X time.
3. Check for errors exceeding the allowed threshold during this X time (ibdiagnet -lw 4x -ls 10 -P all=1).
4. Fix problematic links (re-seat or swap cables, replace switch ports or HCAs).
5. Go to 1.

What are we looking for
• Fabric configuration issues, SM, environmental
• Communication errors
• Switch module IPR/FCR status
• Hardware, cable
• Fabric topology issues

Recommended Practic
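The five-step fabric cleanup algorithm above is a loop that ends when a round finds no links over the threshold. The sketch below shows that control flow only and is not taken from the slides. The three callables are hypothetical hooks supplied by the caller (for example wrappers around `ibdiagnet -pc` and `ibdiagnet -lw 4x -ls 10 -P all=1`), and the `max_rounds` safety limit is an added assumption.

```python
import time


def clean_fabric(clear_counters, find_bad_links, fix_link, wait_seconds, max_rounds=10):
    """Repeat clear -> wait -> check -> fix until a round finds no bad links.

    clear_counters, find_bad_links and fix_link are hypothetical hooks
    supplied by the caller. Returns the number of rounds that were run.
    """
    for round_no in range(1, max_rounds + 1):
        clear_counters()           # step 1: zero all counters
        time.sleep(wait_seconds)   # step 2: wait X time
        bad = find_bad_links()     # step 3: links exceeding the allowed threshold
        if not bad:
            return round_no        # nothing left to fix, the fabric is clean
        for link in bad:
            fix_link(link)         # step 4: re-seat or swap cable, replace port or HCA
        # step 5: go back to step 1
    raise RuntimeError(f"fabric still has errors after {max_rounds} rounds")


# Dry run with stand-in hooks: the first round reports one bad link, the second none.
pending = [["sw1/p3"], []]
fixed = []
rounds = clean_fabric(lambda: None, lambda: pending.pop(0), fixed.append, wait_seconds=0)
print(rounds, fixed)
```

Passing the hooks in keeps the loop testable without a fabric; on a real cluster the wait would be the stress or idle period chosen for the test.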
5. Basic Tier 1 debug: HCA

Environment information (record the output of each command)
  cat /etc/issue
  uname -a
  cat /proc/cpuinfo | grep "model name"
  ofed_info | head -1
  ifconfig -a
Ports information
  ethtool <interface>
  ibstat
  ibv_devinfo
  ethtool -i <interface>
  ibdev2netdev
Card detection
  lspci | grep -i Mellanox
Collect log files
  /var/log/messages
  dmesg > system.log

Mellanox Firmware Tool (MFT)
  Download and install MFT: http://www.mellanox.com/content/pages.php?pg=management_tools&menu_section=34
  Refer to the User Manual for the installation instructions.
  Once installed, run:
    mst start
    mst status
    flint -d <mst device> q
Firmware version upgrade: [ ] Yes [ ] No
  Download the latest firmware version using the PSID (board ID): http://www.mellanox.com/supportdownloader
  flint -d <mst_device> -i <firmware bin file> b

Basic Tier 1 debug: 1U Switch
  Columns: Operation description | Operation performed | Has this operation resolved the issue
  Power cycle the switch.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
  If the issue relates to port(s) functionality:
6. Cluster utilities: ibswitches
  [root@ws203 sbin]# ibswitches
7. Recommended Practice for Cleaning the Fabric
• Zero port counters: it is very important to always start with a clean baseline.
• Run a stress test across the fabric: Pallas / Intel MPI Benchmark, mpi bandwidth, mpi latency, perf_main, etc.
• Identify issues through port counters: congestion, bad links, packet loss, etc.
• Locate and fix problems: cable faults, HCA reseating, etc.

Capture Log Files for support
Switch related:
• LOGs tar from the "export logs" command
• Portcounters.csv
• ibnetdiscover output
Host related:
• dmesg output
• Any onscreen errors
• uname -ar
• cat /etc/issue
• /var/log/messages
• lsmod
• lspci
• ibv_devinfo
• ib setup

Basic Tier 1 debug: HCA
  Columns: Operation description | Operation performed | Has this operation resolved the issue
  Reseat the card.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
  Replace the cable with a known good cable.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
  Connect the cable to another known working port(s).
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
  Swap the card with a known good card.
    Performed: [ ] Yes [ ] No    Resolved the issue: [ ] Yes [ ] No
    Check No if the issue migrated with the faulty card.
  Ports LED indicators state:
  Device information which the card is connected to:
  Active SM location:
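The host-related items in "Capture Log Files for support" lend themselves to a small collection script. The sketch below is one way to gather them into a single text report; it is an illustration, not a Mellanox tool. The command list is taken from the host-related list above (with `uname -a` standing in for the slide's `uname -ar`). The names `HOST_COMMANDS`, `run_command`, and `collect` and the report layout are hypothetical.

```python
import shlex
import subprocess

# Host-side commands named in the list above.
HOST_COMMANDS = ["dmesg", "uname -a", "cat /etc/issue", "lsmod", "lspci", "ibv_devinfo"]


def run_command(command):
    """Run one command without a shell and return its output as text."""
    try:
        done = subprocess.run(shlex.split(command), capture_output=True, text=True)
    except FileNotFoundError:
        # The tool is not installed on this host; note that in the report.
        return f"(command not found: {command})"
    return done.stdout + done.stderr


def collect(commands=HOST_COMMANDS, runner=run_command):
    """Build a plain-text report with one titled section per command."""
    sections = [f"== {command} ==\n{runner(command).rstrip()}\n" for command in commands]
    return "\n".join(sections)
```

On a host, `print(collect())` or writing the result to a file gives one attachment for a support case. The `runner` parameter exists so the report format can be checked without executing anything.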
8. In case it is not, use ib config to configure it (ib config -h for help).
4. If not ACTIVE, you should have no LEDs on the HCA.
5. Check that the SM is running.

Troubleshooting (cont.): Networking
6. Check that you can ping between nodes on IPoIB.
7. Run the command ifconfig and make sure the RUNNING flag appears at your IB interface.
8. If RUNNING does not appear, the IPoIB host has not joined the IB multicast group. In this case, check the SM health.
9. Check for IP problems such as a duplicate IP, a wrong routing table, or a wrong destination address.
10. If not, check that you have the latest firmware on the switches and the HCA ASIC.
11. Run cable and link tests.

06 ibdiagnet

Cluster utilities: ibdiagnet / ibdiagpath
Integrated diagnostic tools:
• Query cluster topology and indicate any port errors, link width or link speed mismatch
• Automate calls to many low-level operations
• Easy to use
• Similar flags, logs, and reports for both tools
• Report using meaningful names when a topology file is provided

Ibdiagnet Tool
ibdiagnet is an integrated InfiniBand fabric diagnostics command-line tool. It scans the IB fabric using directed lid route packets and extracts the available information regarding its connectivity and devices status. It then checks for errors in the following scopes:
• Ports: counters thresholds, port state
• Nodes: firmware versions, LID assignments
• Links: link speed and width, cables info
• Fabric: topology matching, Subnet Manager, routing
Errors are reported to screen and saved in a log file.
ibdiagnet is included in the ibutils package, which is part of Mellanox OFED.

Common usage example:
  ibdiagnet -pc -r -ls 14 -lw 4x --get_cable_info --pm_pause_time <time for test in sec, e.g. 1200> -o /var/ibdiagnet2_`date +%F_%H_%M_%S`
Options:
  -pc                          Perform a clear-counters operation fabric-wide
  -r                           Check for routing issues
  -lw <1x|4x|12x>              Link width checked on every port in the network
  -ls <2.5|5|10|14>            Link speed checked on every port in the network
  --get_cable_info             Read the cable info (type, length, manufacturer, etc.)
  --pm_pause_time <seconds>    Time to sleep before resuming counter collection
  -o <out dir>                 Output directory
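To make the link width and link speed check concrete: every link is compared against the expected values, and anything that differs is reported. The sketch below illustrates that comparison on hand-written records. It does not reproduce ibdiagnet's implementation or file formats; the record layout, the link names, and the function name `unmatched_links` are hypothetical.

```python
def unmatched_links(links, expected_width="4x", expected_speed="10"):
    """Return the links whose width or speed differs from the expected values."""
    return [
        link
        for link in links
        if link["width"] != expected_width or link["speed"] != expected_speed
    ]


# Hypothetical link records; a real run would obtain these from the fabric scan.
links = [
    {"name": "sw1/p1 <-> ws200/p1", "width": "4x", "speed": "10"},
    {"name": "sw1/p2 <-> ws201/p1", "width": "1x", "speed": "10"},
    {"name": "sw1/p3 <-> ws203/p2", "width": "4x", "speed": "5"},
]
for link in unmatched_links(links):
    print(link["name"], link["width"], link["speed"])
```

A link that trained at 1x instead of 4x, or at a lower speed than its peers, is a typical sign of a marginal cable or connector, which is why the check is run on every port.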
10. SymbolErrors: Total number of minor link errors. Usually an 8b/10b error due to a bit error.
LinkRecovers: Total number of times the Port Training state machine has successfully completed the link error recovery process.
LinkDowned: Total number of times the Port Training state machine has failed the link error recovery process and downed the link.
RcvErrors: Total number of packets containing an error that were received on the port. Usually due to a CRC error caused by a bit error within the packet.
RcvSwRelayErrors: Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. This counter should typically be ignored, since Anafa II has a bug that counts these when it gets a multicast packet on a port where that port also belongs to the multicast group of the packet.
XmtDiscards: Total number of outbound packets discarded by the port because the port is down or congested. Usually due to the output port HOQ lifetime being exceeded.
VL15Dropped: Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port.
XmtData, RcvData: Total number of 32-bit data words transmitted and received on the port.
XmtPkts, RcvPkts: Total number of data packets transmitted and received on the port.
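Two points from the counter definitions are easy to get wrong in scripts: XmtData and RcvData count 32-bit words rather than bytes, and RcvSwRelayErrors should usually be left out of error checks. The sketch below shows both. It works on a plain dictionary of counter values; how the values are read from the fabric is outside its scope. The names `over_threshold` and `data_bytes` are hypothetical, and treating the threshold as strictly "greater than" is an assumption rather than a statement about how ibdiagnet applies its limit.

```python
WORD_BYTES = 4  # XmtData / RcvData are counted in 32-bit words

# Error counters from the list above. RcvSwRelayErrors is deliberately left
# out, because the text says it should typically be ignored.
ERROR_COUNTERS = (
    "SymbolErrors",
    "LinkRecovers",
    "LinkDowned",
    "RcvErrors",
    "XmtDiscards",
    "VL15Dropped",
)


def over_threshold(counters, threshold):
    """Return {name: value} for error counters whose value exceeds threshold."""
    return {
        name: counters[name]
        for name in ERROR_COUNTERS
        if counters.get(name, 0) > threshold
    }


def data_bytes(words):
    """Convert an XmtData or RcvData reading from 32-bit words to bytes."""
    return words * WORD_BYTES


# Hypothetical readings for one port.
port = {"SymbolErrors": 12, "LinkDowned": 0, "RcvSwRelayErrors": 400, "XmtData": 1000}
print(over_threshold(port, threshold=1))
print(data_bytes(port["XmtData"]))
```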
11. Example:
  ibdiagnet -pc -r -ls 10 -lw 4x --get_cable_info --pm_pause_time 200

Output files
  ibdiagnet.log       A dump of all the application reports generated according to the provided flags
  ibdiagnet.lst       List of all the nodes, ports, and links in the fabric
  ibdiagnet.fdbs      A dump of the unicast forwarding tables of the fabric switches
  ibdiagnet.mcfdbs    A dump of the multicast forwarding tables of the fabric switches
  ibdiagnet.sm        List of all the SMs (state and priority) in the fabric
  ibdiagnet.pm        A dump of the PM counter values of the fabric links
  ibdiagnet.db_csv    A dump of the internal subnet database

Ibdiagnet usage: Fabric Cleaning
ibdiagnet is particularly useful in finding misconfigured links (speed/width), topology mismatches, and marginal issues.
Typical usage:
• Clear all port counters using ibdiagnet -pc
• Stress the cluster
• Check the cluster using ibdiagnet -P all=1 -ls 10 -lw 4x -pc --get_cable_info --pm_pause_time
  Checks for link speed, link width, and port error counters greater than 1.

Sample output (from a terminal screenshot on host mtilab32):
  PM Counters Info
  No illegal PM counters values were found
  Links With links width != 4x (as set by -lw o
12. portWeb Switches infiniband_switches SX60XX
Follow the upgrade instructions in the User Manual, chapter 4.3.

Host Troubleshooting
• Bad cabling
• IPoIB interface problem
• Missing configuration
• HCA problem
• SM problem
Let's start by checking the basics.

Troubleshooting (cont.)
Be sure cables are plugged in properly.
Check that the SM is running:
• Log in to the master switch CLI.
• Run the command sm info show and make sure that sm mode is enabled and sm state is master.
• Run the command sm info show a few times and make sure the SM activity counter is progressing.
• If the sm state is not master, another switch or node in the fabric is running another SM that may be the master.

Troubleshooting: Make sure the HCA is working
1. Run lspci and check that the Mellanox HCA is identified on the PCI bus.
2. If not, reseat the HCA or the riser card.
3. Replace the HCA with another.
4. Check that the host links are active.

Troubleshooting (cont.): Networking
1. Check that the IPoIB interface is up.
2. Run ifconfig -a to view all network interfaces; it might be that ib0 or ib1 is there but not activated.
3. Run ifconfig; make sure your IPoIB interface is configured
13. Sample ibdiagnet output, continued:
  Links With links width != 4x (as set by -lw option)
  No unmatched Links (with width != 4x) were found
  Links With links speed != 5 (as set by -ls option)
  No unmatched Links (with speed != 5) were found
  Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
  PKey:0x7fff Hosts:2
  IPoIB Subnets Check
  Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
  Suboptimal rate for group. Lowest member rate: 20Gbps > group rate: 10Gbps
  Bad Links Info
  No bad link were found
  Stages Status Report (STAGE, Errors, Warnings):
    Bad GUIDs/LIDs Check
    Link State Active Check
    Performance Counters Report
    Specific Link Width Check
    Specific Link Speed Check
    Partitions Check
    IPoIB Subnets Check
  Please see /tmp/ibdiagnet.log for complete log
  Done. Run time was 1 seconds.

Cluster utilities: ibnetdiscover
Reports a complete topology of the cluster. Shows all interconnect connections, reporting:
• Port LIDs
• Port GUIDs
• Host names
• Link speed
A GUID-to-name file can be used for a more readable topology with regard to switch devices.

Error counter review
Sy