
No Fault Found events in maintenance engineering Part 2: Root

1. Operator handling (ergonomics, training); printed circuit boards (PCB); ageing components and connectors; loose PCB interconnectors; disconnected solder points; damaged wiring or cabling. A recent aerospace survey [16] has ranked intermittent faults as the major cause of NFF events, whereas built-in test equipment (BITE) coverage and software are least likely. This is contrary to the common belief that the majority of failures are due to incompatible or competing software routines between systems [17]. Intermittency is arguably the most problematic cause of NFF events because its elusive nature makes detection by standard test equipment difficult [5]. The faulty state will often lie dormant until a component is back in operational use, where it eventually causes further unit removals unless a genuine cause is found (fault isolation). It should be emphasized that these failures are not always present during testing, which makes them troublesome to isolate. This situation can result in repeated removals of the same equipment for the same symptom, with each rejection resulting in the equipment being tagged as NFF [18]. At this stage there is a very high probability of a loss of system functionality and integrity, and an unacceptable compromise in safety requirements. What is clear is that even though these faults may begin as short duration, low frequency occurrences, as time passes the
2. Establishing a consistent NFF taxonomy.
(2) Failure knowledge bases, novel FMEA tools and troubleshooting guides specific to NFF, to improve diagnostic success rates.
(3) Development of assessment tools to assess maintenance capability or effectiveness, which may include:
    (i) Recording and cross-referencing test station configuration and performance statistics with NFF occurrences [4]. This includes statistics on equipment calibrations.
    (ii) Ensuring that the testing environment is correct, and investigating whether testing procedures need modification to consider multiple environmental factors (humidity, temperature, vibration, etc.) simultaneously.
(4) Introduction of integrity testing as complementary to standard ATE functional testing procedures:
    (i) Integration of on-board health and usage monitoring.
    (ii) Standardization for intermittent testing and procedures for dealing with intermittent fault occurrences.
(5) NFF-specific maintenance cost models for design justification and NFF tracking.
(6) Modeling of complex interactions between system and components and their physics of failure.
(7) Modeling of intermittent failures from a fundamental perspective, including standardized testing equipment and procedures.

Acknowledgements

This research was partially supported by the Engineering and Physical Sciences Research Council (EPSRC), Ministry of Defence, BAE Systems, Bombardier Transportation and Rolls-Royce. The S K
3. [36] Knotts RM. Civil aircraft maintenance and support: fault diagnosis from a business perspective. J Qual Maintenance Eng 1999;5(4):335–48.
[37] Granstrom R, Soderholm P. Condition monitoring of railway wheels and no fault found problems. Int J COMADEM 2009;12(2):46–53.
[38] Henning S, Paasch R. Designing mechanical systems for optimum diagnosability. Res Eng Des 2010;21(2):113–22.
[39] Phillips P, Diston D. A knowledge driven approach to aerospace condition monitoring. Knowl Based Syst 2011;24(6):915–27.
[40] Nowlan FS, Heap HF. Reliability centered maintenance. San Francisco, CA: United Air Lines Inc; 1978.
[41] Moubray J. Reliability centered maintenance. Industrial Press Inc; 2001.
[42] D'Eon P. Reducing NFFs through knowledge sharing. In: First annual symposium on tackling no fault found in maintenance engineering; 2013.
[43] Pecht M. Prognostics and health monitoring of electronics. John Wiley & Sons; 2008.
[44] Task Force MSG. Maintenance program development document MSG-3. Washington, DC: Air Transport Association (ATA) of America; 1993.
[45] Ahmadi A, Söderholm P, Kumar U. On aircraft scheduled maintenance program development. J Qual Maintenance Eng 2010;16(3):229–55.
[46] Huby G, Cockram J. The system integrity approach to reducing the cost impact of no fault found and intermittent faults. In: UK RAeS airworthiness and maintenance conference; 2010.
[47] Kumar S, Vichare NM, Dolev E, Pecht M. A health indicator method for
4. testability methods and the necessary design guidance to mitigate the problem are covered in Section 6.

2 No fault found occurrences in systems

2.1 Electronic systems

Electronic failures are often considered not as static, nor as random or pseudorandom events, but rather as the result of mechanical and material changes [9,10]. These changes seldom lead to a loss of functionality of an electronic system, even though its components may be out of specification. This is due to the electronics having an inherent self-compensating aspect that makes the task of failure diagnostics difficult and directly contributes to an unsuccessful diagnosis. In addition, degradation failure modes often manifest differently depending upon the operating environment (which may offset components) and the circuit configuration [11]. Thomas et al. [12] and Renner [13] investigated the root causes of NFF in automotive electronic systems. It was revealed that an overwhelming majority of occurrences can be traced back to poor manufacturing (i.e. soldering and Printed Circuit Board (PCB) assembly) and inherent design flaws, which include violations of specifications. Vichare and Pecht [10], Qi et al. [14] and Moffat [15] have summarized some generic causes of failures within electronic systems: interconnect failures, including connectors; system design (electrical and mechanical); environmental conditions (temperature, moisture, chemicals, mechanical stresses);
5. ventional testing equipment do not provide effective test coverage for intermittency, one of the major drivers for NFF. Other alternatives that address the intermittency problem with traditional measurements include tracking and comparing circuits down to fractions of a milliohm, one circuit at a time, against long-running records of similar measurements. However, there are some major limitations to this approach: when an intermittent circuit is in a temporary working state it will generally pass such tests, and only those approaching hard failure status will be detected this way. Also, measuring fractions of a milliohm and attempting to take meaningful action based on these values is extremely difficult and time consuming, and requires precise control of the test set-up and test environment. Appropriate test equipment is required to address the intermittency issue and to resolve all of the variables causing this unpredictability, providing the maintainer with a quick and comprehensive route to a successful outcome. Overcoming the testing challenges posed by intermittent problems requires a different approach to that of using conventional digital equipment predicated on accuracy of measurements and time-consuming results analysis. Truly effective and practical detection of intermittency requires improved test coverage and, consequently, a vastly improved probability of detection. There are also a variety of other high
6. Achieving diagnostic success

In order to improve diagnostic success rates, improvements need to be made to the processes, procedures and technology which have failed. Initial research shows that work towards this goal is patchy and there is definitely more to do. There is almost certainly not one universal industrial solution. The current key areas for NFF mitigation are focused around understanding test coverage (represented by BIT/BITE/ATE deficiencies), the development of new maintenance troubleshooting tools, techniques and concepts, as well as changes to management processes. Accurate fault models, fault/event trees and system understanding are paramount to recognizing false BIT alarms caused by such things as sensor system synchronization. Also, new systematic tests should be identified in the product design. These tests would aim at allowing multiple testing of stressors, identifying weaknesses and flaws, and the critical contributors to failures, before the product is put into service.

6 Concluding remarks

An important part of any new research subject is the design and maintenance of a reference collection of relevant publications. To the best of the authors' knowledge, the performed study has moved the body of scientific knowledge forward by reviewing existing literature related to NFF and pointing out core gaps where current efforts should be focused. An attempt is made to comprehensively review academic journal literature and confe
7. In accordance with ARINC 672 [77], diagnostic testing should consider multiple-level tests, e.g. during operation and at different maintenance echelons. Historically, BIT was designed and used primarily for in-field maintenance by the end user, but it is now used in ever more diverse applications, which include oceanographic systems, multichip modules, large-scale integrated circuits, power supply systems, avionics and also passenger entertainment systems for the Boeing 767 and 777 [72]. BIT is used to indicate system status, providing valuable information to locate the exact system components that need to be replaced, and to indicate whether or not a system has been assembled correctly. Failures reported by BIT tests can be costly and are likely to result in unit replacements, recertification or inevitable loss of availability of the equipment [1]. Even though these checks may be designed as a means to detect and locate equipment faults, there are a variety of shortcomings which contribute to the NFF phenomenon. Many experts advocate that the design of a BIT system is a non-trivial task that relies deeply on knowledge of all the system interactions [5,43]. Because of this, it is often difficult to define a fixed set of test procedures that can verify the full functionality of a component. This has led to log reports containing spurious fault detections. For example, operator (pilot) reports of faults often do not
8. Risk priority numbers can also be assigned to each of the failure modes based on factors such as detectability, severity and occurrence.

can occur on any aircraft would be swiftly and correctly identified by any maintenance personnel following step-by-step procedures. However, if the FIM fails to identify the problem, the maintainers rely heavily on their experience [5]. Other resources are often used to help (escalation channels, technician training, supporting documentation, etc.).

(4) On-site or practical feedback: to close the loop with reliability, new system failure modes are often discovered, adding to the troubleshooting difficulties [26]; this acts as a source of feedback to design engineering for reliability improvements.

3.1 Health and usage monitoring

Condition-based maintenance (CBM) programmes can be aimed at either fault diagnostics or prognostics [35]. Diagnostics refers to a posterior event analysis and deals with fault detection (indicates a fault has occurred), fault isolation (faulty component is identified) and fault identification (the nature of the fault is determined). Prognosis is a prior event analysis and deals with failure prediction before faults occur, making use of in situ sensors and physics-of-failure models [27]. If it is possible to assess in situ the extent of degradation of electronic systems, then such data would be invaluable in meeting the objective of providing efficient fault detection and identification
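The risk priority numbers mentioned at the start of this passage can be illustrated with a minimal Python sketch. It assumes the conventional FMEA definition, RPN = severity × occurrence × detectability, with each factor scored from 1 to 10; the text names the three factors but does not state scales or a formula, and the failure modes and scores below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """One FMEA row; each score is on an assumed 1-10 scale."""
    name: str
    severity: int       # 1 = negligible effect, 10 = hazardous
    occurrence: int     # 1 = remote, 10 = very frequent
    detectability: int  # 1 = almost certain to be detected, 10 = undetectable

    def rpn(self) -> int:
        """Risk priority number as the product of the three scores."""
        for score in (self.severity, self.occurrence, self.detectability):
            if not 1 <= score <= 10:
                raise ValueError(f"score {score} is outside the assumed 1-10 scale")
        return self.severity * self.occurrence * self.detectability


def rank_by_rpn(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical failure modes and scores, for illustration only.
modes = [
    FailureMode("chafed harness wiring", 9, 2, 5),
    FailureMode("intermittent connector contact", 7, 4, 9),
    FailureMode("cracked solder joint", 8, 3, 7),
]

for mode in rank_by_rpn(modes):
    print(f"{mode.name}: RPN = {mode.rpn()}")
```

Under this scoring convention a hard-to-detect mode, such as an intermittent contact, can outrank a more severe but easily detected one, which is one reason detectability is worth recording for NFF-prone items.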
9. White R, Richardson B. Anecdotal experiences on the value of limited environmental testing for the analysis of no fault found assemblies. In: AUTOTESTCON Proceedings; 2011. p. 292–296.
[93] Ramsey J. Special report: avoiding NFF. Avionics Magazine; 2005.
[94] Chang YS, Oh CH, Whang YS, Lee JJ, Kwon JA, Kang MS, et al. Development of RFID enabled aircraft maintenance system. In: Industrial informatics, 2006 IEEE international conference on. IEEE; 2006. p. 224–229.
[95] He W, Xu C, Ao Y, Xiao X, Lee EW, Tan EL. RFID enabled handheld solution for aerospace MRO operations track and trace. In: Emerging technologies & factory automation (ETFA), 2011 IEEE 16th conference on. IEEE; 2011. p. 1–8.
[96] Narsing A. RFID and supply chain management: an assessment of its economic, technical and productive viability in global operations. J Appl Bus Res (JABR) 2011;21(2):1–6.
[97] O'Connor M. Boeing wants dreamliner parts tagged. RFID J 2005.
[98] Roberti M. Boeing, airbus team on standards. RFID J 2004.
10. the DC resistance becomes constant. The use of RF impedance is researched at length by Kwon [72], who demonstrates prognostic capabilities that are able to predict the remaining useful life of the solder joint with an error less than 3. The research also demonstrates the ability to distinguish between two competing interconnect failure modes, solder joint cracking and pad cratering; the need for such failure distinctions in this case, however, is unclear. The use of embedded molecular test equipment within ICs, enabling them to continuously test themselves during normal operation and providing visual indications of failure, has been proposed by GMA Industries as one of the more advanced and futuristic monitoring technologies [29]. The sensors are used to measure electrical parameters and various signals, such as current and voltage, as well as sensing changes in the chemical structure of integrated circuits that are indicative of developing failure modes. The basic structure of the sensors is carbon nanotubes, and the integration of these sensors with conventional ICs, along with molecular wires for the interconnecting sensor networks, is the important focus of this research. However, no details of demonstrable in-service products or prototypes are given, and to date no research paper offering proof of the applicability of the concept has been found. Recently, a sensitive analyzer was introduced by Universal Synaptic to simultaneously monitor test lines f
11. Jin T, Janamanchi B, Feng Q. Reliability deployment in distributed manufacturing chains via closed loop six sigma methodology. Int J Prod Econ 2011;130(1):96–103.
[10] Vichare NM, Pecht MG. Prognostics and health management of electronics. IEEE Trans Compon Packag Technol 2006;29(1):222–9.
[11] Line JK, Krishnan G. Managing and predicting intermittent failures within long life electronics. In: Aerospace conference, 2008 IEEE; 2008. p. 1–6.
[12] Thomas DA, Ayers K, Pecht M. The trouble not identified phenomenon in automotive electronics. Microelectr Reliab 2002;42(4):641–51.
[13] Renner JH. Reliability engineering: an integrated approach at Daimler Chrysler. In: Integrated reliability workshop final report, 1999 IEEE International; 1999. p. 152–153.
[14] Qi H, Ganesan S, Pecht M. No fault found and intermittent failures in electronic products. Microelectr Reliab 2008;48(5):663–74.
[15] Moffat BG, Abraham E, Desmulliez MP, Koltsov D, Richardson A. Failure mechanisms of legacy aircraft wiring and interconnects. IEEE Trans Dielectr Electr Insul 2008;15(3):808–22.
[16] Huby G. No fault found: aerospace survey results. Technical report. Copernicus Technology Ltd; 2012.
[17] Jones J, Hayes J. Investigation of the occurrence of no faults found in electronic equipment. IEEE Trans Reliab 2001;50(3):289–92.
[18] James IJ. Learning the lessons from in service rejection. In: Systems reliability and maintain
12. always correspond to the test logs, resulting in overlooked maintenance issues. Also, even with the sophistication of modern tests, there is still a major issue of removed units reported by the test to be at fault but, upon testing, being found to have no faults, or even faults that do not correlate to the BIT reports. As well as the false alarm issue, other factors such as assessment coverage and inappropriate parameter limits can in turn contribute to NFF events [2]. Assessment coverage deals with the nature of the BIT, which could be designed in several different ways, making the checks dependent on the monitored equipment and system scale. A system-wide BIT will either be centralized, where dedicated hardware is used to control all functions, or decentralized, where a number of test centers can be incorporated and processed at the Line Replaceable Unit (LRU) level. Decentralization of tests enables the functionality of key circuits to be checked, helping to identify problems much closer to the root causes than is the case in the centralized view, making for cost-effective assembly and maintenance operations [43]. The nature of BITs will be in some way dependent upon a set of pre-defined statistical limits for the various parameters which

11 This has been discussed in Part 1, Section 4.
12 A Line Replaceable Unit (LRU) level is the lowest level when a modular or sub-unit item of the system can be easily replaced and quickly intercha
13. conference proceedings, Vol. 5; 2004. p. 3354–3360.
[65] Sharma CR, Furse C, Harrison RR. Low power STDR CMOS sensor for locating faults in aging aircraft wiring. IEEE Sens J 2007;7(1):43–50.
[66] Lo C, Furse C. Noise domain reflectometry for locating wiring faults. IEEE Trans Electromagn Compat 2005;47(1):97–104.
[67] Chung YC, Furse C, Pruitt J. Application of phase detection frequency domain reflectometry for locating faults in an F-18 flight control harness. IEEE Trans Electromagn Compat 2005;47(2):327–34.
[68] Furse C, Chung YC, Lo C, Pendayala P. A critical comparison of reflectometry methods for location of wiring faults. Smart Struct Syst 2006;2(1):25–46.
[69] Parkey CR, Hughes C, Caulfield M, Masquelier MP. A method of combining intermittent arc fault technologies. In: AUTOTESTCON Proceedings; 2012. p. 244–249.
[70] Smith PA, Campbell DV. A practical implementation of BICS for safety critical applications. In: Defect based testing, 2000. Proceedings. 2000 IEEE international workshop on. IEEE; 2000. p. 51–56.
[71] Bhatia A, Hofmeister JP, Judkins J, Goodman D. Advanced testing and prognostics of ball grid array components with a stand alone monitor IC. IEEE Instrument Meas Mag 2010;13(4):42–7.
[72] Kwon D. Detection of interconnect failure precursors using RF impedance analysis. PhD thesis. University
14. finish is well known to produce conductive metal whiskers that are capable of producing unintended current paths. These failures usually appear intermittently, making it difficult to identify them as a root cause of the problem: the whiskers are easily broken off and can melt, removing a previously existing short [8]. In the case of a reported failure where there is no hard or definite symptom for a sufficient fault diagnosis, there will be a need for additional technical data or specialist technical knowledge. This can be in the form of maintenance history, troubleshooting guides, or expertise from experienced colleagues and specialists [2,5].

2.1.2 Harness wiring

A key aspect of interconnect and wiring related failures is that they will often not be detected by the traditional one-path-at-a-time sequential mode of analysis [22]. The traditional approach not only fails to spot time-dependent failures, such as those exhibited under vibration, but could inherently ignore combinatorial faults that occur due to wire-to-wire interactions. Another issue is chafed wiring, which occurs where a harness is routed through a structure that experiences high vibration levels. Unless adequate protection such as cable clamps, ties, sleeving, etc. is provided, the wiring bundle will brush the structure in such a way that

These failures can occur under several scenarios; a common failure is where surface mount packages are knocked off during soc
15. of Maryland; 2010.
[73] Steadman B, Berghout F, Olsen N, Sorensen B. Intermittent fault detection and isolation system. In: AUTOTESTCON, 2008 IEEE. IEEE; 2008. p. 37–40.
[74] Sorensen B. Apparatus for testing multiple conductor wiring and terminations for electronic systems. U.S. patent no. 8,103,475; 2012.
[75] Muja O, Lamper D. Automated fault isolation of intermittent wiring conductive path systems inside weapons replaceable assemblies. SAE Int J Aerospace 2012;5(2):579–89.
[76] Smith P, Kuhn P, Furse C. Intermittent fault location on live electrical wiring systems. SAE Int J Aerospace 2009;1(1):1101–6.
[77] A W 672. Guidelines for the reduction of no fault found (NFF). ARINC; 2008.
[78] Rosenthal D, Wadell BC. Predicting and eliminating built-in test false alarms. IEEE Trans Reliab 1990;39(4):500–5.
[79] Ungar LY, Kirkland LV. Unraveling the cannot duplicate and retest OK problems by utilizing physics in testing and diagnoses. In: AUTOTESTCON Proceedings; 2008. p. 550–555.
[80] Metra C, Francescantonio SD, Mak T. Clock faults impact on manufacturing testing and their possible detection through on-line testing. In: Test conference, 2002. Proceedings. International. IEEE; 2002. p. 100–109.
[81] O'Connor P. Testing for reliability. Qual Reliab Eng Int 2003;19(1):73–84.
[82] Qingchuan H, Wenhua C, Jun P, Ping Q. Improved step stress accelerated life testing method for electro
16. present operating context [40,41].
7 Maintenance Steering Group-3 (MSG-3) based maintenance provides a top-down approach to determine the most applicable maintenance schedule and interval for an aircraft's major components and structure. The methodology effectively delivers significant improvements in an aircraft's availability and operational safety whilst optimizing the costs of ownership [44,45].
8 Failure mode, effects and criticality analysis (FMECA) is an extension of FMEA [48]

the influence of life cycle loads on a specific mission-critical system. The added bonus of this data is that it provides the foundations for troubleshooting NFFs, which can aid in re-evaluating system avionic design and establishing models for life cycle analysis. Life cycle monitoring has been used to conduct prognostic remaining useful life (RUL) estimates of circuit cards inside a space shuttle's solid rocket booster [51]. Vibration time history was recorded throughout all stages of the shuttle's mission and used with physics-based damage assessment models to predict the health and time before the next expected electronic failure. A similar methodology was applied to the end effector electronics unit inside the robotic arm of the space shuttle's remote manipulator system [52]. In this case, loading profiles for both thermal and vibrational loads were used with damage models, inspections and accelerated testing to predict the component integrity over a 20
17. profile integrity testing methods currently being championed. Most notable of these are the use of X-ray and thermal imaging. X-ray inspection can non-invasively highlight shorts or coupling faults buried within the layers of multilayer printed circuit boards. Sankaran et al. [88] discuss the use of X-ray laminography for accurate measurements of solder joint structures through 3D image reconstruction using artificial neural networks. Automated inline systems based on X-ray transmission have several advantages over optical inspection. Optical inspection is restricted to surface inspection of visible solder joints; consequently, leads and ball grid arrays cannot be inspected by optical means. More sophisticated features concerning the solder volume, fillet, voids and solder thickness can reliably be determined only by X-ray transmission. Therefore, X-ray inspection generally achieves a better test performance in terms of false alarm rate and escape rate, and it is to be favored for closed-loop process control [89]. The use of infrared imaging for non-destructive evaluation of electrical component

15 Physics of Failure (PoF) is a concept utilized to understand the processes and mechanisms that induce failure within a component. This includes studying physical, chemical, mechanical, electrical or thermal aspects which influence the performance of the component over time until it eventually fails to meet any system requirements.
18. sensor. Lo and Furse [66] provide research into similar faults, but using a different kind of reflectometry known as noise domain reflectometry (NDR), which makes use of existing data signals in the wiring. With this method, results show the potential to localise intermittent faults to within 3 inches in 180 ft of electrical wiring. However, caution must be taken when using these methods, as little is known about the impedance profile of intermittent faults, with the exception of open and short circuits. Also promising are reflectometry methods that are proving useful when applied to locating intermittency in an F-18 flight control harness [67], although they do require exceptional accuracy in baseline comparisons. In civil and military aerospace, recording and maintaining TDR data archives for even a limited number of circuits may prove to be enormous and costly [68]. Another technique, called spread spectrum time domain reflectometry S

Various reliability and maintenance databases can be compiled, such as [63], eliciting information useful in scheduling maintenance and design activities.
10 FRACAS (Failure Reporting, Analysis and Corrective Action System) is a reactive procedure often utilized after failures have occurred within a system. It is used to collect data, report, categorize and analyze information, and to plan corrective actions in response to those failures.
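Reflectometry methods locate a wiring fault from the delay of a reflected signal. The following is a minimal Python sketch of the underlying time-of-flight arithmetic only, not of the NDR or TDR implementations in the cited works. It assumes a known, constant velocity factor for the cable; the function name and the example values are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def fault_distance_m(round_trip_s: float, velocity_factor: float) -> float:
    """Estimate the distance to an impedance discontinuity.

    round_trip_s:    delay between the incident signal and its reflection.
    velocity_factor: propagation speed in the cable as a fraction of the
                     speed of light (assumed known and constant here).
    """
    if round_trip_s < 0:
        raise ValueError("round-trip delay must be non-negative")
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity factor must be in the interval (0, 1]")
    # The signal travels to the fault and back, hence the division by two.
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor * round_trip_s / 2


# Hypothetical example: a 100 ns round trip on a cable with velocity factor 0.7.
print(f"{fault_distance_m(100e-9, 0.7):.2f} m")
```

Because distance scales linearly with delay, resolving a fault to within a few inches implies resolving the delay to a fraction of a nanosecond, which is consistent with the need for accurate baselines noted above.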
19. year period. Lall et al. [53] presented a methodology to calculate prior damage in electronic interconnects operating in harsh environments, and hence subjected to highly cyclic and isothermal thermo-mechanical loads, with assessment predictions in good correlation with experimental data using health monitoring tools. Understanding electronics from a system point of view, rather than as a set of individual components, is claimed by VEXTEC Corporation to be paramount to developing life cycle prognostic models as part of a failure reduction methodology [11]. The proposed methodology has far-reaching consequences for how operators can manage a fleet of aircraft based upon risk rather than guessing degradation levels. It is argued that, by doing this, NFF events can be reduced through the ability to prioritise the order in which components are replaced during a reported failure event, based on probabilities. Developing methodologies and damage assessment algorithms are generally aimed at creating an in situ load monitoring and prognostic capability. This is explored by Vichare et al. [54], who provide the necessary considerations for raw data processing during in situ monitoring, and methods to reduce memory requirements and power consumption. These are key factors that often limit the integration of health monitoring systems, particularly into aircraft. Skormin et al. [55] developed failure prognostics for aircraft avionics using data mining models with me
20. Fault diagnostics

issue, the authors examine the technical aspects, reviewing the common causes of NFF failures in electronic, software and mechanical systems. This is followed by a survey of the technological techniques actively being used to reduce the consequence of such instances. After discussing improvements in testability, the article identifies gaps in the literature and points out the core areas that should be focused on in the future. Special attention is paid to the recent trends in knowledge sharing and troubleshooting tools, with potential research on technical diagnosis being enumerated.

© 2013 Elsevier Ltd. All rights reserved.

Contents

  1 Introduction
  2 No fault found occurrences in systems
    2.1 Electronic systems
      2.1.1 Printed circuit board interconnectors
      2.1.2 Harness wiring
    2.2 Mechanical systems
21. I, Popyack LJ. Data mining technology for failure prognostic of avionics. IEEE Trans Aerospace Electr Syst 2002;38(2):388–403.
[56] Karim R, Candell O, Soderholm P. E-maintenance and information logistics: aspects of content format. J Qual Maintenance Eng 2009;15(3):308–24.
[57] Larsson-Kraik P-O. Managing avalanches using cost benefit risk analysis. Proc Inst Mech Eng Part F: J Rail Rapid Transit 2012;226(6):641–9.
[58] Stamatis DH. Failure mode and effect analysis: FMEA from theory to execution. ASQ Press; 2003.
[59] Byington CS, Kalgren P, Dunkin BK, Donovan BP. Advanced diagnostic/prognostic reasoning and evidence transformation techniques for improved avionics maintenance. In: Aerospace conference, 2004. Proceedings. IEEE, Vol. 5; 2004.
[60] Ungar LY. Testability design prevents harm. IEEE Aerospace Electr Syst Mag 2010;25(3):35–43.
[61] Morris NM, Rouse WB. Review and evaluation of empirical research in troubleshooting. Hum Factors: J Hum Factors Ergon Soc 1985;27(5):503–30.
[62] D'Eon P, Langley M, Atamer A. Case based reasoning system and method having fault isolation manual trigger cases. U.S. patent application 11/734,862; 2007.
[63] Millar RC, Mazzuchi T, Sarkani S. Application of non-parametric statistical methods to reliability database analysis. SAE technical papers.
[64] Atamer A. Comparison of FMEA and field experience for a turbofan engine with application to case based reasoning. In: IEEE aerospace
22. OTS solutions, unpredictable effects and integration faults are likely to undermine critical software functions, which can be difficult to diagnose and locate [30]. Investigations into failures within aerospace missions have highlighted critical failures that are due to such components, along with incomplete software specifications [31]. Many of the reported issues in this paper can be attributed to complacency, misunderstanding of software functions and the way they interact, and a failure to apply good practice principles. In many cases, desired sources of information are not readily available, are incorrectly configured to support rapid diagnostics, or lack sufficient depth of information and practicality. Additional factors include the failure to complete or store documentation and the lack of robust diagnostic fault trees connecting event system faults [5]. This arises when a unit is replaced without the nature of the fault being determined, which risks the fault recurring and causing an NFF event. The complexity brought by embedded software and electronics poses unprecedented challenges in maintenance and repair, threatening customer satisfaction and causing increasing warranty costs on repair [32,33].

3 Emerging resolution practices

From a technical standpoint, an NFF-tagged component is the result of an unsuccessful or inefficient troubleshooting regime of an unplanned maintenance event. Several maintenance strategies are usually sough
23. Reliability Engineering and System Safety 123 (2014) 196–208

Review

No Fault Found events in maintenance engineering Part 2: Root causes, technical developments and future research

Samir Khan, Paul Phillips, Chris Hockley, Ian Jennions

EPSRC Centre, School of Applied Sciences, Cranfield University, College Road, Cranfield, Bedfordshire MK43 0AL, United Kingdom
Cranfield Defence and Security, Cranfield University, The Mall, Shrivenham, Oxfordshire SN6 8LA, United Kingdom
IVHM Centre, School of Applied Sciences, Cranfield University, University Way, Cranfield, Bedfordshire MK43 0FQ, United Kingdom

Article info: Available online 22 November 2013
Keywords: No fault found; Test equipment; Troubleshooting failures

Abstract

This is the second half of a two paper series covering aspects of the no fault found (NFF) phenomenon, which is highly challenging and is becoming even more important due to increasing complexity and criticality of technical systems. Part 1 introduced the fundamental concept of unknown failures from an organizational, behavioral and cultural stand point. It also reported an industrial outlook to the problem and recent procedural standards, whilst discussing the financial implications and safety concerns. In this
24. STDR is commercially being used to identify faults in electrical wires by observing reflected spread spectrum signals (Parkey et al. [69]). CMOS integrated circuits (ICs) are routinely tested using supply current monitoring, which is based upon the knowledge that a defective circuit will draw a significantly different amount of current than fault free circuits. Smith and Campbell [70] have developed an in situ quiescent current monitor that detects, in real time, elevations in the leakage current drawn by the IC whilst in a stable state. Other similar current monitors have been reviewed by Pecht [43]. Damage to electronic solder joints is a major contributor to intermittency in electronics and hence a direct contributor to the NFF phenomenon. Damaged solder points are notoriously difficult to detect without extensive visual inspections. They do, however, produce large variations in thermal resistance, which can potentially be used for monitoring solder joint fatigue inside the packaging of power modules. Bhatia et al. [71] have used this principle as a basis to develop and test a new solder joint fault sensor, known as the SJ Monitor, which provides the ability to monitor selected I/O pins of powered off FPGAs. RF impedance is also used as a failure precursor and offers interesting prognostic capabilities for solder joint failures, because impedance increases gradually and non-linearly as damage increases, whereas
25. This would include evidence of failed equipment found to function correctly when tagged as NFF, and hence improve maintenance processes, extend life, reduce whole life costs and improve future designs. There is currently a drive in the majority of industries to turn away from the more traditional preventive and reactive maintenance actions described above in favor of more predictive and proactive solutions [21]. CBM is often regarded as the most advanced predictive maintenance strategy, and hence could be aimed at reducing the number of machinery breakdowns through fault detection at an early, incipient stage [5,10,36]. It makes use of measurements of physical parameters while monitoring the trends over time; any indication of abnormal behavior will trigger a warning. In its simplest form, threshold warning levels are constructed to trigger maintenance activities when a specific parameter shows measurements outside of the threshold regions. In corrective maintenance, much of the time is spent on locating a defect, which often requires a sequence of disassembly and reassembly. Recently, condition monitoring of railway wheels with NFF problems was investigated by Granstrom and Soderholm [37]. The authors provided a perspective on how such technologies can be applied and utilized for more effective and efficient maintenance management, while initiating a discussion on the maintenance requirements of systems and the management regimes which are f
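The threshold warning idea described for CBM above can be illustrated with a minimal sketch. It is not taken from any of the cited works: the `Threshold` class, the `threshold_warnings` function and the vibration readings are hypothetical and serve only to show how readings outside a fixed band might be flagged.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Threshold:
    """Hypothetical acceptable band for one monitored parameter."""
    low: float
    high: float


def threshold_warnings(
    samples: Iterable[Tuple[float, float]], band: Threshold
) -> List[Tuple[float, float]]:
    """Return the (time, value) samples that fall outside the band."""
    return [(t, v) for t, v in samples if v < band.low or v > band.high]


# Illustrative readings only: (time in hours, vibration amplitude in mm/s).
readings = [(0.0, 2.1), (1.0, 2.3), (2.0, 2.2), (3.0, 4.8), (4.0, 2.4)]
band = Threshold(low=0.5, high=4.0)

print(threshold_warnings(readings, band))  # [(3.0, 4.8)]
```

A fixed band of this kind is the simplest case; trend-based monitoring would compare successive readings rather than each reading in isolation.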
26. ability Ref No 1999 189 IEE Seminar 1999 pp 6 1 6 4 19 Gibson AW Choi S Bieler TR Subramanian KN Environmental concerns and materials issues in manufactured solder joints In Electronics and the Environment 1997 ISEE 1997 Proceedings of the 1997 IEEE international symposium on 1997 p 246 251 20 Swingler J The automotive connector the influence of powering and lubricating a fretting contact interface Proc Inst Mech Eng Part D J Autom Eng 2000 214 6 615 23 21 Khan S Phillips P Tackling no fault found in maintenance engineering In First annual symposium in no fault found 2013 22 Shawlee W Humphrey D Aging avionics what causes it and how to respond IEEE Trans Compon Packag Technol 2001 24 4 739 40 23 Khan S Phillips P Hockley C Jennions I Towards standardisation of no fault found taxonomy In First international through life engineering services conference 2012 2012 p 246 253 24 Warrington L Jones JA Davis N Modelling of maintenance within discrete event simulation In Reliability and maintainability symposium 2002 Pro ceedings annual IEEE 2002 p 260 265 25 Ramohalli G The honeywell on board diagnostic and maintenance system for the boeing 777 In Digital avionics systems conference 1992 Proceedings IEEE AIAA 11th IEEE 1992 p 485 490 26 Beniaminy I Joseph D Reducing the no fault found problem contributions from expert system methods In Aerospace confere
27. asured parameters, which included vibration, temperature, power supply, functional overload and air pressure. These parameters were measured in situ using time stress measurement devices. The purpose of the model included understanding how measured environmental factors impact upon a particular failure, investigating the role of combined parameter effects, and re-evaluating the probability of failure based on the known exposure to adverse conditions.

3.1.2. Knowledge sharing

Engineers have recently emphasized that there is a need for on-field experience to be shared within a troubleshooting workflow repository [21]. Aspects of content sharing, such as e-maintenance [56], can be beneficial for other maintenance personnel, who will then be able to identify the cause of a problem on their first attempt, whenever or wherever it next occurs. Furthermore, the knowledge captured over time can assist designers in improving the reliability of the equipment. At the core of the challenge for better troubleshooting is the difference between anticipated failures captured within the design and the actual failures that appear in service. When complex equipment is designed, engineers typically identify the potential failure modes and their effects on the system using an FMEA. With this information, it can be determined how best to employ on-board diagnostic or BIT technologies to detect failures. These can implement Prognostics and Heal
28. ating environment is needed. Once this is known, appropriate test equipment can be selected to support the ATE, which, through interpretation of the physics (for example, of circuits under the test environment), can be used as fault locators, a capability often beyond that of standard ATE. In fact, Kimseng et al. identified a PoF process to identify, induce and analyze not only failure mechanisms causing intermittent failures, but also high warranty returns and NFF problems of digital electronics [85]. As previously discussed, many of the faults which contribute to NFF events in electronics are of an intermittent nature. These usually pose a challenge, at the necessary levels of confidence and efficiency, to standard signal processing algorithms, which are often designed with permanent faults in mind [86]. Some work on resolving such issues has been carried out using algorithms that make use of Bayesian networks to decompose large systems containing multiple components that may potentially fail during operation [87]. Such probabilistic approaches often prove useful for studying the performance behavior of underperforming subsystems that eventually lead to a system failure. Typically, circuits are tested one at a time, or just a few at a given time, and unless the intermittent fault occurs within the time window of the test, the fault will go undetected [74]. This is compounded further by digital averaging of results, which indicates that con
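To give a feel for the test window problem noted above, the following sketch estimates the chance that an intermittent event falls inside a test of a given length. It rests on an assumption that is not made in the source: intermittent events are treated as a Poisson process with a constant rate, so the probability of at least one event in a window of length T is 1 - exp(-rate * T). The function names and the example rate are hypothetical.

```python
import math


def detection_probability(rate_per_hour: float, window_hours: float) -> float:
    """Chance of at least one intermittent event inside the test window.

    Assumes (illustratively) a Poisson process with a constant event rate.
    """
    if rate_per_hour < 0 or window_hours < 0:
        raise ValueError("rate and window must be non-negative")
    return 1.0 - math.exp(-rate_per_hour * window_hours)


def window_for_confidence(rate_per_hour: float, confidence: float) -> float:
    """Test duration needed to observe an event with the given probability."""
    if rate_per_hour <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("rate must be positive and confidence in (0, 1)")
    return -math.log(1.0 - confidence) / rate_per_hour


# Hypothetical fault that shows itself on average once every 50 hours.
rate = 1.0 / 50.0
print(f"30 minute bench test: {detection_probability(rate, 0.5):.4f}")
print(f"hours for 90% chance: {window_for_confidence(rate, 0.9):.1f}")
```

Under these assumptions a short bench test has roughly a one percent chance of coinciding with the fault, which is consistent with the observation that such faults usually go undetected.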
29. ault diagnostics and system design have been the main focus for NFF journal publications within the past two decades. Part 1 also focused on no fault found (NFF) standards and how such events can cause unprecedented changes in service performance, impact dependability and escalate safety concerns. This has long been revealed with a variety of products within a wide range of industries [1,2,3,4]. This paper aims to elaborate on these outlooks from Part 1 whilst examining the technical aspects for complex systems and equipment, particularly products integrated within aircraft computer systems, and how such events can have a significant effect upon the overall unit removal rate. Historically, such removals have been seen as an unavoidable nuisance [5], but this viewpoint is no longer acceptable if the unit removal rate is to be managed effectively [6,7]. Unlike those failures that result in Confirmed Faulty events, the designer may have no direct influence on those aspects of the system that determine the NFF failure rate; a direct mitigating action during the design phase is therefore likely to be more difficult. It can be argued that any product removal that does not exhibit a failure during subsequent acceptance test can be tagged as NFF. Also, for a number of these events, further investigation could conclude that the removal event was categorically caused by an external effect. None the less, this would still be clas
30. but solutions to mitigate the problem are certainly not universal, even within some individual organizations, let alone across a common industry sector. Some of this effort is being directed at the design and production stages, where there is a need to create more fault tolerant systems, which perhaps incorporate in-built redundancy or self-testing mechanisms. There is also a requirement for thorough research into understanding intermittency. Understanding intermittent faults will rely on the ability to describe the various interactions accurately, including how mechanical, software and electronic elements all have to interact together. Modeling of intermittent faults will be required, but will need to include probabilities of fault detection and the effects intermittent failures have on other dependent systems. A thorough understanding of individual systems will be required in order to provide fault models and models that deal with false BIT alarms and the root causes of BIT deficiency. In some industries and individual companies, adopting better prognostics has ensured that important operational parameters are monitored at all times to identify adverse and out-of-limits variations. These technologies have helped to introduce a change from a policy of reactive maintenance to a predictive policy, which concentrates on providing vital information on the root causes of failures that is not provided by traditional BIT/BITE. Other techn
31. degradation detection of electronic products Microelectr Reliab 2012 52 2 439 45 48 Hoyland A Rausand M System reliability theory models and statistical methods Wiley 2009 Chapter 3 49 Born FH Boenning RA Marginal checking a technique to detect incipient failures In IEEE proceedings of the national aerospace and electronics conference Vol 4 1989 p 1880 1886 cited By since 1996 2 50 Burns DJ Cluff KD Karimi K Hrehov DW A novel power quality monitor for commercial airplanes In Conference record IEEE instrumentation and mea surement technology conference Vol 2 2002 p 1649 1653 51 Mathew S Das D Osterman M Pecht M Ferebee R Prognostics assessment of aluminum support structure on a printed circuit board J Electr Packag 2006 128 4 339 52 Shetty V Das D Pecht M Hiemstra D Martin S Remaining life assessment of shuttle remote manipulator system end effector In Proceedings of the 22nd space simulation conference 2002 p 2123 53 Lall P Hande M Bhat C Suhling J Lee J Prognostic health monitoring phm for prior damage assessment in electronics equipment under thermo mechanical loads In IEEE electronic components and technology conference 2007 p 1097 1111 54 Vichare N Rodgers P Eveloy V Pecht M Environment and usage monitoring of electronic products for health assessment and product design Int J Qual Technol Quant Manage 2007 4 2 235 50 55 Skormin VA Gorodetski V
32. fect upon the system's operation with mechanical failures. As a result, this allows inspection criteria to be developed during the design phases [23]. It should be noted that, as with many electrical failures, mechanical failures can be intermittent in nature, occurring only under specific operating conditions. Some of the more common mechanical failures which are of interest and contribute to diagnostic failure, but which receive a lot less attention than the electrical failures, are:

(1) Broken seals and leaks: Leaks from broken seals will affect the operation of items which include engines, gearboxes, control actuators and hydraulic systems. Seals are often designed to weep slightly. This is a good example of the need for maintenance personnel to be familiar with the system, and hence be aware of what constitutes acceptable leakage, in order to avoid unnecessary removals.
(2) Degradation of pneumatic and hydraulic pipes: Degradation within pipes often occurs due to corrosion or fretting against other components or structures. The nature of pneumatic and hydraulic systems is that, under pressure, they may develop small leaks. These minor leaks may result in an alarm to the operator indicating failure, resulting in the unwarranted shut down of the system when no equipment malfunction has actually occurred.
(3) Backlash in mechanical systems: One area where backlash can cause significant concern is within actuation systems, part
33. Contents: 2.3 Software systems; 3 Emerging resolution practices; 3.1 Health and usage monitoring; 3.1.1 Monitoring and reasoning of failure precursors and loads; 3.1.2 Knowledge sharing; 3.2 Test equipment; 3.2.1 Built in tests; 3.2.2 Other methods; 4 Improvements in test abilities; 4.1 Detecting blind spots; 4.1.1 Environmental testing; 4.1.2 Tracking spare parts; 5 Discussion on gaps in literature
34. Authors would like to express their thanks to Casebank Technologies Inc., Copernicus Technology Ltd., FlyBe UK and the RAF for sharing their experience with NFF problems. References 1 Chen J Roberts C Weston P Fault detection and diagnosis for railway track circuits using neuro fuzzy systems Control Eng Pract 2008 5 16 585 96 2 Hockley C Phillips P The impact of no fault found on through life engineering services J Qual Maintenance Eng 2012 18 2 141 53 3 Jeong JS Park SD Failure analysis of video processor defined as no fault found nff reproduction in system level and advanced analysis technique in ic level Microelectr Reliab 2009 49 9 1153 7 4 Pecht M Jaai R A prognostics and health management roadmap for informa tion electronics rich systems Microelectr Reliab 2010 50 3 317 23 5 Soderholm P A system view of the no fault found nff phenomenon Reliab Eng Syst Saf 2007 92 1 1 14 6 James I Lumbard D Willis I Goble J Investigating no fault found in the aerospace industry In Reliability and maintainability symposium 2003 Annual 2003 pp 441 446 7 Challa V Rundle P Pecht M Challenges in the qualification of electronic components and systems IEEE Trans Device Mater Reliab 2013 13 1 26 35 8 Sood B Osterman M Pecht M Tin whisker analysis of toyotas electronic throttle control CircuitWorld 2011 37 4 9 9
35. hin equipment and if there is an intermittent NFF problem then the equipment requires NFF intermittency capable testing equipment.

(2) Integrity testing: Most standard maintenance procedures employ only functional testing, which determines if the equipment is within appropriate tolerances for service. They do not capture the level of damage or degradation within the equipment, information which could be vital for predicting the probability of intermittency or other failure modes. Integrity testing should be incorporated into the maintenance process, and data management techniques should then be developed to provide a diagnostic history and prognostic capability. It is proposed that currently available testing methods should be assessed and developed to provide this integrity assessment capability.
(3) Maintenance manuals: The current standard in troubleshooting guidance is the Fault Isolation Manual. These manuals can be costly to produce and maintain within a dynamic environment, and are often tied to the technical publications cycle, usually meaning several months between updates. Depending on organizational and cultural factors, it might not be effective to put all the troubleshooting knowledge in a paper based or electronic guidance format, and hence a diagnostic reasoning engine might be an effective system to implement [42].
(4)
36. icularly those used for aircraft control surfaces. It is possible that, with excessive wear in actuator couplings, position sensors may indicate incorrect operation, including asymmetric settings, which is difficult to isolate from a maintenance perspective.

2.3. Software systems

It is clear that a great deal of NFF events occur in avionics, electrical and electro-mechanical systems; however, research discussions have also revealed that software, including built in tests (BIT), is also a key contributor to the problem [5,24,25,26]. This includes:

(1) Processing delays.
(2) Discrepancies between software testing procedures.
(3) Timing errors.
(4) Lack of appropriate training.
(5) Perhaps a poorly written program code.

Industry specific standards exist, such as IEC 62278 [27] for railways, while IEC 60812 [28] is often referred to when carrying out Failure Mode and Effects Analysis (FMEA) for software based systems; these can be used to validate software operation and meet specific requirements. However, since standards and guidelines are prepared to be generic, they only briefly consider the handling of any malfunctions caused by software faults and their effects in FMEA [29]. Software components are often delivered with little access to the source code, which only provides a partial view of their internal functionality. With restricted access in these off the shelf
37. ility is a design related characteristic which, if designed well, will provide the capabilities to confidently and efficiently identify existing faults. The number of tests and the information content of test results, along with the location and accessibility of test points, define the testability potential of the equipment. The two attributes which must be met for testability success are:

(1) Confidence: this is achieved by frequently and unambiguously identifying only the failed components or parts, with no removals of good items.
(2) Efficiency: this is achieved by minimizing the resources required to carry out the tests and the overall maintenance action. This includes minimal yet optimized man hours, test equipment and training.

It is evident that the conventional ATE methods used within the maintenance line, as required from the testability design, are not successful [2,5,21,83]. They are perhaps not delivering the necessary levels of confidence and efficiency, or are inappropriate in the many industries which are suffering NFF difficulties. If testability as a design characteristic were successful, NFF would not be so problematic. This is particularly evident when attempting to detect and isolate intermittent faults at the test station. The chance of testing for short duration intermittency at the very moment that it re-occurs using conventional methods is so remote that it will almost certainly result in an NFF. The one major issue with desig
38. integrity is a well known practice [90]. The basic principle of using infrared imaging as an integrity test is that faulty connections and components in an energized, operating circuit will begin to heat up before they fail. A thermoscope scans the devices in the circuit from one end to another; the hotter the target, the more energy it will emit in the infrared portion of the electromagnetic spectrum. For many electrical components, such as resistors and capacitors, the build up of heat will be entirely normal, but for many components the build up of heat, or even lack of heat, will indicate a problem.

4.1.1. Environmental testing

The environmental conditions of a product or system can also be analyzed to assess its on-going health and to provide an advance warning of failure [54,91]. Products often behave differently under varying operational conditions (normal or extreme), which results in fault symptoms manifesting themselves only under those specific conditions. Examples include when temperature fluctuates widely or stress is applied in the form of vibration, conditions which will not normally be present during laboratory testing. Most products will undergo environmental testing to prove their reliability and robustness under the most extreme operating conditions as part of their certification process, but a more subtle set of environmental tests can also be used as part of the maintenance process, which tries to simu
39. ken very seriously by major aerospace manufacturers such as Messier-Dowty, for use in future landing gear health management systems, and by the world's two dominant aircraft manufacturers, Boeing and Airbus. In 2005, Boeing announced that, in order to improve its ability to track and maintain service histories of its parts, it would require many suppliers of high value parts to its new 787 Dreamliner aircraft to place RFID tags on all parts before shipping them to Boeing. Even though RFID tagging is considered an expensive option, Boeing argues that for the additional cost of 15 per tag for a 400,000 primary flight computer, the life cycle information gained would more than justify the additional expenditure to their customers [97]. In early 2012, Boeing Commercial Aviation Services was still awaiting Federal Aviation Administration (FAA) certification for RFID tracking systems intended as a standard component on all new 737, 777 and 787 commercial aircraft, as well as a variety of its military aircraft. Similarly, Airbus is also promoting the adoption of RFID in the aircraft industry and is developing RFID part tracking systems for its new A400M military transport plane as well as for the A380 commercial jet [98].

Footnote 16: Units which have been taken out and sent back for repair multiple times are tagged as rogue units.

5. Discussion on gaps in literature

In the past few decades there has been a great deal of research in order to address the NFF issue
40. ket insertion [3]; also, tin whisker growth, which is much more likely in lead free solder, can cause short circuits [21]. ... damages internal wiring without external evidence. Such wiring faults are extremely difficult to detect and risk the maintenance crew incorrectly rejecting products which are associated with this particular signal path. Wire breaks are common in harnesses and are likely to manifest as a hard fault for a period determined by the vibration and temperature profile. However, in order to correctly isolate the failure in an ambient environment, stressing of the harness may be necessary to simulate the conditions in which the failure occurred. In cases where the fault is intermittent and the exact operating conditions are not known, the failure may not be correctly attributed to the harness, which will lead to the suspicion that the unit is at fault and requires replacing. This is particularly true for those maintainers who operate within the constraints of fast turnaround times.

2.2. Mechanical systems

The failure mechanisms within a mechanical system are widely regarded as having less of an effect upon the rate of NFF occurrences than those which are present within electrical systems. The causes of failure in mechanical systems are similar to those in electrical systems, such as ageing, poor maintenance, and incorrect installation or usage. The difference, however, is that it is much easier to predict the ef
41. late a more normal mode of operation. In effect, when designing for testability (DfT), information gathering exercises can be designed to study system behavior where such variations are present, i.e. Design of Experiments (DoE) [53]. These may provide essential statistical information for planning experiments on process models in order to obtain data that can yield valid and objective conclusions. In any case, there are three main environmental conditions which should be controlled for a good diagnostics test: humidity, vibration and temperature. However, testing standards do not require these environmental factors to be tested together [2]. Each of these will depend on many factors; for example, temperature and humidity will fluctuate with variables such as altitude, time of year and current weather patterns, whilst vibration is dependent upon such things as smoothness of roads or runways, location in the vehicle and the vehicle activity (e.g. a fighter aircraft cruising or in a battle scenario). These three conditions can be simulated with relative ease through the use of market available environmental chambers. White and Richardson [92] provide an overview of the differing types available and the variety of tests which can be carried out in them to investigate NFF issues for aircraft assemblies. In this research paper, the authors also warn that environmental testing is not the definitive solution to identifying all faults. There is also a need to get operati
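Because the three environmental conditions named above are not required to be tested together, a combined test plan has to be constructed deliberately. One simple way to do so, sketched here under stated assumptions, is a full-factorial design in the Design of Experiments sense. The factor names and level values are hypothetical placeholders rather than figures from any testing standard, and `full_factorial` is an illustrative helper.

```python
from itertools import product
from typing import Dict, List, Sequence

# Hypothetical factor levels for a combined environmental test.
LEVELS: Dict[str, Sequence[float]] = {
    "temperature_C": [-40, 20, 70],
    "humidity_pct": [10, 50, 95],
    "vibration_grms": [0.0, 2.0, 6.0],
}


def full_factorial(levels: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Enumerate every combination of factor levels (one dict per test run)."""
    names = list(levels)
    return [
        dict(zip(names, combo))
        for combo in product(*(levels[name] for name in names))
    ]


runs = full_factorial(LEVELS)
print(len(runs))  # 27
print(runs[0])    # {'temperature_C': -40, 'humidity_pct': 10, 'vibration_grms': 0.0}
```

With three levels per factor the plan already needs 27 runs, which is one reason fractional designs are often preferred when chamber time is limited.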
42. liable parts [60]. However, most of the knowledge resides only within the heads of a few key experts or in personalized organizational databases, which are usually consulted only after a problem has resisted several attempts at resolution. Therefore, on-site experience must be blended with other diagnostic and prognostic tools and techniques [42]. The obvious challenges here are:

(1) To store this experience based knowledge and deliver it at the time and place that the same problem symptoms occur, so that it can be re-used to help solve the problem on the first attempt.
(2) To deliver that knowledge in a form that is useful to experts and less experienced technicians alike.
(3) To share this knowledge so that everyone benefits from the experience of others.
(4) To integrate the knowledge access with the existing troubleshooting tools, so that it becomes part of the usual troubleshooting workflow.

Human factors must be considered with respect to troubleshooting performance [61]. A diagnostic reasoning system could hence be useful to provide such information, along with high quality feedback to the design engineers [62]. With the entry of symptoms, the possible failure modes can be identified from the knowledge database and increasingly incisive information can be requested. To the troubleshooter this can act as efficient guidance; to the design engineer this can be an intelligent interview, automatically being applied anytime that the
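The symptom-driven lookup described above can be sketched in a few lines. This is only an illustration of the idea, not the design of any diagnostic reasoning engine cited in the paper: the knowledge base entries, the symptom strings and the choice of Jaccard similarity as the ranking score are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical knowledge base: failure mode -> symptoms recorded in past cases.
KNOWLEDGE_BASE: Dict[str, Set[str]] = {
    "cracked solder joint": {
        "intermittent signal loss", "fault under vibration", "passes bench test",
    },
    "chafed harness wire": {
        "intermittent signal loss", "fault under vibration", "fault at low temperature",
    },
    "BIT threshold too tight": {"BIT alarm", "passes bench test"},
}


def rank_failure_modes(
    observed: Set[str], knowledge_base: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank failure modes by Jaccard similarity to the observed symptoms."""
    scored = []
    for mode, symptoms in knowledge_base.items():
        overlap = len(observed & symptoms)
        if overlap:  # ignore modes sharing no symptom with the observation
            scored.append((mode, overlap / len(observed | symptoms)))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


observed = {"intermittent signal loss", "passes bench test"}
for mode, score in rank_failure_modes(observed, KNOWLEDGE_BASE):
    print(f"{mode}: {score:.2f}")
```

In a fuller system the lowest-ranked candidates would prompt further questions to the troubleshooter, which corresponds to the request for increasingly incisive information mentioned above.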
43. nce proceedings 2002 IEEE Vol 6 2002 p 6 2971 6 2973 vol 6 27 J Xie M Pecht Applications of in situ health monitoring and prognostic sensors In The ninth Pan Pacific microelectronics symposium exhibits and conference 2004 p 1012 28 Commission IE IEC 60812 analysis techniques for system reliability proce dure for failure mode and effects analysis FMEA 2006 29 Wright R Kirkland L Nano scaled electrical sensor devices for integrated circuit diagnostics Vol 6 In IEEE aerospace conference 2003 p 25492555 30 Mariani L Pastore F Pezz M Dynamic analysis for diagnosing integration faults IEEE Trans Software Eng 2011 37 4 486 508 31 Leveson NG Role of software in spacecraft accidents J Spacecraft Rockets 2004 41 4 564 75 32 Brombacher A Hopma E Ittoo A Lu Y Luyk I Maruster L et al Improving product quality and reliability with customer experience data Qual Reliab Eng Int 2012 28 8 873 86 33 Izquierdo LE Ceglarek D Functional process adjustments to reduce no fault found product failures in service caused by in tolerance faults CIRP Ann Manuf Technol 2009 58 1 37 40 34 Meseroll RJ Kirkos CJ Shannon RA Data mining navy flight and maintenance data to affect repair In Autotestcon 2007 IEEE 2007 pp 476 481 35 Jardine AK Lin D Banjevic D A review on machinery diagnostics and prognostics implementing condition based maintenance Mech Syst Signal Process 2006 20 7 1483 510
44. ncies.

(2) RF impedance: Kwon [72] worked on developing an RF impedance method to provide an early indication of interconnect failures. The technique has better sensitivity towards degradation compared to its DC counterpart due to the phenomenon known as the skin effect. The method takes advantage of the surface concentration of high speed signals, which depends on the material characteristics, as they pass through the connection, whilst monitoring the frequency response.
(3) Functional process methodology: In order to eliminate warranty related NFF events, Izquierdo and Ceglarek [33] demonstrated a methodology based on design tolerances that integrates service or warranty data with manufacturing measurements and existing product models.

4. Improvements in test abilities

Testability, as defined by IEC 60706-5 [72], is a quantitative design characteristic which determines the degree to which an item can be tested under stated conditions. As more sophistication is added to electronic systems, maintaining them is becoming ever more difficult and costly. Standard testing using automatic test equipment (ATE) usually includes features such as timing, signal strength, duplicating the operating environment, loading (fanout) and properly interconnecting the unit under test (UUT) [60,79,80,81,82]. The idea of ATE is to force the UUT to fail without actually injecting faults. The ability to do this is directly related to its testability. Testab
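Returning to the RF impedance method in item (2) above, the skin effect it relies on can be quantified with the standard skin depth relation, delta = sqrt(rho / (pi * f * mu)). The sketch below is a textbook calculation rather than anything taken from the cited work; the copper resistivity of 1.68e-8 ohm metres and the chosen frequencies are assumed example values.

```python
import math

MU_0 = 4.0e-7 * math.pi  # permeability of free space, H/m


def skin_depth(
    resistivity_ohm_m: float, frequency_hz: float, relative_permeability: float = 1.0
) -> float:
    """Depth (in metres) at which current density falls to 1/e of its surface value."""
    if resistivity_ohm_m <= 0 or frequency_hz <= 0 or relative_permeability <= 0:
        raise ValueError("all arguments must be positive")
    mu = MU_0 * relative_permeability
    return math.sqrt(resistivity_ohm_m / (math.pi * frequency_hz * mu))


COPPER_RESISTIVITY = 1.68e-8  # ohm metres, assumed room temperature value

for freq in (1e6, 1e9):
    depth_um = skin_depth(COPPER_RESISTIVITY, freq) * 1e6
    print(f"{freq:.0e} Hz: {depth_um:.2f} micrometres")
```

At gigahertz frequencies the current is confined to a layer only a few micrometres thick, so surface cracks that barely change the DC resistance can still alter the RF response.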
45. ng testing, organizational imperatives, operator priorities, technological capabilities, contractual agreements and financial management. This study highlights the fact that the majority of research that has been published lies primarily within aerospace proceedings, such as IEEE publications, and other engineering outlets. Surprisingly, there are no dedicated textbooks on the topic, and the authors strongly feel that the maintenance community would benefit from the publication of one. The authors also advocate that the focus of published material needs to shift from the technical issues towards the business side. This could be used as an opportunity to quantify the costs involved in NFF events and might influence the way contractual agreements are being set up nowadays. Each industry sector approaches NFF differently (i.e. OEM, maintenance suppliers and operators, manufacturer, etc.). When unplanned maintenance regimes are initiated, the costs along the supply chain (warranty, downtime, operational fines) are expected to raise concerns. In either case, researchers and scientists should aim to publish NFF related research in management and business journals to emphasize its importance. This will help to promote knowledge, in addition to overcoming barriers to NFF investment and the lack of a business case caused by the absence of standardized methods or metrics for costing impacts.

6.1. Future perspectives

The core areas where efforts should be focused on
46. nged are being monitored. It is important to recognize at this point that BIT will report failures for the following two reasons:

(1) A specified parameter has exceeded a set threshold value.
(2) The noise of the BIT measurements throws the test results outside of the testing limits when the system under test (SUT) meets required specifications.

The first of these is a direct result of component failure, for example a burnt out resistor. The second occurs when a measured parameter which has noise is measured by an instrument having its own noise; this is common in integrated manufacturing processes, digital system timings and radar systems [78]. One of the areas of concern with these statistical limits is that they may have been set inappropriately, without a true understanding of hardware/software interactions or the nature of the equipment's operating environment. This will inevitably lead to BIT false alarms.

3.2.2. Other methods

Some other techniques which have been proposed include:

(1) DC resistance: Traditionally, these techniques have been utilized to monitor the reliability of electronic components, as they are well suited for identifying electrical continuity. However, these methods do not often provide any early indication of failure or physical degradation, and may not be sensitive enough for future electronics that operate at higher freque
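The second BIT failure reason given above, noise pushing a healthy reading outside the test limits, can be made concrete with a small calculation. The sketch assumes, purely for illustration, that parameter noise and instrument noise are independent and Gaussian, so that their standard deviations combine in quadrature. The function names and the 5 V rail example are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def false_alarm_probability(
    true_value: float,
    lower_limit: float,
    upper_limit: float,
    sigma_parameter: float,
    sigma_instrument: float,
) -> float:
    """Chance that a single BIT reading of a healthy unit lands outside the limits.

    Assumes independent Gaussian noise on the parameter and the instrument.
    """
    sigma = math.hypot(sigma_parameter, sigma_instrument)  # combine in quadrature
    if sigma <= 0:
        raise ValueError("at least one noise source must be non-zero")
    below = normal_cdf((lower_limit - true_value) / sigma)
    above = 1.0 - normal_cdf((upper_limit - true_value) / sigma)
    return below + above


# Hypothetical 5 V rail that is exactly on specification.
p_wide = false_alarm_probability(5.0, 4.75, 5.25, 0.06, 0.08)
p_tight = false_alarm_probability(5.0, 4.90, 5.10, 0.06, 0.08)
print(f"limits of +/-0.25 V: {p_wide:.4f}")
print(f"limits of +/-0.10 V: {p_tight:.4f}")
```

Tightening the limits without accounting for the combined noise raises the false alarm rate sharply, which is the effect of inappropriately set statistical limits described above.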
47. nic product. Microelectr Reliab 2012;52(11):2773–80.
[83] Sheppard JW, Simpson WR. Applying testability analysis for integrated diagnostics. IEEE Des Test Comput 1992;9(3):65–78.
[84] Simpson W, Kelly B, Gilreath A. Predictors of organizational-level testability attributes. Annapolis, Maryland: ARINC Research Corporation; 1986. Publication 1511-02-2-4179.
[85] Kimseng K, Hoit M, Tiwari N, Pecht M. Physics-of-failure assessment of a cruise control module. Microelectr Reliab 1999;39(10):1423–44.
[86] Guanqian D, Jing Q, Guanjun L, Kehong L. A stochastic automaton approach to discriminate intermittent from permanent faults. Proc Inst Mech Eng Part G: J Aerospace Eng.
[87] Abreu R, Zoeteweij P, Golsteijn R, Gemund AJV. A practical evaluation of spectrum-based fault localization. J Syst Software 2009;82(11):1780–92.
[88] Sankaran V, Kalukin AR, Kraft RP. Improvements to x-ray laminography for automated inspection of solder joints. IEEE Trans Compon Packag Manuf Technol Part C 1998;21(2):148–54.
[89] Neubauer C. Intelligent x-ray inspection for quality control of solder joints. IEEE Trans Compon Packag Manuf Technol Part C 1997;20(2):111–20.
[90] Maldague X. Theory and practice of infrared technology for non-destructive testing. Wiley Ser Microwave Opt Eng; 2001.
[91] Deng G, Qiu J, Liu G, Lv K. A novel fault diagnosis approach based on environmental stress level evaluation. Proc Inst Mech Eng Part G: J Aerospace Eng 2013;227(5):816–826.
[92]
48. ning component testability is that the focus is on the functionality and integrity of the system [46]. Other difficulties with testability are that, in most cases, there is a complete lack of information regarding standardized tools for the evaluation of Design for Testability (DfT). For testability to be consistent within the design process, the definitions, procedures and tools needed to achieve it must be developed. A testability evaluation should not only provide predictions but also redesign information when testability attributes are predicted to be below the acceptable levels. There are three testability attributes which can be identified [84]:
(1) Fraction of faults detected (FFD): Ideally this should be 100%. Any fault not detected by either the BIT/BITE or ATE can result in total loss of system integrity and hence functionality. In reality, some faults (not safety/mission critical) can be tolerated, and so an FFD of less than 100% may be acceptable when designing for testability.
(2) Fraction of faults isolated (FFI): If a detected failure is not isolated quickly and efficiently, with high confidence, then the system may end up being kept out of operation for significant periods of time. This leads to pressure on maintenance personnel, who are then likely to adopt the shotgun approach of speculative LRU replacements, adding pressure and complications to the sparing and logistics processes and increasing life cycle costs. Appropriate measure
49. nt ageing rather than the fundamental design issue with the interface. Another major contributor to solder joint damage is thermal stress related to heat expansion, shock and vibration. During operation, these stresses cause metal-to-metal interconnects to rub against each other, damaging any protective coating. Such effects accumulate over time and will typically last for periods of less than hundreds of nanoseconds. Such manifestations fracture the solder contacts and instigate intermittent faults. Electrical intermittency is also caused by contact fretting [15,20]. Fretting corrosion occurs particularly in tin-plated contacts, as a degradation mechanism caused by the presence of humidity, which oxidizes the metal-to-metal interface. The accumulation of oxides at the contacts causes an increase in resistance and electrical intermittency due to the repetitive sliding movements. Other root causes of NFF events in electronics include creep corrosion and the phenomenon known as tin whiskers [14]. Creep corrosion is a mass transport process in which solid corrosion products migrate over a surface on integrated circuit (IC) packages and eventually result in electrical shorts or signal deterioration due to the bridging of corrosion products between isolated leads. Depending on the nature of the corrosion product (conductive or semi-conductive, dry or wet), the insulation resistance can vary, thus potentially causing intermittent loss of signal integrity. A pure tin
50. ology improvements, such as the use of RFID technology, have been adopted to track units within the supply chain and to monitor the complete service history of items while they are in the supply chain. Such technology solutions will go some way to mitigating NFF, but what is needed is a comprehensive approach dealing with organizational, procedural and behavioral issues as well as all the technical issues. The ability to map an NFF event from the initial reported failure through the entire maintenance process would provide invaluable information, identifying the critical operations and procedures which are failing. From the literature research within this paper, it is possible to identify the following core gaps in NFF-related research:
(1) The problem of intermittency: It is clear that intermittent fault occurrences are a major technical root cause of NFF and that there is a clear lack of fundamental understanding of intermittency in electronics. There is also clear evidence to suggest that the technology currently in use for detecting and locating the source of the intermittency is inadequate. If NFF becomes worse over time despite improved management processes, then the cause is likely to be inadequate equipment for testing electrical intermittence. In this case, there needs to be a change in the way an electronic device or wiring harness is tested in order to solve the problem. The nature of the NFF needs to be understood and tracked wit
51. onal information, which includes field data, maintenance history and failure probabilities, to determine if the failure in the unit is real, if it is in a different unit, or if it is even a false alarm. However, gaining this information can be tricky and would require additional work on the part of pilots or operators in recording the events which led to the failure signal, along with changes to procedural practices in maintenance record keeping or retrieval. An often overlooked area when considering an environmental test is the orientation of the UUT when embedded within its operating platform. The orientation can mean that different components are more affected by vibration than if the UUT were in a different position, and so the orientation of the UUT should be a consideration in environmental testing.
4.1.2 Tracking spare parts
The ability to recognize rogue units is of paramount importance in mitigating the effects of NFF events and in ensuring operating safety, particularly in the case of an aircraft. The key to distinguishing a rogue unit is to implement the necessary procedures to track rogue units by serial number, showing the date installed and removed, the platform on which the unit was installed, the number of operating hours/cycles, the number of hours since its last overhaul, and a solid reason for the generated removal codes. In addition to this, the hi
52. or voltage variation, and seems to have become an attractive tool for detection of the intermittency [73,74]. Conducting the intermittency test simultaneously provides an increase in the probability of detection combined with a reduction in the time taken to complete the test; because the testing is performed for multiple points rather than one line at a time, this is potentially an effective test methodology. It has been used on the F-16 AN/APG-68 radar system Modular Low Power Radio Frequency (MLPRF) unit, where 36 million dollars' worth of assets previously deemed unrepairable have been returned as serviceable. The equipment has also shown considerable promise in the UK military on the Tornado and Sentinel aircraft fleets [2]. Other similar work on intermittent fault detection has been done by Muja and Lamper [75] and Smith et al. [76].
3.2.1 Built-in test
As electronic equipment evolves into ever more complex systems, it increasingly depends upon BIT to provide in situ fault detection and isolation capabilities, particularly in low-volume electronic systems in the military, aerospace and automotive sectors. BIT is a coherent assortment of on-board hardware/software elements enabling a diagnostic means to identify and locate faults, as well as error checking. Its importance has therefore increased with system complexity, as it enables equipment maintainability through better testability IEC 60706-5 58
53. orced onto those systems. The ability to automate fault diagnosis with advanced technologies and techniques could be used to accurately predict the downtime and hence the operational availability. In fact, the role of diagnosability analysis in modern systems, considering their complexities and functional interdependencies, becomes significant, as its improvement can lead to a reduction of a system's life cycle costs [38]. However, it should be noted that such setups are only worthwhile if the benefits significantly outweigh the costs of their introduction and upkeep.
[Footnote 3] There are other maintenance programmes that do not consider diagnostics or prognostics, e.g. time-based preventive maintenance, where replacement of parts is performed after a predetermined time interval, measured by a relevant time measure (e.g. hours, cycles or tonnages), independent of the condition.
There are design constraints often involved with improving maintainability, particularly in the airline industry when dealing with legacy aircraft. The more general issues include [39]:
(1) Any technological enhancements must work within existing architectures.
(2) The information available from lower test levels is typically predefined and costly to improve or change.
(3) Hardware development can be costly and outweigh potential cost-saving benefits.
(4) There may be limited space for addi
54. rence proceedings on the topic. The aim is to provide a general picture of the research areas undertaken in the past few decades and to create a database of the academic literature of journal publications on NFF concepts and their applications from 1990 to 2013, by classification and statistical analysis. It is evident that the NFF phenomenon has gained the most attention in the last decade. This is possibly due to increasing system complexities, reliability requirements and cost implications. The article reported various occurrences and root causes that have resulted in NFF events. Current industrial practices were discussed, whilst highlighting the importance of capturing and sharing as much information as possible to support rapid diagnostics and troubleshooting workflow. Furthermore, emphasis was placed on the importance of having feedback mechanisms to transfer maintenance event information to design engineers, who can use that information to determine how best to employ various diagnostic technologies (e.g. BIT, diagnostic reasoning, ATE, etc.) to detect failures in the future. It seems that the role of more specific standards focusing solely upon NFF mitigation might become much more prominent, as they can promote best-practice approaches within maintenance sectors. However, solutions will not reside only within different maintenance echelons, but should also focus on a much broader scope, considering factors such as design, manufacturi
55. [Contents, continued] 6 Concluding remarks; 6.1 Future perspectives; Acknowledgements; References
Corresponding author. Tel.: 441234 75 0111. E-mail address: samir khan theiet org (S. Khan).
0951-8320/see front matter. 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.ress.2013.10.013
1 Introduction
Part 1 extensively discussed the organizational complexities and challenges faced by businesses today in attempts to administer solutions to the problems caused by unidentified failures. It also described in detail the method applied for the collection and analysis of the referenced literature. This was included not only to judge the validity of these papers, but also to present a statistical analysis of the academic journal publications on NFF concepts over the period 1990–2013. In addition, the authors categorized the literature into four main areas (fault diagnostics, system design, human factors and data management), where it was noted that f
56. s new knowledge and gained experiences. This importance of continuous improvement is also emphasized by related standards such as IEC 60300-3-14 [53] and EN 50126 [27] or IEC 62278 [52]. It should be highlighted that FMEA directly contributes to the development of effective maintenance procedures (e.g. RCM and MSG-3 in the aircraft industry incorporate FMEA as the primary component of analysis), as well as to the identification of troubleshooting activities, maintenance manual development and the design of effective built-in test requirements. When the equipment enters service, the Practical World imposes itself, as shown in Fig. 1: some faults that were anticipated will actually happen, but some never do. When a fraction of the theoretically possible failure modes occur, the weaknesses in a piece of equipment become evident during operation. It can then be extrapolated that equipment which fails on one aircraft is more likely to fail on other aircraft of the same design operated in similar conditions. Most importantly, however, many real-world faults are not anticipated by the design engineers, and therefore the traditional diagnostic systems do not resolve them. In those cases, human ingenuity may resolve the problem, but where does that knowledge reside after its creation? Some of the knowledge can make its way back into troubleshooting manual updates [36,59], and some may be fed back to engineering to modify designs for much more re
57. s of FFI include mean time to fault isolation (MTFI), mean time to repair (MTTR) and rates of NFF.
(3) Fraction of false alarms (FFA) or rate of false alarm (RFA): This is a measure of the rate at which detected faults turn out to be false alarms upon investigation. It is computed as a time-normalized sum of false alarms, where the normalization is either calendar time or operating hours. A high FFA will also lead to maintenance pressures and the shotgun effect.
[Footnote 13] There are design techniques that are added to obtain certain testability features during hardware product design. The premise of the features is that they can make it easier to develop and apply manufacturing tests and to validate that the product hardware contains no defects that could otherwise adversely affect the product's correct functioning, e.g. boundary scanning.
[Footnote 14] I.e. the maintainer is left to troubleshoot the system using their best guess, which will often result in the replacement and removal of modules that are perfectly good.
4.1 Detecting blind spots
When it is suspected that NFF occurs due to a lack of fault coverage by the ATE or BITE, there comes the requirement to use additional tools which are capable of identifying the root cause of the problem. Ungar and Kirkland [79] argue that, to achieve this, an understanding of the Physics of Failures (PoF) within the oper
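The three testability attributes (FFD, FFI and FFA/RFA) reduce to simple ratios once fault counts are available. The sketch below is illustrative only: the paper gives no formulas beyond describing the false alarm measure as a time-normalized sum, so the reading of FFI as "isolated faults divided by detected faults" is an assumption, and the class name, field names and figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FaultCounts:
    """Hypothetical tallies for one system over an observation period."""
    total_faults: int        # all genuine faults that occurred
    detected_faults: int     # faults flagged by BIT/BITE or ATE
    isolated_faults: int     # detected faults traced to the correct unit
    false_alarms: int        # fault reports with no fault found on investigation
    operating_hours: float   # normalization base for the false alarm rate


def fraction_of_faults_detected(c: FaultCounts) -> float:
    """FFD: share of all genuine faults that the test systems detected."""
    return c.detected_faults / c.total_faults


def fraction_of_faults_isolated(c: FaultCounts) -> float:
    """FFI (assumed definition): share of detected faults correctly isolated."""
    return c.isolated_faults / c.detected_faults


def rate_of_false_alarms(c: FaultCounts) -> float:
    """RFA: false alarms normalized by operating hours."""
    return c.false_alarms / c.operating_hours


counts = FaultCounts(total_faults=100, detected_faults=90,
                     isolated_faults=72, false_alarms=5, operating_hours=1000.0)

ffd = fraction_of_faults_detected(counts)
ffi = fraction_of_faults_isolated(counts)
rfa = rate_of_false_alarms(counts)

print(f"FFD = {ffd:.2f}, FFI = {ffi:.2f}, RFA = {rfa:.3f} per operating hour")
```

Keeping the raw counts separate from the derived ratios makes it straightforward to switch the normalization base from operating hours to calendar time, the two options the text mentions.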
58. se failure modes appear. When completing the troubleshooting, the maintainers can automatically report on the failure mode and record detailed differentiating symptoms. This information can also be of great importance for a Failure Reporting, Analysis and Corrective Action System (FRACAS) procedure, providing valuable insights to engineers [42,64].
3.2 Test equipment
Automatic test equipment (ATE) is widely used to perform device functional and parametric tests at the back end of the semiconductor manufacturing process [9]. It is a capital-intensive system and typically costs 1–3 M, depending on the equipment performance. An unscheduled equipment downtime lasting one hour could cause a significant production loss. Reflectometry has commonly been used to determine the integrity of cables and wiring, with effective localization of intermittent faults such as open or short circuits. These methods send a high-frequency signal down the line, which reflects back at impedance discontinuities. The location of the fault is determined by the phase shift between the incident and reflected signals. Sharma et al. [65] demonstrate a novel architecture for implementing a sequence time domain reflectometry (STDR) method, which uses a pseudonoise code to locate open and short circuits on active wires using an integrated CMOS sensor. The approach has a fault localization accuracy of 1 ft, with low power consumption for the
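The reflectometry principle described above can be sketched numerically. The example below is a simplified illustration and not the STDR method of Sharma et al.: it expresses the shift between incident and reflected signals as a round-trip time delay rather than a phase shift, assumes a lossless line with a known velocity factor, and uses the standard transmission-line reflection coefficient to tell open circuits from short circuits. The function names and the cable parameters are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def fault_distance_m(round_trip_delay_s, velocity_factor):
    """Distance to an impedance discontinuity from the round-trip delay.

    The signal travels to the fault and back, hence the division by two.
    velocity_factor is the propagation speed in the cable as a fraction of c.
    """
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in (0, 1]")
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_delay_s / 2.0


def reflection_coefficient(load_ohms, line_ohms):
    """Reflection coefficient (Z_load - Z_line) / (Z_load + Z_line).

    +1 indicates an open circuit, -1 a short circuit, 0 a matched line.
    """
    if math.isinf(load_ohms):
        return 1.0  # limit of the ratio for an open circuit
    return (load_ohms - line_ohms) / (load_ohms + line_ohms)


# Hypothetical measurement: reflection seen 100 ns after launch on a cable
# with a velocity factor of 0.66.
distance = fault_distance_m(100e-9, 0.66)

gamma_open = reflection_coefficient(math.inf, 50.0)
gamma_short = reflection_coefficient(0.0, 50.0)
gamma_matched = reflection_coefficient(50.0, 50.0)

print(f"fault at about {distance:.2f} m; open {gamma_open:+.0f}, short {gamma_short:+.0f}")
```

The sign of the reflection distinguishes an open from a short circuit, while the delay gives the location; an intermittent fault would only produce a reflection while the discontinuity is actually present, which is why continuous monitoring of active wires matters.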
59. sified as an NFF event, as these external influences might be faulty sensors or actuators, or possibly an incorrect fault isolation activity. In any case, as the device fabrication process continues to improve, failure rates of hardware components have steadily declined over the years, to the point where non-hardware failures have emerged as a dominant issue [9], whereas the reduction of troubleshooting complexities and of the time to fix a problem seem to be the most important aspects when investigating failures of electronic systems. In addition to the a priori discussions from Part I, this paper focuses on the following:
(1) No fault found occurrences in systems.
(2) Emerging resolution practices.
(3) Improvements in test abilities.
(4) Discussion on gaps in literature.
(5) Future research directions.
The remainder of the paper is structured as follows: after identifying the common root causes for NFF in system components, the paper briefly surveys some industry-specific innovations that have been introduced in order to capture troubleshooting data. Section 4 discusses improvements in test capabilities, followed by a discussion on the identified gaps in NFF literature. Finally, concluding remarks and future directions for research into
[Footnote 1] although there are specific approaches, such as robust design [8], that can be used to design quality into products and processes by minimizing the effects of the causes of variation without eliminating the cause
60. story of the operating platform, be that a wind turbine, aircraft or train, needs to be recorded with an easy-to-use retrieval system [2]. The importance of such historical data is to aid in determining the exact effects the failure has on the overall system and whether the replacement of the unit offers a high level of confidence of rectifying the problem. Some airlines in the UK operate within a spare parts pool where the policy is that, if a unit is returned to the pool labeled NFF more than three times, then that unit will be scrapped. This has the advantage that the spare parts pool will become less polluted with rogue units. However, this only encourages the culture of accepting NFF and not searching out the root cause, which may be a fundamental manufacturing flaw present in equivalent units, such as a batch of faulty capacitors used in the unit's production. Likewise, it could be a system design flaw leading to integration faults. Either way, scrapping units in this way will inevitably lead to an increase in costs [5]. Other airlines routinely tag and track units that come back with similar reported failure symptoms multiple times. These tagged units are then subjected to special testing that is not usually required, such as thermal shock and environmental tests. Units tagged as rogue are also tracked by the tail number of the aircraft from which they came. Technicians then monitor and track repetitive serial numbers
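The tagging practice described above, counting NFF returns per serial number and noting which aircraft each unit came from, can be sketched as follows. This is a hypothetical illustration rather than any airline's actual system: the record fields, the serial and tail numbers and the function names are invented, and the threshold of three simply mirrors the pool policy quoted in the text. The sketch flags units for investigation; it does not decide whether they should be scrapped.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Removal:
    """One removal record (hypothetical fields)."""
    serial: str        # unit serial number
    tail_number: str   # aircraft the unit was removed from
    symptom: str       # reported failure symptom
    finding: str       # "NFF" or a description of the confirmed fault


def nff_counts(removals):
    """Number of NFF findings recorded against each serial number."""
    return Counter(r.serial for r in removals if r.finding == "NFF")


def rogue_candidates(removals, max_nff_returns=3):
    """Serial numbers returned as NFF more than max_nff_returns times."""
    counts = nff_counts(removals)
    return sorted(s for s, n in counts.items() if n > max_nff_returns)


def tails_by_serial(removals):
    """Set of tail numbers each serial number has been removed from."""
    tails = defaultdict(set)
    for r in removals:
        tails[r.serial].add(r.tail_number)
    return tails


removals = [
    Removal("SN-001", "G-AAAA", "intermittent display", "NFF"),
    Removal("SN-001", "G-BBBB", "intermittent display", "NFF"),
    Removal("SN-001", "G-AAAA", "intermittent display", "NFF"),
    Removal("SN-001", "G-BBBB", "display blank", "NFF"),
    Removal("SN-002", "G-AAAA", "no output", "NFF"),
    Removal("SN-002", "G-AAAA", "no output", "failed capacitor"),
]

candidates = rogue_candidates(removals)
tails = tails_by_serial(removals)

print(candidates, sorted(tails["SN-001"]))
```

A unit that keeps failing on several different aircraft points towards the unit itself, whereas repeated removals of different units from the same tail number point towards the aircraft, which is the distinction the tracking is meant to support.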
61. t to improve upon this problem within organizations:
(1) Reliability: If all components were 100% reliable, i.e. they never resulted in a system failure, then there would be no unplanned maintenance activities. Design engineers often engage in reliability improvements based largely on feedback from equipment in service. However, to the extent that engineers anticipate failures, designers will incorporate fault detection systems (notably BIT) and prognostic strategies to keep track.
(2) BIT: If BITs were 100% comprehensive and unambiguous at the aircraft level, including interacting systems [34], then they would:
(i) Detect every possible problem.
(ii) Point with certainty to the defective part, and only where the problem was caused by a defective part (as opposed to operator mishandling, environmental circumstances, etc.).
But, to the extent that BIT is lacking, troubleshooting is required.
(3) Troubleshooting: In theory, if fault isolation manuals (FIM) or troubleshooting guides were perfect, then every failure that
[Footnote 4] FMEA (Failure Mode and Effects Analysis) is recognized as one of the most effective methods to identify and remove critical reliability issues. This procedure is commonly used to influence the system design before it is commissioned, enumerating potential failure modes that may occur during operation. These analyses are performed proactively to assess the impact of various failure modes during the product development and maintenance stages 14
62. th Monitoring (PHM) strategies to detect impending functional failures.
[Fig. 1. Troubleshooting: anticipated vs. actual faults. The Design World (design engineers anticipating what will fail and preparing for it): Failure Modes and Effects Analysis, Built-In Test, Prognostic and Health Monitoring, Functional Independence Measurement, User Manual. On-site feedback to design. The Practical World (operators and maintainers experiencing what actually fails and recognizing it): Failure Reporting, Analysis and Corrective Action System.]
In addition, this can also prepare troubleshooting procedures in advance for analyzing the functionality of the system, in order to differentiate among the many possible root causes of these anticipated failures. Procedures are contained in troubleshooting manuals or guides, which require human involvement to execute the tests and evaluate the results. As good as they are, these systems are often far from perfect, nor should they be expected to be, given the necessary practical cost/performance tradeoffs [5,57]. Furthermore, existing RCM standards such as IEC 60812 [29] (FMEA), IEC 60300-3-11 [42] and SAE JA1012 [43], and experts related to FMEA (Moubray [41], Stamatis [58]), emphasize the importance of continuously updating the FMEA and making sure that it is a living document that reflect
63. th status and additional prognostic information can be evaluated, and unexpected failures could be avoided. A summary of potential failure precursors for electronics is defined by Born and Boenning [49]. The life cycle environment of a product consists of manufacturing, storage, handling, operating and non-operating conditions, which may lead to physical or performance degradation of the product and reduce its service life. Suppliers and operators, particularly within the airline industry, spend significant resources attempting to determine the root causes of NFF events, but without any measured field conditions a root cause analysis can struggle to capture the necessary information. This poses an even more significant challenge, one that requires additional specific sensing equipment and data loggers. Burns et al. [50] demonstrate the development, and the laboratory and in-flight testing, of such specific equipment for monitoring the environment of an aircraft avionic power system. The equipment, termed the Aircraft Environment Monitor Power Quality (AEM-PQ), allows over two years of continuous data measurements to be collected for evaluation of the quality of power systems under different operational scenarios. The hardware and the data gathered are a prime example of the information-gathering abilities which are required to evaluate
[Footnote 6] Reliability Centered Maintenance (RCM) is a structured approach to ensure that assets continue to do what their users require in their
64. tional processing capabilities to support improved diagnostics.
However, the authors would like to emphasize that, if there is no safety-related or operational consequence of the failure, then corrective maintenance is probably the most effective maintenance approach to adopt. The choice of an appropriate strategy for failure management is guided by methodologies such as Reliability Centered Maintenance (RCM) [42,43] for military aviation and other applications, or Maintenance Steering Group-3 (MSG-3) [46] for civil aviation.
3.1.1 Monitoring and reasoning of failure precursors and loads
The basis of health monitoring is built upon the premise that there exist precursor indications of failure, in the form of some change in a measurable parameter or signal of the system, which can be correlated with a subsequent failure mode [9,47]. Using this causal relationship, it is assumed that failures can then be predicted with the correct approaches to reasoning. The first step in health monitoring is to select the life cycle parameters to be monitored. This can be done systematically through a Failure Mode, Effects and Criticality Analysis (FMECA). For example, a measurable parameter which can provide an indication of impending failure (a failure precursor) for cables and connectors can include impedance changes, physical damage or a high-energy dielectric breakdown. By monitoring changes in these precursors, a system's heal
65. underlying cause will increase the severity of the intermittency until eventually a hard fault appears and the functionality of the system is compromised or lost.
2.1.1 Printed circuit board interconnectors
Information published by Gibson et al. [19] claims that between 50% and 70% of all electronic device failures could be attributed to interconnectors. Even though solder joints can fail by a variety of mechanisms, the device interface seems to be the most common cause. Over time, contaminations on the fractured surfaces initiate a failure sequence which starts with degraded joints and eventually progresses to intermittent failures. Products that depend upon the behavior of interfacing devices for correct operation are also susceptible to faults which can be categorized as intermittent. This is common in products that rely on software for their correct operation or interaction with other products. In these cases, they may exhibit periodic failures due to inherent incompatibilities between the system interfaces; symptoms may include relative timing errors and synchronization issues. The systems may not show any evidence of failure for many years of service, but as the system interfaces become affected by wear and drift, failures become evident. This can result in a root cause misclassification, with the root cause being diagnosed as compone
66. using specialized tools to help determine if the unit is a repetitive problem or if the problem is fundamentally an issue with the aircraft [93]. In the case of airlines which are contracted into a spare parts pool utilized by several airlines, the lack of tracking (by design) of units suspected of being rogue means that an airline has no information regarding any unit that it takes from the pool. Advanced tracking methods based upon RFID tracking for predictive maintenance have begun to gain popularity, particularly in the aircraft industry [94]. In the repair process, multiple operations are conducted to repair a complex engineered machine such as an engine, including dismantling, inspection, repairing, maintenance and re-assembling. Tracking and tracing the status of these processes and operations provides critical information for decision making. This tracking and tracing is often performed manually, but the adoption of RFID as an automatic identification technology has the potential to speed up processes, reduce recording errors and provide critical part history [95]. The use of RFID technology to track units within a spare parts pool, providing full service histories to the current user [96], has also provided the ability to reduce the number of NFF events, identifying rogue units in the spare parts pool and reducing costs attributed to phantom supply chains. The use of RFID technology over recent years has begun to be ta
