
Post-print version - Curve

1. 3.2.1. Built-In Test
As electronic equipment evolves into ever more complex systems, it increasingly depends upon BIT to provide in-situ fault detection and isolation capabilities, particularly in low-volume electronic systems in the military, aerospace and automotive sectors. BIT is a coherent assortment of on-board hardware and software elements enabling a diagnostic means to identify and locate faults, as well as error checking. Its importance has therefore increased with system complexity, as it enables equipment maintainability through better testability (IEC 60706-5) [58]. In accordance with ARINC 672 [77], diagnostic testing should consider multiple-level tests, e.g. during operation and at different maintenance echelons. Historically, BIT was designed and used primarily for in-field maintenance by the end user, but it is now used in ever more diverse applications, which include oceanographic systems, multi-chip modules, large-scale integrated circuits, power supply systems, avionics and also passenger entertainment systems for the Boeing 767 and 777 [72]. BIT is used to indicate system status, providing valuable information to locate the exact system components that need to be replaced and to indicate whether or not a system has been assembled correctly. Failures reported by BIT tests can be costly and are likely to result in unit replacements, recertification or inevitable loss of avai
2. fault isolation (faulty component is identified) and fault identification (the nature of the fault is determined). Prognosis is a prior-event analysis and deals with failure prediction before faults occur, making use of in-situ sensors and physics-of-failure models [27]. If it is possible to assess in situ the extent of degradation of electronic systems, then such data would be invaluable in meeting the objective of providing efficient fault detection and identification. This would include evidence of failed equipment found to function correctly when tagged as NFF, and hence improve maintenance processes, extend life, reduce whole-life costs and improve future designs.
There is currently a drive in the majority of industries to turn away from the more traditional preventive and reactive maintenance actions described above in favor of more predictive and proactive solutions [21]. Condition Based Maintenance (CBM) is often regarded as the most advanced predictive maintenance strategy and hence could be aimed at reducing the number of machinery breakdowns by fault detection at an early, incipient stage [5, 10, 36]. CBM makes use of measurements of physical parameters while monitoring the trends over time; any indication of abnormal behavior will trigger a warning. In its simplest form, threshold warning levels are constructed to trigger maintenance activities when a specific parameter shows measurements outside of the threshold regi
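The simplest form of CBM described above, a threshold warning combined with trend monitoring, can be sketched as follows. This is only an illustrative sketch: the function names, the vibration readings and the limit values are assumptions made for the example and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def threshold_warnings(
    readings: Sequence[float], lower: float, upper: float
) -> List[Tuple[int, float]]:
    """Return (sample index, value) for every reading outside [lower, upper]."""
    return [(i, x) for i, x in enumerate(readings) if x < lower or x > upper]


def trend_per_sample(readings: Sequence[float]) -> float:
    """Least-squares slope of the readings against their sample index."""
    n = len(readings)
    if n < 2:
        return 0.0
    t_mean = (n - 1) / 2.0
    x_mean = sum(readings) / n
    numerator = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(readings))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


# Hypothetical vibration readings (mm/s) with an assumed warning band.
vibration_mm_s = [2.1, 2.2, 2.4, 2.9, 3.6, 4.8]
print(threshold_warnings(vibration_mm_s, lower=0.0, upper=4.5))
print(trend_per_sample(vibration_mm_s))
```

A rising slope that has not yet crossed the limit is the kind of early, incipient indication that CBM aims to act upon before a breakdown occurs.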
3. tion in the vehicle and the vehicle activity, i.e. a fighter aircraft cruising or in a battle scenario. These three conditions can be simulated with relative ease through the use of market-available environmental chambers. White and Richardson (2011) [92] provide an overview of the differing types available and the variety of tests which can be carried out in them to investigate the occurrence of NFF issues for aircraft assemblies. In this research paper the authors also warn that environmental testing is not the definitive solution to identifying all faults. There is also a need to obtain operational information, which includes field data, maintenance history and failure probabilities, to determine if the failure in the unit is real, or if it is in a different unit, or even a false alarm. However, gaining this information can be difficult and would require additional work on behalf of pilots or operators in recording the events which led to the failure signal, along with changes to procedural practices in maintenance record keeping or retrieval. An often overlooked area when considering an environmental test is the orientation of the UUT when embedded within its operating platform. The orientation can mean that differing components are more affected by vibration than if the UUT was in a different position, and so the orientation of the UUT should be a consideration when undergoing environmental testing.
4.1.2. Tracking Spare Parts
The ability
4. in order to provide fault models and models that deal with false BIT alarms and the root causes of BIT deficiency. In some industries and individual companies, adopting better prognostics has ensured that important operational parameters are monitored at all times to identify adverse and out-of-limits variations. These technologies have helped to introduce a change from a policy of reactive maintenance to a predictive policy which would concentrate on providing vital information on the root causes of failures, which is not provided with traditional BIT/BITE. Other technology improvements, such as RFID technology, have been adopted to track units within the supply chain and to monitor the complete service history of items while they are in the supply chain. Such technology solutions will go some way to mitigating NFF, but what is needed is a comprehensive approach dealing with organizational, procedural and behavioral issues as well as all the technical issues. The ability to map an NFF event from the initial reported failure through the entire maintenance process would provide invaluable information, identifying the critical operations and procedures which are failing. From the literature research within this paper, it is possible to identify the following core gaps in NFF failure-related research:
1. The Problem of Intermittency: It is clear that intermittent fault occurrences are a major technical root cause of NFF and tha
5. normal or extreme, which result in fault symptoms manifesting themselves only under those specific conditions. Examples include when temperature fluctuates widely or stress is applied in the form of vibration, conditions which will not normally be present during laboratory testing. Most products will undergo environmental testing to prove their reliability and robustness under the most extreme operating conditions as part of their certification process, but a more subtle set of environmental tests can also be used as part of the maintenance process which tries to simulate a more normal mode of operation. In effect, when designing for testability (DfT), information-gathering exercises can be designed to study system behavior where such variations are present, i.e. Design of Experiments (DoE) [53]. These may provide essential statistical information for planning experiments on process models in order to obtain data that can yield valid and objective conclusions. In any case, there are three main environmental conditions which should be controlled for a good diagnostics test: humidity, vibration and temperature. However, testing standards do not require these environmental factors to be tested together [2]. Each of these will depend on many factors; for example, temperature and humidity will fluctuate with variables such as altitude, time of year and current weather patterns, whilst vibration is dependent upon such things as smoothness of roads or runways, loca
6. 1012
I. E. Commission, IEC 60812: Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA), 2006.
R. Wright, L. Kirkland, Nano-scaled electrical sensor devices for integrated circuit diagnostics, Vol. 6, IEEE Aerospace Conference, 2003, pp. 2549-2555.
L. Mariani, F. Pastore, M. Pezzè, Dynamic analysis for diagnosing integration faults, Software Engineering, IEEE Transactions on 37 (4) (2011) 486-508.
N. G. Leveson, Role of software in spacecraft accidents, Journal of Spacecraft and Rockets 41 (4) (2004) 564-575.
A. Brombacher, E. Hopma, A. Ittoo, Y. Lu, I. Luyk, L. Maruster, J. Ribeiro, T. Weijters, H. Wortmann, Improving product quality and reliability with customer experience data, Quality and Reliability Engineering International 28 (8) (2012) 873-886.
L. E. Izquierdo, D. Ceglarek, Functional process adjustments to reduce no-fault-found product failures in service caused by in-tolerance faults, CIRP Annals - Manufacturing Technology 58 (1) (2009) 37-40.
R. J. Meseroll, C. J. Kirkos, R. A. Shannon, Data mining navy flight and maintenance data to affect repair, in: Autotestcon, 2007 IEEE, 2007, pp. 476-481.
A. K. Jardine, D. Lin, D. Banjevic, A review on machinery diagnostics and prognostics implementing condition-based maintenance, Mechanical Systems and Signal Processing 20 (7) (2006) 1483-1510.
R. M. Knotts, Civil aircraft maintenance and support: fault diagnosis from a business perspec
7. 1985 503-530
P. D'Eon, M. Langley, A. Atamer, Case-based reasoning system and method having fault isolation manual trigger cases, U.S. patent application 11/734,862 (2007).
R. C. Millar, T. Mazzuchi, S. Sarkani, Application of non-parametric statistical methods to reliability database analysis, SAE Technical Papers.
A. Atamer, Comparison of FMEA and field experience for a turbofan engine with application to case-based reasoning, in: IEEE Aerospace Conference Proceedings, Vol. 5, 2004, pp. 3354-3360.
C. R. Sharma, C. Furse, R. R. Harrison, Low-power STDR CMOS sensor for locating faults in aging aircraft wiring, Sensors Journal, IEEE 7 (1) (2007) 43-50.
C. Lo, C. Furse, Noise-domain reflectometry for locating wiring faults, Electromagnetic Compatibility, IEEE Transactions on 47 (1) (2005) 97-104.
Y. C. Chung, C. Furse, J. Pruitt, Application of phase detection frequency domain reflectometry for locating faults in an F-18 flight control harness, Electromagnetic Compatibility, IEEE Transactions on 47 (2) (2005) 327-334.
C. Furse, Y. C. Chung, C. Lo, P. Pendayala, A critical comparison of reflectometry methods for location of wiring faults, Smart Structures and Systems 2 (1) (2006) 25-46.
C. R. Parkey, C. Hughes, M. Caulfield, M. P. Masquelier, A method of combining intermittent arc fault technologies, in: AUTOTESTCON Proceedings, 2012, pp. 244-249.
P. A. Smith, D. V. Campbell, A prac
8. A system view of the no fault found (NFF) phenomenon, Reliability Engineering & System Safety 92 (1) (2007) 1-14.
I. James, D. Lumbard, I. Willis, J. Goble, Investigating no fault found in the aerospace industry, in: Reliability and Maintainability Symposium, 2003. Annual, 2003, pp. 441-446.
V. Challa, P. Rundle, M. Pecht, Challenges in the qualification of electronic components and systems, Device and Materials Reliability, IEEE Transactions on 13 (1) (2013) 26-35.
B. Sood, M. Osterman, M. Pecht, Tin whisker analysis of Toyota's electronic throttle control, in: Circuit World, Vol. 37, 2011, pp. 4-9.
T. Jin, B. Janamanchi, Q. Feng, Reliability deployment in distributed manufacturing chains via closed-loop six sigma methodology, International Journal of Production Economics 130 (1) (2011) 96-103.
N. M. Vichare, M. G. Pecht, Prognostics and health management of electronics, Components and Packaging Technologies, IEEE Transactions on 29 (1) (2006) 222-229.
J. K. Line, G. Krishnan, Managing and predicting intermittent failures within long life electronics, in: Aerospace Conference, 2008 IEEE, 2008, pp. 1-6.
D. A. Thomas, K. Ayers, M. Pecht, The trouble not identified phenomenon in automotive electronics, Microelectronics Reliability 42 (4) (2002) 641-651.
J. H. Renner, Reliability engineering: an integrated approach at DaimlerChrysler, in: Integrated Reliability Workshop Final Report, 1999. IEEE International, 199
9. Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering.
R. Abreu, P. Zoeteweij, R. Golsteijn, A. J. V. Gemund, A practical evaluation of spectrum-based fault localization, Journal of Systems and Software 82 (11) (2009) 1780-1792.
V. Sankaran, A. R. Kalukin, R. P. Kraft, Improvements to X-ray laminography for automated inspection of solder joints, Components, Packaging, and Manufacturing Technology, Part C, IEEE Transactions on 21 (2) (1998) 148-154.
C. Neubauer, Intelligent X-ray inspection for quality control of solder joints, Components, Packaging, and Manufacturing Technology, Part C, IEEE Transactions on 20 (2) (1997) 111-120.
X. Maldague, Theory and practice of infrared technology for nondestructive testing, Wiley Series in Microwave and Optical Engineering, 2001.
G. Deng, J. Qiu, G. Liu, K. Lv, A novel fault diagnosis approach based on environmental stress level evaluation, Proceedings of the Institution of Mechanical Engineers, Part G: Journal of Aerospace Engineering 227 (5) (2013) 816-826.
R. White, B. Richardson, Anecdotal experiences on the value of limited environmental testing for the analysis of no fault found assemblies, in: AUTOTESTCON Proceedings, 2011, pp. 292-296.
J. Ramsey, Special Report: Avoiding NFF, Avionics Magazine, 2005.
Y. S. Chang, C. H. Oh, Y. S. Whang, J. J. Lee, J. A. Kwon, M. S. Kang, J. S. Park, Y. Ung, Development of RFID enabled aircraft maintenance
10. can use that information to determine how best to employ various diagnostics technologies (e.g. BIT, diagnostic reasoning, ATE, etc.) to detect failures in the future. It seems that more specific standards focusing solely upon NFF mitigation might become much more prominent, as they can promote best-practice approaches within maintenance sectors. However, solutions will not reside only within different maintenance echelons but should also focus on a much broader scope, considering factors such as design, manufacturing, testing, organizational imperatives, operator priorities, technological capabilities, contractual agreements and financial management.
This study highlights the fact that the majority of research that has been published lies primarily within aerospace proceedings, such as IEEE publications, and other engineering outlets. Surprisingly, there are no dedicated textbooks on the topic, and the authors strongly feel that the maintenance community would benefit from the publication of one. Also, the authors advocate that the focus of published material needs shifting from the technical issues towards the business side. This could be used as an opportunity to quantify the costs involved in NFF events and might influence the way contractual agreements are being set up nowadays. Each industry sector approaches NFF differently, i.e. OEM, maintenance suppliers and operators, manufacturer, etc. When unplanned maintena
11. maintenance pressures and the shotgun effect.
4.1. Detecting Blind Spots
When it is suspected that NFF occurs due to a lack of fault coverage by the ATE or BITE, there comes the requirement to use additional tools which are capable of identifying the root cause of the problem. Ungar and Kirkland (2003) [79] argue that, to achieve this, an understanding of the Physics of Failures (PoF) within the operating environment is needed. Once this is known, appropriate test equipment can be selected to support the ATE and, through interpretation of the physics (for example, of circuits under the test environment), be used as fault locators, a capability often beyond that of standard ATE. In fact, Kimseng et al. (1999) identified a PoF process to identify, induce and analyze not only failure mechanisms causing intermittent failures, but also high warranty returns and NFF problems of the digital electronic [85]. As previously discussed, many of the faults which contribute to NFF events in electronics are of an intermittent nature. These usually provide a challenge
Footnote 14: i.e. the maintainer is left to troubleshoot the system using their best guess, which will often result in the replacement and removal of modules that are perfectly good.
Physics of Failures (PoF) is a concept utilized to understand the processes and mechanisms that induce failure within a component. This includes studying physical, chemical, mechanical, electrical
12. modified designs for much more reliable parts [60]. However, most of the knowledge resides only within the heads of a few key experts or in personalized organizational databases, which usually are consulted only after a problem has resisted several attempts at resolution. Therefore, on-site experience must be blended with other diagnostic and prognostic tools and techniques [42]. The obvious challenges here are:
1. To store this experience-based knowledge and deliver it at the time and place that the same problem symptoms occur, so that it can be re-used to help solve the problem on the first attempt.
2. To deliver that knowledge in a form that is useful to experts and less experienced technicians alike.
3. To share this knowledge so that everyone benefits from the experience of others.
4. To integrate the knowledge access with the existing troubleshooting tools so that it becomes part of the usual troubleshooting workflow.
Human factors must be considered with respect to troubleshooting performance [61]. A diagnostic reasoning system could hence be useful to provide such information, along with high-quality feedback to the design engineers [62]. With the entry of symptoms, the possible failure modes can be identified from the knowledge database and increasingly incisive information can be requested. To the troubleshooter this can act as efficient guidance; to the design engineer this can be an intelligent interview autom
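The symptom-driven lookup described above can be illustrated with a minimal sketch. It assumes a knowledge base that maps each failure mode to its known symptoms and ranks candidates by Jaccard overlap with the observed symptoms. The ranking rule, the function name and the example entries are illustrative assumptions, not the reasoning method of any system cited in the paper.

```python
from typing import Dict, List, Set, Tuple


def rank_failure_modes(
    observed: Set[str], knowledge_base: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank failure modes by Jaccard similarity between observed and known symptoms."""
    ranked = []
    for mode, symptoms in knowledge_base.items():
        overlap = len(observed & symptoms)
        if overlap == 0:
            continue  # no shared symptom: not a candidate
        score = overlap / len(observed | symptoms)
        ranked.append((mode, score))
    # Highest score first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical knowledge base entries, for illustration only.
knowledge_base = {
    "chafed harness": {"intermittent signal", "fails under vibration"},
    "fretting connector": {"intermittent signal", "high contact resistance"},
    "burnt-out resistor": {"no output"},
}
for mode, score in rank_failure_modes(
    {"intermittent signal", "fails under vibration"}, knowledge_base
):
    print(f"{mode}: {score:.2f}")
```

In practice the lower-ranked candidates indicate which further symptoms are worth asking about, which corresponds to the increasingly incisive information requests mentioned in the text.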
13. to recognise rogue units is of paramount importance in mitigating the effects of NFF events and to ensure operating safety, particularly in the case of an aircraft. The key to distinguishing a rogue unit is to implement the necessary procedures to track rogue units by serial number, showing the date installed and removed, the platform on which the unit was installed, number of operating hours or cycles, number of hours since its last overhaul and a solid reason for the generated removal codes. In addition to this, the history of the operating platform, be that a wind turbine, aircraft or train, needs to be recorded with an easy-to-use retrieval system [2]. The importance of such historical data is to aid in determining the exact effects the failure has on the overall system and whether the replacement of the unit offers a high level of confidence of rectifying the problem.
Some airlines in the UK operate within a spare parts pool where the policy is that if a unit is returned to the pool labeled NFF more than three times, then that unit will be scrapped. This has the advantage that the spare parts pool will become less polluted with units which are rogue. However, this only encourages the culture of accepting NFF and not searching out the root cause, which may be a fundamental manufacturing flaw present in equivalent units, such as a batch of faulty capacitors which have been used in the unit's production. Likewise, it could be a system de
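The tracking record described above, together with the "more than three NFF returns" pool policy, can be sketched as a small data structure. The field names and the function are hypothetical, chosen only to mirror the items listed in the text. Flagged serial numbers are treated here as candidates for root-cause investigation rather than automatic scrapping, in line with the caution raised above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class RemovalRecord:
    """One removal event for a serialized unit (field names are illustrative)."""
    serial_number: str
    platform: str
    date_installed: date
    date_removed: date
    operating_hours: float
    hours_since_overhaul: float
    removal_reason: str
    found_nff: bool


def rogue_candidates(records: Iterable[RemovalRecord], nff_limit: int = 3) -> List[str]:
    """Return serial numbers returned as NFF more than `nff_limit` times."""
    nff_counts = Counter(r.serial_number for r in records if r.found_nff)
    return sorted(serial for serial, count in nff_counts.items() if count > nff_limit)
```

Keeping the full record, rather than only a counter, preserves the platform and removal-reason history needed to check whether the same fault follows the unit or stays with the platform.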
14. which an item can be tested under stated conditions. As more sophistication is added to electronic systems, the ability to maintain them is becoming ever more difficult and costly. Standard testing using Automatic Test Equipment (ATE) usually includes features such as timing, signal strength, duplicating the operating environment, loading, fanout and properly interconnecting the Unit Under Test (UUT) [60, 79, 80, 81, 82]. The idea of ATE is to force the UUT to fail without actually injecting faults. The ability to do this is directly related to its testability. Testability is a design-related characteristic which, if designed well, will provide the capabilities to confidently and efficiently identify existing faults. The number of tests and the information content of test results, along with the location and accessibility of test points, define the testability potential of the equipment. The two attributes which must be met for testability success are:
1. Confidence: this is achieved by frequently and unambiguously identifying only the failed components or parts, with no removals of good items.
2. Efficiency: this is achieved by minimizing the resources required to carry out the tests and overall maintenance action. This includes minimal yet optimized man-hours, test equipment and training.
It is evident that the conventional ATE methods used within the maintenance line, as required from the testability design, are not successful 2 5 2
15. 1 83 They perhaps are not carrying the necessary levels of confidence and efficiency, or are inappropriate in the many industries which are suffering NFF difficulties. If testability as a design characteristic was successful, NFF would not be so problematic. This is particularly evident in the case of attempting to detect and isolate intermittent faults at the test station. The chance of testing for short-duration intermittency at the very moment that it re-occurs using conventional methods is so remote that it will almost certainly result in an NFF. The one major issue with designing component testability is that the focus is on functionality and integrity of the system [46].
Other difficulties with testability are that in most cases there is a complete lack of information regarding standardized tools for the evaluation of Design for Testability (DfT). For testability to be consistent within the design process, to achieve the necessary levels of confidence and efficiency, these standard definitions, procedures and tools must be developed.
Footnote 13: There are design techniques that are added to obtain certain testability features during hardware product design. The premise of the features is that they can make it easier to develop and apply manufacturing tests and to validate that the product hardware contains no defects that could otherwise adversely affect the product's correct functioning, e.g. boundary scanning.
A testability evaluati
16. 11
N. Vichare, P. Rodgers, V. Eveloy, M. Pecht, Environment and usage monitoring of electronic products for health assessment and product design, International Journal of Quality Technology and Quantitative Management 4 (2) (2007) 235-250.
V. A. Skormin, V. I. Gorodetski, L. J. Popyack, Data mining technology for failure prognostic of avionics, Aerospace and Electronic Systems, IEEE Transactions on 38 (2) (2002) 388-403.
R. Karim, O. Candell, P. Soderholm, E-maintenance and information logistics: aspects of content format, Journal of Quality in Maintenance Engineering 15 (3) (2009) 308-324.
P. O. Larsson-Kraik, Managing avalanches using cost-benefit-risk analysis, Proceedings of the Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit 226 (6) (2012) 641-649.
D. H. Stamatis, Failure mode and effect analysis: FMEA from theory to execution, ASQ Press, 2003.
C. S. Byington, P. Kalgren, B. K. Dunkin, B. P. Donovan, Advanced diagnostic/prognostic reasoning and evidence transformation techniques for improved avionics maintenance, in: Aerospace Conference, 2004. Proceedings. 2004 IEEE, Vol. 5, IEEE, 2004.
L. Y. Ungar, Testability design prevents harm, IEEE Aerospace and Electronic Systems Magazine 25 (3) (2010) 35-43.
N. M. Morris, W. B. Rouse, Review and evaluation of empirical research in troubleshooting, Human Factors: The Journal of the Human Factors and Ergonomics Society 27 (5)
17. 22 The traditional approach not only fails to spot time-dependent failures, such as those exhibited under vibration, but could inherently ignore combinatorial faults that occur due to wire-to-wire interactions. Another issue is chafed wiring, which occurs where a harness is routed through a structure that experiences high vibration levels. Unless adequate protection such as cable clamps, ties, sleeving, etc. is provided, the wiring bundle will brush the structure in such a way that damages internal wiring without external evidence. Such wiring faults are extremely difficult to detect and risk the maintenance crew incorrectly rejecting products which are associated with this particular signal path. Wire breaks are common in harnesses and are likely to manifest as a hard fault for a period determined by the vibration and temperature profile. However, in order to correctly isolate the failure in an ambient environment, stressing of the harness may be necessary to simulate the conditions in which the failure occurred. In cases where the fault is intermittent and the exact operating conditions are not known, the failure may not be correctly attributed as being in the harness, which will lead to the suspicion that the unit is at fault and requires replacing. This is particularly true for those maintainers who operate within the constraints of fast turnaround times.
2.2. Mechanical Systems
The failure mechanisms wit
18. 9, pp. 152-153.
H. Qi, S. Ganesan, M. Pecht, No-fault-found and intermittent failures in electronic products, Microelectronics Reliability 48 (5) (2008) 663-674.
B. G. Moffat, E. Abraham, M. P. Desmulliez, D. Koltsov, A. Richardson, Failure mechanisms of legacy aircraft wiring and interconnects, Dielectrics and Electrical Insulation, IEEE Transactions on 15 (3) (2008) 808-822.
G. Huby, No fault found: Aerospace survey results, Tech. rep., Copernicus Technology Ltd, 2012.
J. Jones, J. Hayes, Investigation of the occurrence of no-faults-found in electronic equipment, Reliability, IEEE Transactions on 50 (3) (2001) 289-292.
I. J. James, Learning the lessons from in-service rejection, in: Systems Reliability and Maintainability (Ref. No. 1999/189), IEE Seminar, 1999, pp. 6/1-6/4.
A. W. Gibson, S. Choi, T. R. Bieler, K. N. Subramanian, Environmental concerns and materials issues in manufactured solder joints, in: Electronics and the Environment, 1997. ISEE-1997, Proceedings of the 1997 IEEE International Symposium on, 1997, pp. 246-251.
J. Swingler, The automotive connector: The influence of powering and lubricating a fretting contact interface, Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering 214 (6) (2000) 615-623.
S. Khan, P. Phillips, Tackling no fault found in maintenance engineering, in: 1st Annual Symposium in No Fault Fou
19. In these cases they may exhibit periodic failures due to inherent incompatibilities between the system interfaces; symptoms may include relative timing errors and synchronization issues. The systems may not show any evidence of failure for many years of service, but as the system interfaces become affected by wear and drift, failures become evident. This can result in a root cause misclassification, with the root cause being diagnosed as component ageing rather than the fundamental design issue with the interface.
Another major contributor to solder joint damage is thermal stress related to heat expansion, shock and vibration. During operation, these stresses cause metal-metal interconnects to rub against each other and damage any protective coating. Such effects accumulate over time and will typically last for periods of less than hundreds of nanoseconds. Such manifestations fracture the solder contacts and instigate intermittent faults. Electrical intermittency is also caused by contact fretting [15, 20]. Fretting corrosion occurs particularly in tin-plated contacts as a degradation mechanism caused by the presence of humidity, which oxidizes the metal-metal interface. The accumulation of oxides at the contacts causes an increase in resistance and electrical intermittency due to the repetitive sliding movements. Other root causes of NFF events in electronics include creep corrosion and the phenomenon known as tin whiskers [14]. Creep corrosion
20. No Fault Found events in maintenance engineering Part 2: Root causes, technical developments and future research
Khan, S., Phillips, P., Hockley, C. and Jennions, I.
Author post-print (accepted) deposited in CURVE May 2015
Original citation & hyperlink: Khan, S., Phillips, P., Hockley, C. and Jennions, I. (2014) No Fault Found events in maintenance engineering Part 2: Root causes, technical developments and future research. Reliability Engineering & System Safety, volume 123: 196-208. http://dx.doi.org/10.1016/j.ress.2013.10.013
Publisher statement: NOTICE: this is the author's version of a work that was accepted for publication in Reliability Engineering & System Safety. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Reliability Engineering & System Safety, Vol. 123 (2014), DOI 10.1016/j.ress.2013.10.013. 2015, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
This document is the author's post-print version, incorporating any revisions agreed during the peer-review process. Some differences between the published version and this version ma
21. al B. C. Wadell, Predicting and eliminating built-in test false alarms, Reliability, IEEE Transactions on 39 (4) (1990) 500-505.
L. Y. Ungar, L. V. Kirkland, Unraveling the cannot duplicate and retest OK problems by utilizing physics in testing and diagnoses, in: AUTOTESTCON Proceedings, 2008, pp. 550-555.
C. Metra, S. D. Francescantonio, T. Mak, Clock faults impact on manufacturing testing and their possible detection through on-line testing, in: Test Conference, 2002. Proceedings. International, IEEE, 2002, pp. 100-109.
P. O'Connor, Testing for reliability, Quality and Reliability Engineering International 19 (1) (2003) 73-84.
H. Qingchuan, C. Wenhua, P. Jun, Q. Ping, Improved step stress accelerated life testing method for electronic product, Microelectronics Reliability 52 (11) (2012) 2773-2780.
J. W. Sheppard, W. R. Simpson, Applying testability analysis for integrated diagnostics, Design & Test of Computers, IEEE 9 (3) (1992) 65-78.
W. Simpson, B. Kelly, A. Gilreath, Predictors of organizational-level testability attributes, in: Publication 1511-02-2-4179, Annapolis, Maryland: ARINC Research Corporation, 1986.
K. Kimseng, M. Hoit, N. Tiwari, M. Pecht, Physics-of-failure assessment of a cruise control module, Microelectronics Reliability 39 (10) (1999) 1423-1444.
D. Guangian, Q. Jing, L. Guanjun, L. Kehong, A stochastic automaton approach to discriminate intermittent from permanent faults
22. atically being applied anytime that these failure modes appear. When completing the troubleshooting, the maintainers can automatically report on the failure mode and record detailed differentiating symptoms. Also, this information can be of great importance for a Failure Reporting, Analysis and Corrective Action System (FRACAS) procedure, providing valuable insights to engineers [42, 64].
3.2. Test Equipment
Automatic Test Equipment (ATE) is widely used to perform device functional and parametric tests at the back end of the semiconductor manufacturing process [9]. It is a capital-intensive system and typically costs 1-3M depending on the equipment performance. An unscheduled equipment downtime lasting one hour could cause significant amounts of production loss.
Various reliability and maintenance databases can be compiled, such as [63], eliciting information useful in scheduling maintenance and design activities.
Footnote 10: FRACAS (Failure Reporting, Analysis and Corrective Action System) is a reactive procedure often utilized after failures have occurred within a system. It is used to collect data, report, categorize and analyze information, and to plan corrective actions in response to those failures.
Figure labels: The Design World; Failure Modes and Effects Analysis; Design Engineers anticipating what will fail and preparing for it; Built-In Test; Prognostic and Health Monitoring; Functional Independence Mea
23. be incorporated and processed at the Line Replaceable Unit (LRU) level. Decentralization of tests enables the functionality of key circuits to be checked, helping to identify problems much closer to the root causes than is the case in the centralized view, making for cost-effective assembly and maintenance operations [43]. The nature of BITs will be in some way dependent upon a set of pre-defined statistical limits for the various parameters which are being monitored. It is important to recognise at this point that BIT will report failures for the following two reasons:
1. A specified parameter has exceeded a set threshold value.
2. The noise of the BIT measurements throws the test results outside of the testing limits when the System Under Test (SUT) meets required specifications.
The first of these is a direct result of component failure, for example a burnt-out resistor. The second occurs when a measured parameter which has noise is measured by an instrument having its own noise; this is common in integrated manufacturing processes, digital system timings and radar systems [78]. One of the areas of concern within these statistical limits is that they may have been inappropriately set without a true understanding of hardware-software interactions or the nature of the equipment's operating environment. This will therefore inevitably lead to BIT false alarms.
3.2.2. Other Methods
Some other techniques which have been propos
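The second BIT failure reason given above, noise pushing a healthy reading outside the test limits, can be made concrete with a small calculation. The sketch below assumes that the parameter noise and the instrument noise are independent, zero-mean and Gaussian, so their standard deviations add in quadrature. These modelling assumptions, the function names and the example voltages are illustrative and are not taken from the paper or from [78].

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def out_of_limits_probability(
    true_value: float,
    lower_limit: float,
    upper_limit: float,
    parameter_noise_sd: float,
    instrument_noise_sd: float,
) -> float:
    """Probability that one noisy reading falls outside [lower_limit, upper_limit].

    For a unit whose true value is within specification, this is the
    per-measurement BIT false alarm probability under the stated assumptions.
    """
    # Independent Gaussian noise sources combine in quadrature.
    combined_sd = math.hypot(parameter_noise_sd, instrument_noise_sd)
    if combined_sd == 0.0:
        return 0.0 if lower_limit <= true_value <= upper_limit else 1.0
    p_inside = normal_cdf((upper_limit - true_value) / combined_sd) - normal_cdf(
        (lower_limit - true_value) / combined_sd
    )
    return 1.0 - p_inside


# Hypothetical 5.0 V rail tested against 4.7 V to 5.3 V limits.
print(out_of_limits_probability(5.0, 4.7, 5.3, 0.1, 0.0))
print(out_of_limits_probability(5.0, 4.7, 5.3, 0.1, 0.1))
```

Under these assumptions, adding instrument noise equal to the parameter noise raises the per-reading false alarm probability by roughly an order of magnitude, which illustrates why limits set without regard to measurement noise tend to produce BIT false alarms.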
24. cal contributors to failures before the product is put into service.

6 Concluding Remarks

An important part of any new research subject is the design and maintenance of a reference collection of relevant publications. To the best of the authors' knowledge, the performed study has moved the body of scientific knowledge forward by reviewing the existing literature related to NFF and pointing out core gaps where current efforts should be focused. An attempt is made to comprehensively review academic journal literature and conference proceedings on the topic. The aim is to provide a general picture of the research areas undertaken in the past few decades and to create a database of the academic literature on NFF concepts and applications from 1990 to 2013, by classification and statistical analysis. It is evident that the NFF phenomenon has gained the most attention in the last decade. This is possibly due to increasing system complexities, reliability requirements and cost implications.

The article reported various occurrences and root causes that have resulted in NFF events. Current industrial practices were discussed, whilst highlighting the importance of capturing and sharing as much information as possible to support rapid diagnostics and troubleshooting workflow. Furthermore, emphasis was placed on the importance of having feedback mechanisms to transfer maintenance event information to design engineers who
25. e status will be detected this way. Also, measuring fractions of a milliohm and attempting to take meaningful action based on these values is extremely difficult and time consuming, and requires precise control of the test set-up and test environment.

Appropriate test equipment is required to address the intermittency issue and to resolve all of the variables causing this unpredictability, providing the maintainer with a quick and comprehensive route to a successful outcome. Overcoming the testing challenges posed by intermittent problems requires a different approach to that of using conventional digital equipment, which is predicated on accuracy of measurements and time-consuming results analysis. Truly effective and practical detection of intermittency requires improved test coverage and, consequently, a vastly improved probability of detection.

There are also a variety of other high-profile integrity testing methods currently being championed. Most notable of these are X-ray and thermal imaging. X-ray inspection can non-invasively highlight shorts or coupling faults buried within the layers of multilayer printed circuit boards. Sankaran et al. (1998) [88] discuss the use of X-ray laminography for accurate measurements of solder joint structures through 3D image reconstruction using artificial neural networks. Automated inline systems based on X-ray transmission have several advantages over optical inspection. Optical inspec
26. ed include:
1. DC resistance: Traditionally, these techniques have been utilized to monitor the reliability of electronic components, as they are well suited to identifying electrical continuity. However, these methods often do not provide any early indication of failure or physical degradation, and may not be sensitive enough for future electronics that operate at higher frequencies.
2. RF impedance: Kwon (2010) [72] worked on developing an RF impedance method to provide an early indication of interconnect failures. The technique has better sensitivity towards degradation compared to its DC counterpart, due to the phenomenon known as the skin effect. The method takes advantage of the surface concentration of high-speed signals, which depends on the characteristics of the material being passed through at the connection, whilst monitoring the frequency response.
3. Functional process methodology: In order to eliminate warranty-related NFF events, Izquierdo and Ceglarek (2009) [33] demonstrated a methodology based on design tolerances that integrates service or warranty data with manufacturing measurements and existing product models.

Footnote 12: A Line Replaceable Unit (LRU) level is the lowest level at which a modular or sub-unit item of the system can be easily replaced and quickly interchanged.

4 Improvements in Test Abilities

Testability, as defined by IEC 60706-5 [72], is a quantitative design characteristic which determines the degree to
27. ems, notably BIT and prognostic strategies, to keep track.
2. BIT: If BITs were 100% comprehensive and unambiguous at the aircraft level, including interacting systems [34], then they would:
i. Detect every possible problem.
ii. Point with certainty to the defective part, and only where the problem was caused by a defective part (as opposed to operator mishandling, environmental circumstances, etc.).
But to the extent that BIT is lacking, troubleshooting is required.
3. Troubleshooting: In theory, if Fault Isolation Manuals (FIM) or troubleshooting guides were perfect, then every failure that can occur on any aircraft would be swiftly and correctly identified by any maintenance personnel following step-by-step procedures. However, when the FIM fails to identify the problem, the maintainers rely heavily on their experience [5]. Other resources are often used to help: escalation channels, technician training, supporting documentation, etc.
4. On-site or practical feedback: To close the loop with reliability, new system failure modes are often discovered, which adds to the troubleshooting difficulties [26] and acts as a source of feedback to design engineering for reliability improvements.

3.1 Health and Usage Monitoring

Condition Based Maintenance (CBM) programmes can be aimed at either fault diagnostics or prognostics [35]. Diagnostics refers to a posterior event analysis and deals with fault detection (indicates a fault has occurred
28. en involved with improving maintainability, particularly in the airline industry when dealing with legacy aircraft. The more general issues include [39]:
1. Any technological enhancements must work within existing architectures.
2. The information available from lower test levels is typically predefined and costly to improve or change.
3. Hardware development can be costly and outweigh potential cost-saving benefits.
4. There may be limited space for additional processing capabilities to support improved diagnostics.

However, the authors would like to emphasize that if there is no safety-related or operational consequence of the failure, then corrective maintenance is probably the most effective maintenance approach to adopt. The choice of an appropriate failure management strategy is guided by methodologies such as Reliability Centered Maintenance (RCM) [42, 43] for military aviation and other applications, or Maintenance Steering Group 3 (MSG-3) [46] for civil aviation.

3.1.1 Monitoring and Reasoning of Failure Precursors and Loads

Health monitoring is built upon the premise that there exist precursor indications of failure, in the form of some change in a measurable parameter or signal of the system, which can be correlated with a subsequent failure mode [9, 47]. Using this causal relationship, it is assumed that failures can then be predicted with the correct approaches to reas
29. ential suitable method for monitoring solder joint fatigue inside the packaging of power modules. Bhatia et al. (2010) [71] have used this principle as a basis to develop and test a new solder joint fault sensor, known as the SJ Monitor, which provides the ability to monitor selected I/O pins of powered-off FPGAs.

RF impedance is also used as a failure precursor and offers interesting prognostic capabilities for solder joint failures, because the impedance increases gradually and non-linearly as damage increases, whereas the DC resistance remains constant. The use of RF impedance is researched at length by Kwon (2010) [72], who demonstrates prognostic capabilities which are able to predict the remaining useful life of the solder joint with an error of less than 3%. The research also demonstrates the ability to distinguish between two competing interconnect failure modes, solder joint cracking and pad cratering; the need for such failure distinctions in this case, however, is unclear.

The use of embedded molecular test equipment within ICs, enabling them to continuously test themselves during normal operation and to provide visual indications of failure, has been proposed by GMA Industries as one of the more advanced and futuristic monitoring technologies [29]. The sensors are used to measure electrical parameters and various signals such as current and voltage, as well as sensing changes in the chemical structure of integrated circu
30. hin a mechanical system are widely regarded as having less of an effect upon the rate of NFF occurrences than those which are present within electrical systems. The causes of failure in mechanical systems are similar to those in electrical systems, such as ageing, poor maintenance, and incorrect installation or usage. The difference, however, is that with mechanical failures it is much easier to predict the effect upon the system's operation. As a result, inspection criteria can be developed during the design phases [23]. It should be noted that, as with many electrical failures, mechanical failures can be intermittent in nature, occurring only under specific operating conditions. Some of the more common mechanical failures which contribute to diagnostic failure, and which are of interest but receive a lot less attention than the electrical failures, are:
1. Broken seals and leaks: Leaks from broken seals will affect the operation of items which include engines, gearboxes, control actuators and hydraulic systems. Seals are often designed to weep slightly. This is a good example of the need for maintenance personnel to be familiar with the system, and hence to be aware of what constitutes acceptable leakage, in order to avoid unnecessary removals.
2. Degradation of pneumatic and hydraulic pipes: Degrada t

Footnote 3: Also, tin whisker growth, which can cause short circuits, is much more likely in lead-free solder [21].
31. ibration time history was recorded throughout all stages of the shuttle's mission and used with physics-based damage assessment models to predict the health and the time before the next expected electronic failure. A similar methodology was applied to the end-effector electronics unit inside the robotic arm of the space shuttle's remote manipulator system [52]. In this case, loading profiles for both thermal and vibrational loads were used with damage models, inspections and accelerated testing to predict the component integrity over a 20-year period. Lall et al. (2007) [53] presented a methodology to calculate prior damage in electronic interconnects operating in harsh environments, and hence subjected to highly cyclic and isothermal thermo-mechanical loads, with assessment predictions in good correlation with experimental data obtained using health monitoring tools.

Understanding electronics from a system point of view, rather than as a set of individual components, is claimed by VEXTEC Corporation to be paramount to developing life-cycle prognostic models as part of a failure reduction methodology [11]. The proposed methodology has far-reaching consequences for how operators can manage a fleet of aircraft based upon risk, rather than by guessing degradation levels. It is argued that, by doing this, NFF events can be reduced through the ability to prioritise, based on probabilities, the order in which components are replaced during a reported failure event. De
32. ich is highly challenging and is becoming even more important due to the increasing complexity and criticality of technical systems. Part 1 introduced the fundamental concept of unknown failures from an organizational, behavioral and cultural standpoint. It also reported an industrial outlook on the problem and recent procedural standards, whilst discussing the financial implications and safety concerns. In this issue, the authors examine the technical aspects, reviewing the common causes of NFF failures in electronic, software and mechanical systems. This is followed by a survey of the technological techniques actively being used to reduce the consequences of such instances. After discussing improvements in testability, the article identifies gaps in the literature and points out the core areas that should be focused on in the future. Special attention is paid to recent trends in knowledge sharing and troubleshooting tools, with potential research on technical diagnosis being enumerated.

Keywords: No fault found, test equipment, troubleshooting failures, fault diagnostics, maintainability, testability

1 Introduction

Part 1 extensively discussed the organizational complexities and challenges faced by businesses today in attempts to administer solutions to the problems caused by unidentified failures. It also described in detail the method applied for the collection and analysis of the referenced literature. This was included not only to judge the validity of these pa
33. ion and meet specific requirements. However, since standards and guidelines are prepared to be generic, they only briefly consider the handling of any malfunctions caused by software faults and their effects in FMEA [29]. Software components are often delivered with little access to the source code, which provides only a partial view of their internal functionality. With restricted access in these Off-The-Shelf (OTS) solutions, unpredictable effects and integration faults are likely to undermine critical software functions, and can be difficult to diagnose and locate [30]. Investigations into failures within aerospace missions have highlighted critical failures that are due to such components, along with incomplete software specifications [31]. Many of the reported issues in this paper can be attributed to complacency and misunderstanding of software f

Footnote 4: FMEA (Failure Mode and Effects Analysis) is recognized as one of the most effective methods to identify and remove critical reliability issues. This procedure is commonly used to influence the system design before it is commissioned, enumerating potential failure modes that may occur during operation. These analyses are performed proactively to assess the impact of various failure modes during the product development and maintenance stages [14]. Risk priority numbers can also be assigned to each of the failure modes, based on factors such as detectability, severity and occurrence.
34. ion within pipes often occurs due to corrosion or fretting against other components or structures. Pneumatic and hydraulic systems may, under pressure, develop small leaks. These minor leaks may result in an alarm to the operator indicating failure, resulting in the unwarranted shutdown of the system when no equipment malfunction has actually occurred.
3. Backlash in mechanical systems: One area where backlash can cause significant concern is within actuation systems, particularly those used for aircraft control surfaces. It is possible that, with excessive wear in actuator couplings, position sensors may indicate incorrect operation, including asymmetric settings, which are difficult to isolate from a maintenance perspective.

2.3 Software Systems

It is clear that a great deal of NFF events occur in avionics, electrical and electro-mechanical systems; however, research discussions have also revealed that software, including Built-In Tests (BIT), is also a key contributor to the problem [5, 24, 25, 26]. This includes:
- Processing delays
- Discrepancies between software testing procedures
- Timing errors
- Lack of appropriate training
- Perhaps poorly written program code

Industry-specific standards exist, such as IEC 62278 [27] for railways, and IEC 60812 [28] is often referred to when carrying out Failure Mode and Effects Analysis (FMEA) for software-based systems; these can be used to validate software operat
35. is a mass transport process in which solid corrosion products migrate over a surface on Integrated Circuit (IC) packages and eventually result in electrical shorts or signal deterioration, due to the bridging of corrosion products between isolated leads. Depending on the nature of the corrosion product (conductive or semi-conductive, dry or wet), the insulation resistance can vary, thus potentially causing intermittent loss of signal integrity. A pure tin finish is well known to produce conductive metal whiskers that are capable of creating unintended current paths. These failures usually appear intermittently, making it difficult to identify them as a root cause of the problem, since the whiskers are easily broken off and can melt, removing a previously existing short [8]. In the case of a reported failure where there is no hard or definite symptom sufficient for a fault diagnosis, there will be a need for additional technical data or specialist technical knowledge. This can be in the form of maintenance history, troubleshooting guides, or expertise from experienced colleagues and specialists [2, 5].

2.1.2 Harness Wiring

A key aspect of interconnect and wiring-related failures is that they will often not be detected by the traditional one-path-at-a-time, sequential mode of analysis

Footnote 2: These failures can occur under several scenarios; a common failure is where the surface mount packages used are knocked off during socket insertion.
36. iticality analysis (FMECA) is an extension of FMEA [48].

of the product to reduce its service life. Suppliers and operators, particularly within the airline industry, spend significant resources attempting to determine the root causes of NFF events, but without any measured field conditions a root cause analysis can struggle to capture the necessary information. This poses an even more significant challenge that requires additional, specific sensing equipment and data loggers. Burns et al. (2002) [50] demonstrate the development, laboratory testing and in-flight testing of such specific equipment for monitoring the environment of an aircraft avionic power system. The equipment, termed the Aircraft Environment Monitor Power Quality (AEM-PQ), allows over two years of continuous data measurements to be collected for evaluation of the quality of power systems under different operational scenarios. The hardware and the data gathered are a prime example of the information-gathering abilities which are required to evaluate the influence of life-cycle loads on a specific mission-critical system. The added bonus of this data is that it provides the foundations for troubleshooting NFFs, which can aid in re-evaluating system avionic design and establishing models for life-cycle analysis.

Life-cycle monitoring has been used to conduct prognostic Remaining Useful Life (RUL) estimates of circuit cards inside a space shuttle's solid rocket booster [51]. V
37. its that are indicative of developing failure modes. The basic structure of the sensors is carbon nanotubes, and the integration of these sensors with conventional ICs, along with molecular wires for the interconnecting sensor networks, is the important focus of this research. However, no details of demonstrable in-service products or prototypes are given, and to date no research paper offering proof of the applicability of the concept has been found.

Recently, a sensitive analyzer was introduced by Universal Synaptic to simultaneously monitor test lines for voltage variation, and it seems to have become an attractive tool for detection of intermittency [73, 74]. Conducting the intermittency test on all lines simultaneously increases the probability of detection and reduces the time taken to complete the test, because multiple points are tested at once rather than one line at a time; this makes it a potentially effective test methodology. It has been used on the F-16 AN/APG-68 radar system Modular Low Power Radio Frequency (MLPRF) unit, where 36 million dollars' worth of assets previously deemed unrepairable have been returned as serviceable. The equipment has also shown considerable promise in the UK military on the Tornado and Sentinel aircraft fleets [2]. Other similar work on intermittent fault detection has been done by Muja and Lamper (2012) [75] and Smith et al. (2009) [76].
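The gain in probability of detection from monitoring all lines at once can be sketched with a simple model. The model is an assumption made for illustration and does not come from the paper or from the equipment described: one faulty line produces intermittent events as a Poisson process with a constant rate, and the total test time is fixed. All names and figures below are hypothetical.

```python
import math


def detection_probability(event_rate_per_hour, monitored_hours):
    """P(at least one intermittent event occurs while the line is watched).

    Assumes events on the faulty line follow a Poisson process with a
    constant rate (an illustrative assumption).
    """
    return 1.0 - math.exp(-event_rate_per_hour * monitored_hours)


def compare_strategies(event_rate_per_hour, total_test_hours, n_lines):
    """Return (sequential, simultaneous) detection probabilities.

    Sequential: the test time is shared equally, so the faulty line is
    watched for only total_test_hours / n_lines.
    Simultaneous: every line is watched for the whole test.
    """
    sequential = detection_probability(
        event_rate_per_hour, total_test_hours / n_lines)
    simultaneous = detection_probability(
        event_rate_per_hour, total_test_hours)
    return sequential, simultaneous


if __name__ == "__main__":
    # Hypothetical case: 2 events per hour, 1 hour of testing, 100 lines.
    seq, sim = compare_strategies(2.0, 1.0, 100)
    print(f"one line at a time: {seq:.3f}")
    print(f"all lines at once:  {sim:.3f}")
```

Under these assumptions the sequential test watches the faulty line for only 1/N of the available time, so its chance of coinciding with an intermittent event falls sharply as the number of lines grows.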
38. lability of the equipment [1]. Even though these checks may be designed as a means to detect and locate equipment faults, there are a variety of shortcomings which contribute to the NFF phenomenon. Many experts hold that the design of a BIT system is a non-trivial task and relies deeply on knowledge of all the system interactions [5, 43]. Because of this, it is often difficult to define a fixed set of test procedures that can verify the full functionality of a component. This has led to log reports containing spurious fault detections. For example, operator (pilot) reports of faults do not always correspond to the test logs, resulting in overlooked maintenance issues. Also, even with the sophistication of modern tests, there is still a major issue of removed units being reported by the test to be at fault but, upon testing, being found to have no faults, or even faults that do not correlate with the BIT reports. As well as the false alarm issue, other factors such as assessment coverage and inappropriate parameter limits can in turn contribute to NFF events [2].

Assessment coverage deals with the nature of the BIT, which could be designed in several different ways, making the checks dependent on the monitored equipment and system scale. A system-wide BIT will be either centralized, where dedicated hardware is used to control all functions, or decentralized, where a number of test centers can

Footnote 11: This has been discussed in Part 1, Section 4.
39. mittent faults, with the exception of open and short circuits. Also promising are reflectometry methods, which are proving to be useful when applied to locating intermittency in an F-18 flight control harness [67], although they do require exceptional accuracy in baseline comparisons. In civil and military aerospace, recording and maintaining TDR data archives for even a limited number of circuits may prove to be an enormous and costly task [68]. Another technique, called spread spectrum time domain reflectometry (SSTDR), is being used commercially to identify faults in electrical wires by observing reflected spread spectrum signals (Parkey et al. [69]).

CMOS Integrated Circuits (IC) are routinely tested using supply current monitoring, which is based upon the knowledge that a defective circuit will draw a significantly different amount of current than a fault-free circuit. Smith and Campbell (2000) [70] have developed an in-situ quiescent current monitor that detects, in real time, elevations in the leakage current drawn by the IC whilst in a stable state. Other similar current monitors have been reviewed by Pecht (2006) [43].

Damage to electronic solder joints is a major contributor to intermittency in electronics, and hence a direct contributor to the NFF phenomenon. Damaged solder joints are notoriously difficult to detect without extensive visual inspections. They do, however, produce large variations in thermal resistance, which can be used as a pot
40. nce regimes are initiated, the costs along the supply chain (warranty, downtime, operational fines) are expected to raise concerns. In either case, researchers and scientists should aim to publish NFF-related research in management and business journals to emphasize its importance. This will help to promote knowledge, in addition to overcoming barriers to NFF investment and the lack of a business case caused by the absence of standardized methods or metrics for costing impacts.

6.1 Future Perspectives

The core areas where efforts should be focused are:
1. Establishing a consistent NFF taxonomy.
2. Failure knowledge bases, novel FMEA tools and troubleshooting guides specific to NFF, to improve diagnostic success rates.
3. Development of assessment tools to assess maintenance capability or effectiveness, which may include:
i. Recording and cross-referencing test station configuration and performance statistics with NFF occurrences. This includes statistics on equipment calibrations.
Ensuring that the testing environment is correct, and investigating whether testing procedures need modification to consider multiple environmental factors (humidity, temperature, vibration, etc.) simultaneously.
Introduction of integrity testing as complementary to standard ATE functional testing procedures.
i. Integration of on-board health and usage monitoring.
ii. Standardization for intermittent testing and procedures for dealing with intermittent faul
41. nd 2013.
W. Shawlee, D. Humphrey, Aging avionics: what causes it and how to respond, Components and Packaging Technologies, IEEE Transactions on 24 (4) (2001) 739-740.
S. Khan, P. Phillips, C. Hockley, I. Jennions, Towards standardisation of no-fault found taxonomy, in: 1st International Through-life Engineering Services Conference 2012, 2012, pp. 246-253.
L. Warrington, J. A. Jones, N. Davis, Modelling of maintenance, within discrete event simulation, in: Reliability and Maintainability Symposium, 2002. Proceedings. Annual, IEEE, 2002, pp. 260-265.
G. Ramohalli, The Honeywell on-board diagnostic and maintenance system for the Boeing 777, in: Digital Avionics Systems Conference, 1992. Proceedings, IEEE/AIAA 11th, IEEE, 1992, pp. 485-490.
I. Beniaminy, D. Joseph, Reducing the no fault found problem: Contributions from expert-system methods, in: Aerospace Conference Proceedings, 2002. IEEE, Vol. 6, 2002, pp. 6-2971 to 6-2973.
J. Xie, M. Pecht, Applications of in-situ health monitoring and prognostic sensors, in: The 9th Pan Pacific Microelectronics Symposium Exhibits and Conference, 2004, p
42. ng Federal Aviation Administration (FAA) certifications for RFID tracking systems, intended as a standard component on all new 737, 777 and 787 commercial aircraft, as well as on a variety of their military aircraft. Similarly, Airbus is also promoting the adoption of RFID in the aircraft industry and is developing RFID part-tracking systems for its new A400M military transport plane, as well as for the A380 commercial jet [98].

5 Discussion on Gaps in Literature

In the past few decades there has been a great deal of research to address the NFF issue, but solutions to mitigate the problem are certainly not universal, even within some individual organizations, let alone across a common industry sector. Some of this effort is being directed at the design and production stages, where there is a need to create more fault-tolerant systems, which perhaps incorporate in-built redundancy or self-testing mechanisms. There is also a requirement for a thorough research effort into understanding intermittency. Understanding intermittent faults will rely on the ability to describe accurately the various interactions, and how mechanical, software and electronic elements all have to interact together. Modeling of intermittent faults will be required, but will need to include probabilities of fault detection and the effects intermittent failures have on other dependent systems. A thorough understanding of individual systems will be required
43. nowledge over time can assist designers in improving the reliability of the equipment.

At the core of the challenge for better troubleshooting is the difference between the anticipated failures captured within the design and the actual failures that appear in service. When complex equipment is designed, engineers typically identify the potential failure modes and their effects on the system using an FMEA. With this information, it can be determined how best to employ On-Board Diagnostic or BIT technologies to detect failures. These can implement Prognostics and Health Monitoring (PHM) strategies to detect impending functional failures. In addition, troubleshooting procedures can be prepared in advance for analyzing the functionality of the system, in order to differentiate among the many possible root causes of these anticipated failures. Procedures are contained in troubleshooting manuals or guides, which require human involvement to execute the tests and evaluate the results. As good as they are, these systems are often far from perfect, nor should they be expected to be, given the necessary practical cost-performance trade-offs [5, 57]. Furthermore, existing RCM standards such as IEC 60812 [29] (FMEA), IEC 60300-3-11 [42] and SAE JA1012 [43], and experts on FMEA (Moubray (1997) [41], Stamatis (1995) [58]), emphasize the importance of continuously updating them and making sure that it is a living document tha
44. of these processes and operations provides critical information for decision making. This tracking and tracing is often performed manually, but the adoption of RFID as an automatic identification technology has the potential to speed up processes, reduce recording errors and provide critical part history [95]. The use of RFID technology to track units within a spare parts pool, providing full service histories to the current user [96], has also provided the ability to reduce the number of NFF events by identifying rogue units in the spare parts pool, reducing the costs attributed to phantom supply chains.

Over recent years, RFID technology has begun to be taken very seriously by major aerospace manufacturers such as Messier-Dowty, for use in future landing gear health management systems, and by the world's two dominant aircraft manufacturers, Boeing and Airbus. In 2005, Boeing announced that, in order to improve its ability to track and maintain service histories of its parts, it would require many suppliers of high-value parts for its new 787 Dreamliner aircraft to place RFID tags on all parts before shipping them to Boeing. Even though RFID tagging is considered an expensive option, Boeing argues that, for the additional cost of $15 per tag for a $400,000 primary flight computer, the life-cycle information gained would more than justify the additional expenditure to their customers [97]. In early 2012, Boeing Commercial Aviation Services were still awaiti
45. on should not only provide predictions, but also redesign information when testability attributes are predicted to be below the acceptable levels. There are three testability attributes which can be identified [84]:
1. Fraction of Faults Detected (FFD): Ideally this should be 100%. Any fault not detected by either the BIT/BITE or ATE can result in total loss of the system integrity and hence functionality. In reality, some faults (not safety or mission critical) can be tolerated, and so an FFD of less than 100% may be acceptable when designing for testability.
2. Fraction of Faults Isolated (FFI): If a detected failure is not isolated quickly and efficiently, with high confidence, then the system may end up being kept out of operation for significant periods of time. This puts pressure on maintenance personnel, who are then likely to adopt the shotgun approach of speculative LRU replacements, adding pressure and complications to the sparing and logistics processes and increasing life-cycle costs. Appropriate measures of FFI include Mean Time to Fault Isolation (MTFI), Mean Time to Repair (MTTR) and rates of NFF.
3. Fraction of False Alarms (FFA) or Rate of False Alarm (RFA): This is a measure of the rate at which detected faults turn out to be false alarms upon investigation. It is computed as a time-normalized sum of false alarms, where the normalization is either calendar time or operating hours. High FFA will also lead to
46. oning. The first step in health monitoring is to select the life-cycle parameters to be monitored. This can be done systematically through a Failure Mode, Effects and Criticality Analysis (FMECA). For example, a measurable parameter which can provide an indication of impending failure (a failure precursor) for cables and connectors can include impedance changes, physical damage or a high-energy dielectric breakdown. By monitoring changes in these precursors, a system's health status and additional prognostic information can be evaluated, and unexpected failures could be avoided. A summary of potential failure precursors for electronics is given by Born and Boenning (1989) [49].

The life-cycle environment of a product consists of manufacturing, storage, handling, operating and non-operating conditions, which may lead to physical performance degradation

Footnote 6: Reliability Centered Maintenance (RCM) is a structured approach to ensure that assets continue to do what their users require in their present operating context [40, 41].

Footnote 7: Maintenance Steering Group 3 (MSG-3) based maintenance provides a top-down approach to determine the most applicable maintenance schedule and interval for an aircraft's major components and structure. The methodology effectively delivers significant improvements in an aircraft's availability and operational safety, whilst optimizing the costs of ownership [44, 45].

Footnote 8: Failure mode, effects and cr
47. ons In corrective maintenance, much of the time is spent on locating a defect, which often requires a sequence of disassembly and reassembly. Recently, condition monitoring of railway wheels with NFF problems was investigated by Granstrom and Soderholm (2009) [37]. The authors provided a perspective on how such technologies can be applied and utilized for more effective and efficient maintenance management, while initiating a discussion on the maintenance requirements of systems and the management regimes which are forced onto those systems. The ability to automate fault diagnosis with advanced technologies and techniques could be used to accurately predict the downtime and hence the operational availability. In fact, the role of diagnosability analysis in modern systems, considering their complexities and functional interdependencies, becomes significant, as its improvement can lead to a reduction of a system's life cycle costs [38]. However, it should be noted that such setups are only worthwhile if the benefits significantly outweigh the costs of their introduction and upkeep. Footnote 5: There are other maintenance programmes that do not consider diagnostics or prognostics, e.g. time-based preventive maintenance, where replacement of parts is performed after a predetermined time interval measured by a relevant time measure (e.g. hours, cycles or tonnages), independent of the condition. There are design constraints of t
48. or thermal aspects which influence the performance of the component over time until it eventually fails to meet any system requirements, to signal processing algorithms which are often designed with permanent faults in mind [86]. Some work on resolving such issues has been carried out using algorithms that make use of Bayesian networks to decompose large systems containing multiple components that may potentially fail during operation [87]. Such probabilistic approaches often prove useful for studying the behavior of underperforming subsystems that eventually lead to a system failure. Circuits are typically tested one at a time, or just a few at a time, and unless the intermittent fault occurs within the time window of the test, the fault will go undetected [74]. This is compounded further by digital averaging of results, which means that conventional testing equipment does not provide effective test coverage for intermittency, one of the major drivers for NFF. Alternatives that try to address the intermittency problem with traditional measurements include methods such as tracking and comparing circuits down to fractions of a milliohm, one circuit at a time, against long-running records of similar measurements. However, there are some major limitations to this approach: when an intermittent circuit is in a temporary working state it will generally pass such tests, and only those approaching hard failur
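The test-window argument above can be made concrete with a small sketch. It assumes, purely for illustration, that intermittent events on a circuit arrive as a Poisson process with a constant rate and that the tester catches any event falling inside the window; neither assumption comes from [74], and real intermittency is rarely this well behaved.

```python
import math


def per_circuit_window(total_test_hours: float, n_circuits: int,
                       tested_at_once: int = 1) -> float:
    """Observation time each circuit receives when circuits are scanned in batches."""
    if n_circuits <= 0 or tested_at_once <= 0:
        raise ValueError("circuit counts must be positive")
    batches = math.ceil(n_circuits / tested_at_once)
    return total_test_hours / batches


def detection_probability(event_rate_per_hour: float, window_hours: float) -> float:
    """P(at least one intermittent event occurs inside the window), Poisson assumption."""
    if event_rate_per_hour < 0 or window_hours < 0:
        raise ValueError("rate and window must be non-negative")
    return 1.0 - math.exp(-event_rate_per_hour * window_hours)
```

Under these assumptions, scanning 100 circuits one at a time within a one-hour test gives each circuit 0.01 h of observation, so a fault producing 0.5 events per hour is caught with a probability of roughly 0.5%; monitoring all 100 circuits simultaneously for the same hour raises this to roughly 39%.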
49. pers but also to present a statistical analysis of the academic journal publications on NFF concepts between 1990 and 2013. In addition, the authors categorized the literature into four main areas (fault diagnostics, system design, human factors and data management), noting that fault diagnostics and system design have been the main focus of NFF journal publications within the past two decades. Part 1 also focused on No Fault Found (NFF) standards and how such events can cause unprecedented changes in service performance, impact dependability and escalate safety concerns. This has long been evident with a variety of products within a wide range of industries [1, 2, 3, 4]. This paper aims to elaborate on these outlooks from Part 1, whilst examining the technical aspects for complex systems and equipment, particularly products integrated within aircraft computer systems, and how such events can have a significant effect upon the overall unit removal rate. Historically, such removals have been seen as an unavoidable nuisance [5], but this viewpoint is no longer acceptable if the unit removal rate is to be managed effectively [6, 7]. (Corresponding author. Tel 44 0 1234 75 0111. E-mail address: samir khan theiet org. Preprint submitted to Reliability Engineering and System Safety.) Unlike those failures that result in Confirmed Faulty events, the designer may have no direct influence on those aspec
50. radation detection of electronic products, Microelectronics Reliability 52 (2) (2012) 439-445; A H M Rausand, System reliability theory: models and statistical methods, Chapter 3, Wiley, 2009; F. H. Born, R. A. Boenning, Marginal checking: a technique to detect incipient failures, in: IEEE Proceedings of the National Aerospace and Electronics Conference, Vol. 4, 1989, pp. 1880-1886; D. J. Burns, K. D. Cluff, K. Karimi, D. W. Hrehov, A novel power quality monitor for commercial airplanes, in: Conference Record, IEEE Instrumentation and Measurement Technology Conference, Vol. 2, 2002, pp. 1649-1653; S. Mathew, D. Das, M. Osterman, M. Pecht, R. Ferebee, Prognostics assessment of aluminum support structure on a printed circuit board, Journal of Electronic Packaging 128 (4) (2006) 339; V. Shetty, D. Das, M. Pecht, D. Hiemstra, S. Martin, Remaining life assessment of shuttle remote manipulator system end effector, in: Proceedings of the 22nd Space Simulation Conference, 2002, p. 2123; P. Lall, M. Hande, C. Bhat, J. Suhling, J. Lee, Prognostic health monitoring (PHM) for prior damage assessment in electronics equipment under thermo-mechanical loads, in: IEEE Electronic Components and Technology Conference, 2007, p 109711
51. riation without eliminating the cause. October 2, 2013. The remainder of the paper is structured as follows: after identifying the common root causes for NFF in system components, the paper briefly surveys some industry-specific innovations that have been introduced in order to capture troubleshooting data. Section 4 discusses improvements in test capabilities, followed by a discussion on the identified gaps in NFF literature. Finally, concluding remarks and future directions for research into testability methods, and the necessary design guidance to mitigate the problem, are covered in Section 6. 2. No Fault Found Occurrences in Systems. 2.1. Electronic Systems. Electronic failures are not often considered as static, random or pseudorandom events, but rather the result of mechanical and material changes [9, 10]. These changes seldom lead to a loss of functionality of an electronic system, even though its components may be out of specification. This is due to the electronics having an inherent self-compensating aspect, which makes the task of failure diagnostics difficult and directly contributes to an unsuccessful diagnosis. In addition, the degradation of failure modes often manifests differently depending upon the operating environment, which may offset components, and the circuit configuration [11]. Thomas et al. (2002) [12] and Renner (1999) [13] investigated the root causes of NFF in automotive electronic systems. It was revealed that an o
52. sign flaw leading to integration faults. Either way, scrapping units in this way will inevitably lead to an increase in costs [5]. Other airlines routinely tag and track units that come back with similar reported failure symptoms multiple times. These tagged units are then subjected to special testing that is not usually required, such as thermal shock and environmental tests. Units tagged as rogue are also tracked by the tail number of the aircraft from which they came. (Footnote 6: Units which have been taken out and sent back for repair multiple times are tagged as rogue units.) Technicians then monitor and track repetitive serial numbers using specialized tools to help determine if the unit is a repetitive problem or if the problem is fundamentally an issue with the aircraft [93]. In the case of airlines which are contracted into a spare parts pool utilized by several airlines, the lack of tracking, by design, of units suspected of being rogue means that an airline has no information regarding any unit that it takes from the pool. Advanced tracking methods based upon RFID have begun to gain popularity for predictive maintenance, particularly in the aircraft industry [94]. In the repair process, multiple operations are conducted to repair a complex engineered machine such as an engine, which would include dismantling, inspection, repairing, maintenance and re-assembling. Tracking and tracing of the status
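The tagging practice described above can be sketched as a simple tally over removal records. The record fields, the example thresholds and the function names are hypothetical and chosen for illustration; the source does not state how many repeat removals an airline requires before tagging a unit as rogue.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, NamedTuple


class Removal(NamedTuple):
    """One unit removal event (illustrative fields)."""
    serial: str    # serial number of the removed unit
    tail: str      # tail number of the aircraft it came from
    symptom: str   # reported failure symptom


def find_rogue_candidates(removals: Iterable[Removal], min_repeats: int = 3) -> List[str]:
    """Serial numbers removed at least min_repeats times for the same symptom."""
    counts = Counter((r.serial, r.symptom) for r in removals)
    return sorted({serial for (serial, _), n in counts.items() if n >= min_repeats})


def find_suspect_aircraft(removals: Iterable[Removal], min_units: int = 3) -> List[str]:
    """Tail numbers where at least min_units different units showed the same symptom."""
    units_by_tail = defaultdict(set)
    for r in removals:
        units_by_tail[(r.tail, r.symptom)].add(r.serial)
    return sorted({tail for (tail, _), units in units_by_tail.items()
                   if len(units) >= min_units})
```

A serial number that recurs across different aircraft points towards the unit, whereas different serial numbers failing with the same symptom on one tail number point towards the aircraft, which is the distinction the technicians in [93] are trying to draw.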
53. standard in troubleshooting guidance is the Fault Isolation Manual. These manuals can be costly to produce and maintain within a dynamic environment and are often tied to the technical publications cycle, usually meaning several months between updates. Depending on organizational and cultural factors, it might not be effective to put all the troubleshooting knowledge in a paper-based or electronic guidance format, and hence a diagnostic reasoning engine might be an effective system to implement [42]. Achieving Diagnostic Success: In order to improve diagnostic success rates, improvements need to be made to processes, procedures and technology which have failed. Initial research shows that work towards this goal is patchy and there is definitely more to do. There is almost certainly not one universal industrial solution. The current key areas for NFF mitigation are focused around understanding test coverage (represented by BIT/BITE/ATE deficiencies), development of new maintenance troubleshooting tools, techniques and concepts, as well as changes to management processes. Accurate fault models, fault event trees and system understanding are paramount to recognizing false BIT alarms caused by such things as sensor system synchronization. Also, new systematic tests should be identified in the product design. These tests would aim at allowing multiple testing of stressors, identifying weaknesses and flaws, and the criti
54. [Figure 1: Troubleshooting: Anticipated vs Actual Faults. Labels recoverable from the figure: surement; User Manual; On-site feedback to design; The Practical World; Failure Reporting, Analysis and Corrective Action System; Operators and maintainers experiencing what actually fails and recognizing it.] Reflectometry has commonly been used to determine the integrity of cables and wiring, with effective localization of intermittent faults such as open or short circuits. These methods send a high-frequency signal down the line, which reflects back at impedance discontinuities. The location of the fault is determined by the phase shift between the incident and reflected signals. Sharma et al. (2007) [65] demonstrate a novel architecture for implementing a Sequence Time Domain Reflectometry (STDR) method, which uses a pseudo-noise code to locate open and short circuits on active wires using an integrated CMOS sensor. The approach has a fault localization accuracy of 1 ft with low power consumption for the sensor. Lo and Furse (2005) [66] investigate similar faults but using a different kind of reflectometry known as Noise Domain Reflectometry (NDR), which makes use of existing data signals in the wiring. With this method, results show the potential to localize intermittent faults to within 3 inches in 180 ft of electrical wiring. However, caution must be taken when using these methods, as little is known about the impedance profile of inter
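The localization principle described above reduces to a round-trip calculation, sketched below. The velocity factor of 0.66 is an assumed placeholder (the real value depends on the cable), the single-frequency phase-to-delay conversion is a textbook simplification, and none of the names come from [65] or [66]; STDR and NDR obtain the delay by correlation rather than from a single tone.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def delay_from_phase_s(phase_shift_rad: float, frequency_hz: float) -> float:
    """Delay implied by a phase shift at one frequency (unambiguous within one period)."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    return phase_shift_rad / (2.0 * math.pi * frequency_hz)


def fault_distance_m(round_trip_delay_s: float, velocity_factor: float = 0.66) -> float:
    """Distance to the impedance discontinuity from the round-trip delay.

    velocity_factor is the assumed ratio of signal speed in the cable to the
    speed of light; 0.66 is only an illustrative default.
    """
    if round_trip_delay_s < 0:
        raise ValueError("round_trip_delay_s must be non-negative")
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in (0, 1]")
    # The signal travels to the fault and back, hence the division by two.
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_delay_s / 2.0
```

Under these assumptions a 100 ns round trip corresponds to a fault roughly 9.9 m down the line, which also shows why the quoted resolutions demand timing accuracy well below a nanosecond.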
55. system, in: Industrial Informatics, 2006 IEEE International Conference on, IEEE, 2006, pp. 224-229; W. He, C. Xu, Y. Ao, X. Xiao, E. W. Lee, E. L. Tan, RFID enabled handheld solution for aerospace MRO operations track and trace, in: Emerging Technologies & Factory Automation (ETFA), 2011 IEEE 16th Conference on, IEEE, 2011, pp. 1-8; A. Narsing, RFID and supply chain management: an assessment of its economic, technical and productive viability in global operations, Journal of Applied Business Research (JABR) 21 (2) (2011) 1-6; M. O'Connor, Boeing wants Dreamliner parts tagged, RFID Journal, 2005; [98] M. Roberti, Boeing, Airbus team on standards, RFID Journal, 2004
56. t reflects new knowledge and gained experiences. The importance of continuous improvement is also emphasized by related standards such as IEC 60300-3-14 [53] and EN 50126 [27] or IEC 62278 [52]. It should be highlighted that FMEA analysis directly contributes to the development of effective maintenance procedures (e.g. RCM and MSG-3 in the aircraft industry incorporate FMEA as the primary component of analysis), as well as the identification of troubleshooting activities, maintenance manual development and the design of effective built-in test requirements. When the equipment enters service, the Practical World imposes itself, as shown in Fig. 1: some faults that were anticipated will actually happen, but some never do. When a fraction of the theoretically possible failure modes occur, the weaknesses in a piece of equipment will become evident during operation. It can then be extrapolated that equipment which fails on one aircraft is more likely to fail on other aircraft of the same design operated in similar conditions. But most importantly, many real-world faults are not anticipated by the design engineers, and therefore the traditional diagnostic systems do not resolve them. In those cases human ingenuity may resolve the problem, but where does that knowledge reside after its creation? Some of the knowledge can make its way back into troubleshooting manual updates [36, 59], and some may be fed back to engineering to
57. t occurrences. 5. NFF-specific maintenance cost models for design justification and NFF tracking. 6. Modeling of complex interactions between system and components, and their physics of failure. 7. Modeling of intermittent failures from a fundamental perspective, including standardized testing equipment and procedures. 7. Acknowledgements. This research was partially supported by the Engineering and Physical Sciences Research Council (EPSRC), Ministry of Defence, BAE Systems, Bombardier Transportation and Rolls-Royce. The authors would like to express their thanks to Casebank Technologies Inc., Copernicus Technology Ltd., FlyBe UK and the RAF for sharing their experience with NFF problems. References. [1] J. Chen, C. Roberts, P. Weston, Fault detection and diagnosis for railway track circuits using neuro-fuzzy systems, Control Engineering Practice 16 (5) (2008) 585-596; C. Hockley, P. Phillips, The impact of no fault found on through-life engineering services, Journal of Quality in Maintenance Engineering 18 (2) (2012) 141-153; J. S. Jeong, S. D. Park, Failure analysis of video processor defined as No Fault Found (NFF): reproduction in system level and advanced analysis technique in IC level, Microelectronics Reliability 49 (9) (2009) 1153-1157; M. Pecht, R. Jaai, A prognostics and health management roadmap for information and electronics-rich systems, Microelectronics Reliability 50 (3) (2010) 317-323; P. Soderholm
58. t there is a clear lack of fundamental understanding of intermittency in electronics. Also, there is clear evidence to suggest that the current technology in use for detecting and locating the source of the intermittency is inadequate. If NFF becomes worse over time despite improved management processes, then the cause is likely to be inadequate equipment for testing electrical intermittence. In this case, there needs to be a change in the way an electronic device or wiring harness is tested in order to solve the problem. The nature of the NFF needs to be understood and tracked within equipment, and if there is an intermittent NFF problem then the equipment requires NFF intermittency-capable testing equipment. Integrity Testing: Most standard maintenance procedures employ only functional testing, which determines if the equipment is within appropriate tolerances for service. They do not capture the level of damage or degradation within the equipment, information which could be vital for predicting the probability of intermittency or other failure modes. Integrity testing should be incorporated into the maintenance process, and data management techniques should then be developed to provide a diagnostic history and prognostic capability. It is proposed that currently available testing methods should be assessed and developed to provide this integrity assessment capability. 3. Maintenance Manuals: The current
59. tical implementation of BICS for safety-critical applications, in: Defect Based Testing, 2000. Proceedings. 2000 IEEE International Workshop on, IEEE, 2000, pp. 51-56; A. Bhatia, J. P. Hofmeister, J. Judkins, D. Goodman, Advanced testing and prognostics of ball grid array components with a stand-alone monitor IC, Instrumentation & Measurement Magazine, IEEE 13 (4) (2010) 42-47; D. Kwon, Detection of interconnect failure precursors using RF impedance analysis, PhD Thesis, University of Maryland, 2010; B. Steadman, F. Berghout, N. Olsen, B. Sorensen, Intermittent fault detection and isolation system, in: AUTOTESTCON, 2008 IEEE, IEEE, 2008, pp. 37-40; B. Sorensen, Apparatus for testing multiple conductor wiring and terminations for electronic systems, U.S. Patent No. 8,103,475, 2012; O. Muja, D. Lamper, Automated fault isolation of intermittent wiring conductive path systems inside weapons replaceable assemblies, SAE International Journal of Aerospace 5 (2) (2012) 579-589; P. Smith, P. Kuhn, C. Furse, Intermittent fault location on live electrical wiring systems, SAE International Journal of Aerospace 1 (1) (2009) 1101-1106; A W 672, Guidelines for the Reduction of No Fault Found (NFF), ARINC, 2008; D Rosenth
60. tion is restricted to surface inspection of visible solder joints. Consequently, leads and ball grid arrays cannot be inspected by optical means. More sophisticated features concerning the solder volume, fillet, voids and solder thickness can reliably be determined only by X-ray transmission. Therefore, X-ray inspection generally achieves a better test performance in terms of false alarm rate and escape rate, and it is to be favored for closed-loop process control [89]. The use of infrared imaging for non-destructive evaluation of electrical component integrity is a well-known practice [90]. The basic principle of using infrared imaging as an integrity test is that faulty connections and components in an energized, operating circuit will begin to heat up before they fail. A thermoscope scans the devices in the circuit from one end to another; the hotter the target, the more energy it will emit in the infrared portion of the electromagnetic spectrum. For many electrical components, such as resistors and capacitors, the build-up of heat will be entirely normal, but for many components the build-up of heat, or even a lack of heat, will indicate a problem. 4.1.1. Environmental Testing. The environmental conditions of a product or system can also be analyzed to assess its on-going health and to provide an advance warning of failure [54, 91]. Products often behave differently during varying operational conditions
61. tive, Journal of Quality in Maintenance Engineering 5 (4) (1999) 335-348; R. Granstrom, P. Soderholm, Condition monitoring of railway wheels and no fault found problems, International Journal of COMADEM 12 (2) (2009) 46-53; S. Henning, R. Paasch, Designing mechanical systems for optimum diagnosability, Research in Engineering Design 21 (2) (2010) 113-122; P. Phillips, D. Diston, A knowledge driven approach to aerospace condition monitoring, Knowledge-Based Systems 24 (6) (2011) 915-927; F. S. Nowlan, H. F. Heap, Reliability-centered maintenance, United Air Lines Inc., San Francisco, CA, 1978; J. Moubray, Reliability-centered maintenance, Industrial Press Inc., 2001; P. D'Eon, Reducing NFFs through knowledge sharing, in: 1st Annual Symposium on Tackling No Fault Found in Maintenance Engineering, 2013; M. Pecht, Prognostics and health monitoring of electronics, John Wiley & Sons, 2008; M. S. G. T. Force, Maintenance Program Development Document MSG-3, Washington DC, Air Transport Association (ATA) of America, 1993; A. Ahmadi, P. Soderholm, U. Kumar, On aircraft scheduled maintenance program development, Journal of Quality in Maintenance Engineering 16 (3) (2010) 229-255; G. Huby, J. Cockram, The system integrity approach to reducing the cost impact of no fault found and intermittent faults, in: UK RAeS Airworthiness and Maintenance Conference, 2010; S. Kumar, N. M. Vichare, E. Dolev, M. Pecht, A health indicator method for deg
62. ts of the system that determine the NFF failure rate; therefore, a direct mitigating action during the design phase is likely to be more difficult. It can be argued that any product removal that does not exhibit a failure during subsequent acceptance test can be tagged as NFF. Also, for a number of these events, further investigation could conclude that the reason for the removal event was categorically caused by an external effect. Nonetheless, this would still be classified as an NFF event, as these external influences might be faulty sensors or actuators, or possibly an incorrect fault isolation activity. In any case, as the device fabrication process continues to improve, failure rates of hardware components have steadily declined over the years to the point where non-hardware failures have emerged as a dominant issue [9], whereas the reduction of troubleshooting complexities and time to fix a problem seem to be the most important aspects when investigating failures of electronic systems. In addition to the a priori discussions from Part 1, this paper focuses on the following: No Fault Found Occurrences in Systems; Emerging Resolution Practices; Improvements in Test Abilities; Discussion on Gaps in Literature; Future Research Directions. Footnote 1: Although there are specific approaches, such as robust design [8], that can be used to design quality into products and processes by minimizing the effects of the causes of va
63. unctions in the way they interact and the lack of applying good practice principles. In many cases, desired sources of information are not readily available, are incorrectly configured to support rapid diagnostics, or lack sufficient depth of information and practicality. Additional factors include the failure to complete or store documentation and the lack of robust diagnostic fault trees connecting event system faults [5]. The result is that a unit is replaced without determining the nature of the fault, risking its recurrence and causing an NFF event. The complexity brought by embedded software and electronics poses unprecedented challenges in maintenance and repair, threatening customer satisfaction and causing increasing warranty cost on repair [32, 33]. 3. Emerging Resolution Practices. From a technical standpoint, an NFF-tagged component is the result of an unsuccessful or inefficient troubleshooting regime of an unplanned maintenance event. Several maintenance strategies are usually sought to improve upon this problem within organizations: 1. Reliability: If all components were 100% reliable, i.e. they never resulted in a system failure, then there would be no unplanned maintenance activities. Design engineers often engage in reliability improvements based largely on feedback from equipment in service. However, to the extent that engineers anticipate failures, designers will incorporate fault detection syst
64. veloping methodologies and damage assessment algorithms are generally aimed at creating an in-situ load monitoring and prognostic capability. This is explored by Vichare et al. (2007) [54], who provide the necessary considerations for raw data processing during in-situ monitoring and methods to reduce memory requirements and power consumption. These are key factors that often limit the integration of health monitoring systems, particularly into aircraft. Skormin et al. (2002) [55] developed failure prognostics for aircraft avionics using data mining models with measured parameters which included vibration, temperature, power supply, functional overload and air pressure. These parameters were measured in situ using time stress measurement devices. The purpose of the model included understanding how measured environmental factors impact upon a particular failure, investigating the role of combined parameter effects, and re-evaluating the probability of failure based on the known exposure to adverse conditions. 3.1.2. Knowledge Sharing. Engineers have recently emphasized that there is a need for on-field experience to be shared within a troubleshooting workflow repository [21]. Aspects of content sharing, such as e-maintenance [56], can be beneficial for other maintenance personnel, who will then be able to identify the cause of a problem on their first attempt, whenever or wherever it next occurs. Furthermore, the captured k
65. verwhelming majority of occurrences can be traced back to poor manufacturing (i.e. soldering and Printed Circuit Board (PCB) assembly) and inherent design flaws, which include violation against specifications. Vichare and Pecht (2006) [10], Qi et al. (2008) [14] and Moffat (2008) [15] have summarized some generic causes of failures within electronic systems: 1. Interconnect failures, including connectors. 2. System design (electrical and mechanical). 3. Environmental conditions (temperature, moisture, chemicals, mechanical stresses). 4. Operator handling (ergonomics, training). 5. Printed Circuit Boards (PCB). 6. Ageing components and connectors. 7. Loose PCB interconnectors. 8. Disconnected solder points. 9. Damaged wiring or cabling. A recent aerospace survey [16] has ranked intermittent faults as the major cause of NFF events, whereas Built-In Test Equipment (BITE) coverage and software are the least likely. This is contrary to the common belief that the majority of failures are due to incompatible or competing software routines between systems [17]. Intermittency is arguably the most problematic of the NFF events due to its elusive nature, which makes detection by standard test equipment difficult [5]. The faulty state will often lie dormant until a component is back in operational use, where it eventually causes further unit removals unless a genuine cause is found (fault isolation). It should be emphasized that these failures are not al
66. ways present during testing, which makes them troublesome to isolate. This situation can result in repeated removals of the same equipment for the same symptom, with each rejection resulting in the equipment being tagged as NFF [18]. At this stage, there is a very high probability that there will be a loss of system functionality and integrity, and an unacceptable compromise in safety requirements. What is clear is that even though these faults may begin as short-duration, low-frequency occurrences, as time passes the underlying cause will increase the severity of the intermittency until eventually a hard fault appears and the functionality of the system is compromised or lost. 2.1.1. Printed Circuit Board Interconnectors. Information published by Gibson et al. (1997) [19] claims that between 50 and 70% of all electronic device failures could be attributed to their interconnectors. Even though solder joints can fail by a variety of mechanisms, the device interface seems to be the most common cause. Over time, contaminations on the fractured surfaces initiate a failure sequence which starts with degraded joints and eventually progresses to intermittent failures. Products that depend upon the behavior of interfacing devices for correct operation are also susceptible to faults which can be categorized as intermittent. This is common in products that rely on software for their correct operation or interaction with other products
67. y remain and you are advised to consult the published version if you wish to cite from it. CURVE is the Institutional Repository for Coventry University: http://curve.coventry.ac.uk/open. Elsevier Editorial System(tm) for Reliability Engineering & System Safety, Manuscript Draft. Manuscript Number: RESS-D-13-00244R1. Title: No Fault Found Events in Maintenance Engineering Part 2: Root Causes, Technical Developments and Future Research. Article Type: Review Article. Keywords: no fault found, built-in test, troubleshooting failures, fault diagnostics, testability, maintainability, test equipment. Corresponding Author: Dr Samir Khan, PhD. Corresponding Author's Institution: Cranfield University. First Author: Samir Khan, PhD. Order of Authors: Samir Khan, PhD; Paul Phillips, PhD; Ian Jennions, PhD; Chris Hockley, MSc. Highlights (for review). No Fault Found Events in Maintenance Engineering Part 2: Root Causes, Technical Developments and Future Research. Samir Khan, Paul Phillips, Chris Hockley, Ian Jennions. EPSRC Centre, School of Applied Sciences, Cranfield University, College Road, Cranfield, Bedfordshire MK43 0AL. Cranfield Defence and Security, Cranfield University, The Mall, Shrivenham, Oxfordshire SN6 8LA. IVHM Centre, School of Applied Sciences, Cranfield University, University Way, Cranfield, Bedfordshire MK43 0FQ. Abstract: This is the second half of a two-paper series covering aspects of the NFF phenomenon wh
