BEEKeeper: Remote Management and Debugging of Large Scale
the functions that read and write a byte on the parallel port. Although the intercepting functions could create and send a packet for every byte, sending an entire packet just to carry one byte of data is wasteful. Instead, we use a lazy send, in which data that needs to be sent is put into a queue. The queue is flushed in two cases. First, if the buffer holding the stored data is full, it must be flushed to ensure that no data is lost to overflow. Second, when ChipScope requests to receive data, the queued data must also be flushed. This ensures that the data read from the cable results from the input already sent to the chip, and it is necessary because ChipScope blocks on reads: in order to service the requested read, the chip must be in the state assumed by the software. This also implies that there can be only one outstanding read at any time.

4.2 Packet Layout

The JTAG protocol uses very few bit lines to transmit information, because everything is done serially. The lines into and out of the chip are shown in Figure 3(a).

Figure 3: The 9-bit JTAG pin-out (a) and how it is packetized by the BEEKeeper system (b). Panel (a), 9-bit JTAG header: TMS, TDI, TDO, TCK, GND, VCC. Panel (b), BEEKeeper packet format: data sent from the client computer to the BEEKeeper board, and data sent from the BEEKeeper board to the client computer.

Three lines are used when sending data into the chip: TMS, TDI, and TCK. TCK clocks the input coming in on the other wires so th
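The lazy-send policy described in Section 4.1 can be modeled in a few lines. This is a minimal Python sketch, not the authors' driver code (which is WinDriver-generated C): the class name `LazyJtagSender`, the `send_packet` and `recv_bit` hooks, and the 1024-byte default capacity are all hypothetical. It flushes the queue when the buffer is full and before every read, and because a read blocks until its reply arrives, only one read can be outstanding.

```python
from typing import Callable, List


class LazyJtagSender:
    """Queue outgoing JTAG bytes and send them in as few packets as possible.

    Hypothetical model of the client driver's lazy send: the queue is
    flushed when it is full, and before every read, because ChipScope
    blocks on reads and expects all earlier writes to have reached the chip.
    """

    def __init__(self, send_packet: Callable[[bytes], None],
                 recv_bit: Callable[[], int], capacity: int = 1024) -> None:
        self.send_packet = send_packet  # hypothetical hook: transmit one TCP payload
        self.recv_bit = recv_bit        # hypothetical hook: block until TDO arrives
        self.capacity = capacity        # assumed buffer size in bytes
        self.queue: List[int] = []

    def flush(self) -> None:
        """Send everything queued so far as a single packet."""
        if self.queue:
            self.send_packet(bytes(self.queue))
            self.queue.clear()

    def write_byte(self, value: int) -> None:
        """Called where the driver used to write a byte to the parallel port."""
        if len(self.queue) >= self.capacity:
            self.flush()  # case 1: buffer full, flush so nothing overflows
        self.queue.append(value & 0xFF)

    def read_bit(self, request_byte: int) -> int:
        """Append a read request, flush (case 2), and wait for the reply.

        Because this call blocks until the reply arrives, at most one
        read can be outstanding at a time.
        """
        self.queue.append(request_byte & 0xFF)
        self.flush()
        return self.recv_bit() & 1
```

With `capacity=4`, six writes produce one full four-byte packet; a following read sends the remaining two bytes together with the read request in a single packet.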
using the parallel cable directly. This, however, is to be expected due to the additional overhead of packing up the data, transmitting it over the network, and then unpacking it again.

5.1 Testing Setup

Our testing setup uses a single BEEKeeper module connected directly to a host computer through an Ethernet crossover cable. The host computer runs RedHat Linux with Linux kernel 2.6.18. The BEEKeeper is connected directly to the target BEE2 board with a ribbon cable whose signals are visible through a logic analyzer. We collected timing data from the host computer, using the Linux kernel's timing features to measure actual elapsed time.

Figure 5: The frequency distribution of round-trip times for data requests by ChipScope for a single bit from the target. (Plot "Distribution of Data Request Round Trip Times"; x-axis: round-trip time of a data request, in ms; y-axis: frequency.)

5.2 Speed Measurements

As expected, the transmission of JTAG data over TCP/IP with our system is orders of magnitude slower than direct access with the parallel cable. The sources of this slowdown include network overhead as well as the time it takes the MicroBlaze to process the incoming data and place it on the JTAG lines. The latter depends on how the server software on the BEEKeeper is wri
2 board at the Berkeley Wireless Research Center (BWRC) has created a useful platform for many research projects. This board provides a large amount of I/O and processing power that is utilized in many multiprocessing applications. There are four Xilinx Virtex-II Pro 70 FPGAs on the board, used for processing and linked by a single JTAG chain. An additional Virtex-II Pro 70 sits on a separate JTAG chain and is primarily used to control the other four FPGAs [3].

The RAMP project uses the BEE2 boards for emulation. Currently, the prototype system uses 8 BEE2 boards, but in order to work with systems that have thousands of cores, the number of boards will need to scale greatly. The demands of this scaling will put a great strain on the current system of debugging [7].

The CASPER group develops radio astronomy tools for phased antenna arrays. Large numbers of small antennas provide a cheap alternative to building a single large antenna, but require a lot of back-end processing to combine the data from the different antennas. These tools, such as beamformers and correlators, scale in size with the number of antennas in the array. The difficulties of scaling that are experienced in RAMP also arise in developing CASPER instruments. Further problems arise when these instruments are deployed on site: when something goes wrong on a board and cannot be reproduced on a lab bench, someone must travel to the antenna to de
BEEKeeper: Remote Management and Debugging of Large Scale FPGA Arrays

Terry Filiba, Navtej Sadhal
May 14, 2007

Abstract

We propose a solution to the problem of managing and debugging the large array of Berkeley Emulation Engine 2 (BEE2) FPGA boards that are part of the Research Accelerator for Multiple Processors (RAMP) project. Currently, communicating with individual FPGAs on a specific board in the cluster, for programming or on-chip debugging purposes, requires physical access to the device and the connection of a specialized communication cable to a host machine. We have designed and implemented a solution using a soft core on a small FPGA, which connects directly to a BEE2 board in place of the host computer. The host computer can then connect to this small unit, the BEEKeeper, over standard TCP/IP and Ethernet. This allows the host computer to manage many BEE2 boards simultaneously without physical access, as well as to aggregate data from many boards.

1 Introduction

The JTAG protocol has long been a valuable tool for chip developers and programmers. For board-level debugging, JTAG chains provide a convenient way to connect to a small number of chips, but scaling past this level is hindered by the serial nature of the protocol. The Research Accelerator for Multiple Processors (RAMP) project leverages the low cost of field-programmable gate arrays (FPGAs) to build large but cheap systems. The pro
Driver USB v9.00 User's Manual, 2007.
[3] Chen Chang, John Wawrzynek, and Robert W. Brodersen. BEE2: A high-end reconfigurable computing system. IEEE Design & Test, 22(2), 2005.
[4] Robert J. Fowler, Thomas J. LeBlanc, and John M. Mellor-Crummey. An integrated approach to parallel program debugging and performance analysis on large-scale multiprocessors. ACM SIGPLAN Notices, 24(1), 1989.
[5] Aaron Parsons, Donald Backer, Chen Chang, Daniel Chapman, Henry Chen, Patrick Crescini, Christina de Jesus, Chris Dick, Pierre Droz, David MacMahon, Kirsten Meder, Jeff Mock, Vinayak Nagpal, Borivoje Nikolic, Arash Parsa, Brian Richards, Andrew Siemion, John Wawrzynek, Dan Werthimer, and Melvyn Wright. Petaop/second FPGA signal processing for SETI and radio astronomy. Proceedings of the Asilomar Conference on Signals, Systems and Computers, 2006.
[6] Brent Przybus. Un-tethered debugging. Technical report, Xilinx Inc., 2005.
[7] John Wawrzynek, Mark Oskin, Christoforos Kozyrakis, Derek Chiou, David A. Patterson, Shih-Lien Lu, James C. Hoe, and Krste Asanovic. RAMP: A research accelerator for multiple processors. Technical report, University of California at Berkeley, 2006.
Driver interface to the Ethernet Port, are added to the system.

puter running ChipScope is the client and the BEEKeeper board is the server. The client is in control in this design and must initiate all communication. The BEEKeeper waits in a loop until the computer initiates communication. The client then either sends or requests data until it is finished, and closes the connection.

4.1 Client Design

The modifications on the client are all at the driver level. Because ChipScope is closed source, we could only intercept the data being sent by ChipScope through the driver. The driver's source is available and has been modified to remove the existing parallel port interface. The driver provides an interface to ChipScope that allows it to read and write data byte by byte. In the original driver, the data is put on the parallel port using functions that immediately write to the hardware; we have altered the hardware interface of the driver to send data over an Ethernet port instead of a parallel port. Since streams of data need to be communicated through the chip, a lossy channel is not appropriate, so the communication is done over TCP/IP to ensure lossless delivery.

The client currently has a software-programmed IP address. This aspect is what provides the scalability: as long as the servers are connected to the Internet, the client can connect to any of them by selecting the correct IP address. We have intercepted
ating a new thread to deal with the failing chip exclusively, the rest of the programming is free to proceed unhindered. The new thread can retry the operation and attempt to continue normal operation. Then, if the problem is unrecoverable, the system can report a list of the chips that failed.

This system could also be used to monitor data running on the FPGAs. By requesting the same data from each chip, the same method used to program multiple FPGAs can be used to send the data requests over JTAG. Errors can be dealt with in the same way as described for programming. In this case, the output from the chips also needs to be logged. It can be recorded and viewed either by focusing on a single chip or by viewing the data from multiple chips that was generated at the same time.

7 Future Work

This work can be further explored in a number of ways. Further benchmarking would be useful, as well as some updates to improve the system. It would also be useful to find ways to reduce the overhead of packetizing and processing the data, either by …

Section 5 gives results from timing tests done in a lab to see how much the TCP/IP overhead slows down JTAG. These tests do not account for network effects such as dropped packets or … Additional tests would be useful to get an idea of how a system like this could be used across longer distances and on lossy networks.

The BEEKeeper board was chosen because it is cheap and small. However, the
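The grouping idea from Section 6, broadcasting one data stream to many JTAG chains and splitting a failing chip off onto its own thread, could look roughly like the following. It is a hypothetical Python sketch, not part of the implemented system: `program_group`, the `send(address, chunk)` hook (assumed to return False on error), and the retry count are all assumptions, and a real client would drive one TCP connection per BEEKeeper.

```python
import threading
from typing import Callable, List


def program_group(addresses: List[str], chunks: List[bytes],
                  send: Callable[[str, bytes], bool], retries: int = 2) -> List[str]:
    """Send the same chunks to every chain; peel off chips that fail.

    `send` is a hypothetical hook that returns False when a chip errors.
    Returns the sorted list of addresses that could not be recovered.
    """
    failed: List[str] = []
    lock = threading.Lock()
    workers: List[threading.Thread] = []

    def recover(addr: str, start: int) -> None:
        # Dedicated thread: retry the failing chunk, then finish the stream alone.
        for i in range(start, len(chunks)):
            if not any(send(addr, chunks[i]) for _ in range(retries)):
                with lock:
                    failed.append(addr)
                return

    active = list(addresses)
    for i, chunk in enumerate(chunks):
        still_ok = []
        for addr in active:
            if send(addr, chunk):
                still_ok.append(addr)  # chip stays in lockstep with the group
            else:
                t = threading.Thread(target=recover, args=(addr, i))
                t.start()
                workers.append(t)
        active = still_ok  # the group continues without the failing chips
    for t in workers:
        t.join()
    return sorted(failed)
```

The chips that stay healthy never wait on a retry, which matches the text's goal of letting the rest of the programming proceed unhindered.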
ay. Rather than being limited by the number of physical parallel ports, a computer now has access to any board in the system. Currently, the end-to-end system appears the same as before: ChipScope connects to a single JTAG chain to program or debug it. Unfortunately, we do not have access to the ChipScope code to modify the end-to-end interface.

By rethinking the interface of the tools, we can improve debugging for large systems. Since the tool will communicate over TCP/IP, there is no longer a need for a kernel-level driver; client software is sufficient to send data over the Ethernet port. We propose an interface for debugging multiple boards and some useful applications for this design.

The system should allow the user to design not only at the chip level but at the system level. Instead of specifying the IP address of a single JTAG chain, the user could add multiple chains to communicate with different chips simultaneously. It should also be possible to group FPGAs together based on what they do. In many applications, the same programming is put on many FPGAs. This could be done by opening connections to many addresses rather than just one; the data that is normally transmitted to a single FPGA is then transmitted to a group of addresses. This works as long as no errors occur, since all of the chips should stay in the same state. When errors begin to occur, one data set is no longer appropriate for all the chips. By cre
bug the problem [5].

3 Related Work

ChipScope Pro provides a platform for debugging and programming Xilinx FPGAs over JTAG. It provides some remote connection capability, but this capability does not scale well. A client computer can connect to a server in the lab that is also running ChipScope, and that server must be connected via a cable to the board. Since the number of boards that can be connected to a single server does not scale up well, this does not provide a sufficient solution for RAMP or CASPER [6].

The architects of the RAMP project have explored various debugging strategies with respect to processor interaction and logging. Some of this functionality is planned in the RAMP design framework, but much of it depends on the system avoiding total failure [7]. As we attempt to give the designer complete access to on-chip signals and programming, our addition to the RAMP debugging framework should provide additional power in such scenarios.

There have also been other efforts at managed debugging of large-scale systems from a software standpoint. Notably, Fowler, LeBlanc, and Mellor-Crummey of the University of Rochester propose an integrated system for debugging parallel programs running on shared-memory multiprocessors [4]. They explore a methodology for analyzing parallel programs and then develop a framework for debugging these programs on an

Figure 1: Initial debugging ar
chitecture. The client computer is connected via a parallel cable to the BEE2. Components in purple (the parallel-to-JTAG adapter, the parallel port, and the portion of WinDriver that interfaces with the parallel port) will need to be removed in order to improve scalability and remote connection capability.

SMP machine. This includes monitoring each processor and keeping replay data and execution histories to be made available to an engineer at a single workstation. There are notable developments in the user interface, including scripting capabilities. While hardware and software debugging differ in many respects, the system developed by Fowler et al. provides welcome inspiration for the problem of debugging large FPGA arrays.

4 System Architecture

The current method of debugging or programming a BEE2 via ChipScope is shown in Figure 1. The client computer runs ChipScope, which provides a graphical interface to the user. ChipScope communicates with a kernel driver to send data over a parallel cable connected to the computer. The kernel driver is produced by a tool, WinDriver, that automatically generates source code and a makefile [2]. The parallel cable is connected to a parallel-to-JTAG adapter. This is a simple component that just rearranges the wires from the parallel standard to the JTAG header. There is no software in this part, and it is only necessary because the computer is c
e chip can determine when it is valid. TMS sets the test mode and TDI contains the test data. The only output from the chip is TDO, whose validity is also determined by TCK.

The packets constructed by the client need to contain the JTAG information: TCK, TDI, and TMS. They also need a way to distinguish whether the client is sending data or requesting to receive data. In a single byte, the JTAG-specific data is arranged in the same order as in the JTAG header (see Figure 3). Where the TDO bit would normally be, there is a request-data bit. If this bit is high, the other data in the byte should be ignored and the server should read data from the JTAG port. If the request-data bit is low, the data should be sent to the chip. This method allows read and write requests to be intermixed in a single packet, which reduces the number of packets the client must send.

Figure 4: Inside the BEEKeeper board (diagram labels: TCP/IP Packets, JTAG Data).

Packets sent by the server only need to contain TDO. As Figure 3(b) shows, the single TDO bit is padded out to eight bits. While this may seem inefficient and an obvious point of optimization, it turns out to be insignificant. Because a request to get data must be serviced before it returns, there can be only one request outstanding at a time, as described in Section 4.1. This means that a packet can only contain one bit of TDO. The overhead of using a single packet to send one bi
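The byte format of Section 4.2 can be sketched as a pair of encode/decode helpers. The bit positions below (`TMS_BIT`, `TDI_BIT`, `REQ_BIT`, `TCK_BIT`) are hypothetical placeholders, since the actual order follows the JTAG header in Figure 3, which is not reproduced here; the server reply is likewise assumed to carry TDO in bit 0.

```python
from typing import Optional, Tuple

# Hypothetical bit positions within one packet byte. The real layout mirrors
# the JTAG header order shown in Figure 3.
TMS_BIT = 0
TDI_BIT = 1
REQ_BIT = 2  # request-data bit, in the slot TDO occupies in the header
TCK_BIT = 3


def encode_write(tck: int, tdi: int, tms: int) -> int:
    """Byte that drives TCK, TDI and TMS; the request-data bit stays low."""
    return ((tck & 1) << TCK_BIT) | ((tdi & 1) << TDI_BIT) | ((tms & 1) << TMS_BIT)


def encode_read_request() -> int:
    """Byte with the request-data bit high; the server ignores the other bits."""
    return 1 << REQ_BIT


def decode(byte: int) -> Tuple[str, Optional[Tuple[int, int, int]]]:
    """Server-side interpretation: ('read', None) or ('write', (tck, tdi, tms))."""
    if (byte >> REQ_BIT) & 1:
        return ("read", None)
    return ("write", ((byte >> TCK_BIT) & 1, (byte >> TDI_BIT) & 1, (byte >> TMS_BIT) & 1))


def encode_tdo_reply(tdo: int) -> bytes:
    """Server reply: a single TDO bit padded out to a full byte (assumed bit 0)."""
    return bytes([tdo & 1])
```

Because reads and writes share one byte format, a packet such as `bytes([encode_write(0, 1, 0), encode_write(1, 1, 0), encode_read_request()])` intermixes two clock phases and a read request.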
eceive bulk data in a different format and then generate the JTAG signals locally. Such a system would require an understanding of the JTAG protocol as well as the development of a communications interface that supports higher-level communication. In some respects, this might work like the previously described remote ChipScope capability that already exists. However, replacing the computer connected directly to the board with an embedded system significantly improves scalability and packaging by allowing that system to be integrated entirely on the board. While this might drive up costs, it is worth exploring in an effort to increase the power, flexibility, and speed of the debugging system.

8 Conclusion

We have presented and evaluated a remote and scalable system for programming and debugging BEE2 boards. This is achieved by modifying the communication between ChipScope and the JTAG interface to the board. Because ChipScope is closed source, we intercept the data bound for the parallel port at the driver level and reroute it to the computer's Ethernet port. Using the TCP/IP standard allows the data to be transmitted through the Internet and arrive at our intermediate hardware, the BEEKeeper. The BEEKeeper is a small board that receives the packets and processes them into JTAG; it essentially interfaces one of the BEE2 JTAG chains to the network. We did find additional latency, as expected,
from migrating from a nearly lossless channel (the parallel port) to TCP/IP over Ethernet in our testing. The fact that we could not modify ChipScope also created additional inefficiencies. Any reads from the JTAG cable requested by ChipScope have to be serviced immediately; because of this, we must use an entire packet just to send one bit. We believe that this reduction in speed, while significant, will have little effect on the debugging efficiency of an engineer because of the small amount of data that is actually communicated.

We believe that by modifying the user interface for connecting to the chip, improvements in both debugging capability and communication latency will result. By queueing up multiple receive requests in the same packet, a single packet coming from the BEEKeeper board could be used to service all of the read requests. Aside from this, we believe there is plenty of room for future work in improving this communication link and reworking the user interface and debugging software. We have laid the groundwork for future innovations in working with the large systems made necessary by projects like RAMP and CASPER. As large systems build momentum, methods for debugging large FPGA arrays and other massively parallel devices should mature far beyond what we have discussed here.

References

[1] Spartan-3 mini-module user guide. Technical report, Memec, 2005.
[2] Win
ith the Ethernet port and a server that processes the data sent from the client. The network driver implements the TCP/IP standard, similar to the standard functions that UNIX networking drivers provide. The driver provides the framework to establish a TCP connection with the client and to send and receive packets.

The server software waits until a client makes a connection to it and begins sending data packets. It loops until it finds valid data to process. The software must then determine how to use the JTAG cable: it takes the data from the packets 8 bits at a time and determines whether it should send or receive data. This is determined by the request-data bit, as outlined in Section 4.2.

One important feature of the server design is portability. The WinDriver interface with ChipScope is built into the client, so the client is only useful for a board that can use ChipScope as its JTAG interface. Unlike the client software, the BEEKeeper server relies only on the packet format. Although different boards use different numbers of pins for JTAG, the definition of the JTAG header sent by the BEEKeeper board can be configured with a packet as well.

5 Results

We have implemented our system as described, with additional testing and data-gathering portions to measure its performance and usability. The obvious result from running numerous tests is that data transmission using the BEEKeeper is much slower than when
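The server's byte-by-byte dispatch can be modeled as below. This is a Python sketch of the logic only, not the MicroBlaze server code: it assumes a hypothetical request-data bit position (`REQ_BIT = 2`), and `FakeJtagPort` and `service_packet` are illustrative names, with `FakeJtagPort` standing in for the FPGA I/O pins.

```python
from typing import Callable, List

REQ_BIT = 2  # hypothetical position of the request-data bit


class FakeJtagPort:
    """Stand-in for the FPGA I/O pins: records driven bytes, returns a fixed TDO."""

    def __init__(self, tdo: int = 0) -> None:
        self.driven: List[int] = []
        self.tdo = tdo

    def drive(self, byte: int) -> None:
        self.driven.append(byte)  # real hardware would set TCK/TDI/TMS here

    def sample_tdo(self) -> int:
        return self.tdo


def service_packet(payload: bytes, port: FakeJtagPort,
                   reply: Callable[[bytes], None]) -> None:
    """Handle one received TCP payload, 8 bits at a time."""
    for byte in payload:
        if (byte >> REQ_BIT) & 1:
            # Read request: sample TDO and return it padded to a whole byte.
            reply(bytes([port.sample_tdo() & 1]))
        else:
            # Write: place the JTAG bits on the header pins.
            port.drive(byte)
```

Because the server only interprets the packet format, swapping `FakeJtagPort` for a different pin mapping is the only change needed for another board, which mirrors the portability argument above.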
ject provides a multi-FPGA platform for the emulation of multicore or multiprocessor systems. Large systems such as RAMP struggle with the scalability of JTAG: in an 8- or 16-board system, switching between the boards being accessed via JTAG requires manual connection of the target board to the server running the debug software. The Center for Astronomy Signal Processing and Electronics Research (CASPER) also builds large-scale processors out of many FPGA boards. These processors need to be deployed at antennas, but once deployed they cannot easily be debugged remotely [7].

We propose a system called BEEKeeper that will provide remote and scalable JTAG capabilities. Augmenting the current communication system to use Ethernet rather than parallel connections will improve both scalability and accessibility.

In Section 2 we describe the motivation for designing this system and the projects that can make use of the tool. Section 3 describes other tools for remotely debugging FPGAs and for debugging large systems in general. Sections 4 and 5 explain the design of the system and how much latency it introduces. Section 6 proposes a new interface that makes use of the capability to connect simultaneously to multiple boards. Section 7 outlines future plans to improve the analysis and design of the system. Section 8 describes what we have learned from building this system.

2 Background

The design of the Berkeley Emulation Engine 2 BEE
ommunicating over a parallel cable. Finally, the JTAG cable is connected to the BEE2 board.

A typical machine has only a few parallel ports. To use the remote debugging tools provided by ChipScope, there would need to be a server for every few boards. In order to provide scalability, the parallel cable will be removed and replaced with an Ethernet cable. Referring to Figure 1, the components in purple must be removed. Removing the parallel cable makes the parallel-to-JTAG converter unnecessary. Then, since ChipScope can no longer communicate over the parallel port on the computer, part of the driver must be modified. The interface from ChipScope to the driver remains the same, but instead of sending data over the parallel port, the driver packetizes the data and sends it using TCP/IP over Ethernet.

Figure 2 shows how the BEEKeeper system is designed. WinDriver is modified to interface with an Ethernet port rather than a parallel port. The Ethernet cable can then be connected to a router and send data over the Internet to the BEEKeeper board. The BEEKeeper board depacketizes the data and sends it out over a JTAG header. This is a client-server model in which the com

Figure 2: BEEKeeper debugging architecture. The parallel connection is replaced with an Ethernet cable and a small board to depacketize the data and translate it into JTAG. Components in yellow: BEEKeeper Board, Internet, Ethernet Port, and Win
re isn't a need for the board itself. We could integrate the necessary parts of the BEEKeeper onto the board that is being debugged. This would only require an Ethernet port, an FPGA, and a small memory to store the programming. Then, rather than having pins coming out of the board, the connection between the BEEKeeper hardware and the FPGA can be wired on the board. Integrating this hardware onto the board will slightly increase the cost of the board but will ease debugging.

Additionally, we would like to implement the debugging interface described in Section 6. This would give the user better control of the system as a whole, as explained there, and could also allow for optimization. Currently, the system has to use a whole packet for a single bit of data coming from the BEE2 board. If the debugging interface were open source, some of this could be alleviated: since the program should know how much it wants to read from the board, it can send a request for multiple reads in a single packet, or interleaved reads and writes. Then, when the BEEKeeper board sends a packet back, it can contain as much data as the program requested.

Finally, we must consider that our experiments have demonstrated that the serial nature of JTAG and its chattiness make it somewhat unsuitable for network transmission. Given this, it may be desirable to develop a more advanced device than our BEEKeeper, which will actually r
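If the debugging software could batch its reads, the BEEKeeper's reply could carry many TDO bits at once instead of one padded byte per packet. The following is a minimal sketch of such a reply format, assuming LSB-first packing of eight bits per byte; the format and the names `pack_bits` and `unpack_bits` are hypothetical, not part of the implemented system.

```python
from typing import List


def pack_bits(bits: List[int]) -> bytes:
    """Pack TDO bits eight to a byte, least significant bit first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit & 1:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bits(data: bytes, count: int) -> List[int]:
    """Recover the first `count` TDO bits from a packed reply."""
    return [(data[i // 8] >> (i % 8)) & 1 for i in range(count)]


# Nine batched reads fit in a 2-byte reply carried by one packet,
# instead of nine single-bit packets in the current design.
tdo = [1, 0, 1, 1, 0, 0, 0, 0, 1]
reply = pack_bits(tdo)
assert unpack_bits(reply, len(tdo)) == tdo
print(reply.hex())  # prints 0d01
```

The client must remember how many reads it requested so it can pass `count` to `unpack_bits`, since the last byte may contain padding bits.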
t the parallel cable system is always sending the TDO data on a dedicated wire, so it is always there for the client to read. This slowdown can be expressed by comparing the round-trip time of a data request in our system to the execution time of a read-byte operation on the parallel port. Figure 5 shows the distribution of round-trip times seen during JTAG communication. The average round-trip time to get a single bit of data from the TDO line is 1.3 ms, and over 90% of round trips are below 1.5 ms. This time results from all of the previously described sources of delay, aside from packet loss due to network congestion. When reading from the parallel port, on the other hand, the data has already reached the local machine, so it takes only 1.8 µs to read a bit and return it to the software. This discrepancy is expected because we are effectively using an entire TCP/IP packet to send one bit of data.

Despite this significant decrease in speed, the actual effect on debugging interaction, while noticeable, should not impact debugging productivity except in the most extreme cases. Given that Xilinx Virtex-II bit files are on the order of 1 MB, transmitting such a file at 167 kbps would take roughly one minute, rather than being nearly instantaneous as with the parallel cable.

6 Proposed Debugging Interface

The system we have implemented provides a way to use the existing ChipScope software in a novel w
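The one-minute estimate and the latency gap can be checked with a short calculation. This sketch assumes a decimal megabyte (10^6 bytes) and that the link runs continuously at the measured average rate; with a binary megabyte (2^20 bytes) the transfer time rises to about 50 s.

```python
def transfer_seconds(size_bytes: float, rate_bps: float) -> float:
    """Time to clock size_bytes through a serial link running at rate_bps."""
    return size_bytes * 8 / rate_bps


BITFILE_BYTES = 1_000_000  # "on the order of 1 MB"; decimal megabyte assumed

over_beekeeper = transfer_seconds(BITFILE_BYTES, 167_000)    # measured 167 kbps
over_parallel = transfer_seconds(BITFILE_BYTES, 5_000_000)   # 5 Mbps JTAG clock
latency_ratio = 1.3e-3 / 1.8e-6  # mean round trip vs. local parallel-port read

print(f"BEEKeeper: {over_beekeeper:.0f} s, parallel cable: {over_parallel:.1f} s")
print(f"per-bit read latency ratio: {latency_ratio:.0f}x")
```

About 48 s is consistent with "roughly one minute", and the same arithmetic gives about 1.6 s at the 5 Mbps parallel-cable rate. The last line shows that a single-bit read over the network takes on the order of 700 times longer than a local read.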
t far outweighs the overhead of the 7 extra bits used as padding in the data. The TCP/IP header sent along with the single bit is a much more significant amount of overhead and, as described in Section 7, is a better focus for optimization.

4.3 Server Design

The server consists of an Avnet Xilinx Spartan-3 Mini Module board, which we refer to as the BEEKeeper board. This board provides the I/O necessary to connect an Ethernet cable and a JTAG cable: it has an Ethernet port and 76 I/O pins connected directly to the chip, far more than are needed for a JTAG header. The board has a flash memory as well as a configurable Spartan-3 3S400 FPGA [1].

For development purposes, this board has been mounted on an Avnet Mini Module Base board. The base board uses the I/O pins on the Mini Module to provide standard forms of I/O that are useful for development: an LCD, many LEDs and switches, RS-232 and USB connections, and JTAG to program the Spartan-3. This allows us to monitor the I/O moving between the Ethernet connection and the JTAG, and to debug the BEEKeeper system. The released version of the BEEKeeper would not include this base board.

Figure 4 shows how the BEEKeeper is programmed. A MicroBlaze soft processor core is put onto the Spartan-3. This allows us to access the I/O channels and program the server in software. The software implementation has two components: a driver to communicate w
tten and results in a hard limit on the top speed at which we can transmit data. When using the parallel cable directly, the client computer sets the JTAG clock rate to 5 MHz, meaning the serial communication occurs at 5 Mbps. The ChipScope software uses timers to maintain this rate and thus not violate the setup or hold times of the device being accessed. In contrast, when our server software is running on the MicroBlaze processor and transmitting data as fast as possible, the JTAG clock speed is reduced to 2.18 MHz, as measured by a logic analyzer at the connection between the BEEKeeper and the BEE2. This slowdown is due only to the time it takes the processor to unload data from its buffer, examine it for read requests, and send it out on the data line; it does not take into account any network delays that might slow down data even further.

Examining the actual flow of data through the whole system, we found that our clock rate was further reduced to an average of 167 kHz. This means that communication over our system is about 30 times slower than it had been over the parallel cable. This slowdown can be attributed to the network overhead and the lack of compression in our data stream. An additional source of delay is the fact that only one data request by the client computer can be outstanding at any time. That is, every time the client wants to read the TDO line, it must actually request the data from the server. In contras
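The measured clock rates can be decomposed to show where the 30-fold slowdown arises. This is plain arithmetic on the three figures reported in this section; attributing the second factor to network overhead and the one-outstanding-request limit follows the text's explanation and is not separately measured.

```python
PARALLEL_HZ = 5_000_000    # JTAG clock set by ChipScope over the parallel cable
MICROBLAZE_HZ = 2_180_000  # measured at the BEEKeeper-to-BEE2 connection
END_TO_END_HZ = 167_000    # average rate through the whole system


def slowdown(fast_hz: float, slow_hz: float) -> float:
    """How many times slower the second clock rate is than the first."""
    return fast_hz / slow_hz


server_only = slowdown(PARALLEL_HZ, MICROBLAZE_HZ)      # MicroBlaze processing
network_and_requests = slowdown(MICROBLAZE_HZ, END_TO_END_HZ)
total = slowdown(PARALLEL_HZ, END_TO_END_HZ)

# The two stages multiply to the overall factor.
assert abs(server_only * network_and_requests - total) < 1e-9
print(f"server {server_only:.1f}x, network/requests {network_and_requests:.1f}x, "
      f"total {total:.1f}x")
```

Roughly a factor of 2.3 comes from the MicroBlaze server loop and a factor of about 13 from everything between the host and the BEEKeeper, giving the overall factor of about 30.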