
Visualisation of gene coexpression networks

1. [Screenshot residue: the output file open in a text editor. Each row lists two gene or probe identifiers followed by their correlation value, for example: 822967 246635 0.966666666666667]

Figure 10. An example output file.

Visualisation in Cytoscape

The produced data is visualised in cyto
2. Visualisation of gene coexpression networks

Applied functional genomics, 7.5 ECTS, 2009
Supervisor: Torgeir R. Hvidsten
By: Peter Boman & Daniel Decker

Table of Contents

Introduction
Problem Description
User's manual and program access
Program description
Input of probe to gene data
Connecting probes to gene
Choosing microarrays
Output
Visualisation in Cytoscape
Limitations
Conclusions
Reference

Introduction

Umeå Plant Science Center (UPSC) has over the years produced large amounts of poplar microarray data (Sterky et al., 2004). The microarrays have been produced from plants grown under different growth conditions and stresses. During this project we have compared data from different microarrays to investigate which genes follow a similar expression pattern across the microarrays. Once these networks have been found, the knowledge can be used to design experiments that verify whether the genes are coexpressed (Grönlund et al., 2009). Sp
3. ainputsplit);
        $z = $z + 1;
    }
    chomp(@arraydata);

Figure 5. Input of the name of the file where the microarray data is stored.

Here an array with all microarray names is created for indexing purposes.

    # Here we make an array with the names of all arrays
    my @index;
    my $i;
    my $size;
    push(@index, "PU exp");
    for ($i = 1; $i < @arraydata; $i++) {
        if (substr($arraydata[$i], 0, 2) eq "PU") {
            # First probe name reached: the header row ends here, so stop
            $size = $i;
            $i = @arraydata;
        } else {
            push(@index, $arraydata[$i]);
        }
    }

Figure 6. This algorithm creates an array with all microarray names.

Here the index numbers of all wanted arrays are found and stored in an array.

    # Here we find which index number each array has in our index array
    my $j;
    my @arrayquery;
    chomp(@index);
    chomp(@arrayval);
    for ($i = 0; $i < $size; $i++) {
        for ($j = 0; $j < @arrayval; $j++) {
            chomp($index[$i]);
            chomp($arrayval[$j]);
            if (substr($index[$i], 0, 8) eq substr($arrayval[$j], 0, 8)) {
                push(@arrayquery, $i);
            } else {
            }
        }
    }
    chomp(@arrayquery);

Figure 7. The microarrays that are to be examined are stored in a separate array.

Here is a subroutine that takes a gene name and calculates the average value of all probes connected to this gene. This is done for all chosen microarrays. The subroutine then returns an array with all values. Outlying and missing values are removed.

    # Here we compute the mean values
    my @medelarray;
    sub getmedel {
        med
4. ation. The correlation values are then stored in an output file. Finally, the correlation values are used to build a coexpression network in Cytoscape. The network consists of nodes that represent genes and edges that represent correlations above a certain threshold.

User's manual and program access

The program can be obtained by sending an email to pebo0002@student.umu.se or dade0002@student.umu.se. The program requires that you are able to run Perl programs and that your input files are in the right format. The required input files are as follows.

The microarray data file needs to start with PU exp, followed by the tab-separated names of the microarrays, and then by each probe name with its expression values, tab separated and in the same order as the microarrays they belong to. For example:

    PU exp    array1    array2
    PU00001    value from array1    value from array2
    PU00002    value from array1    and so on

The file that specifies which probes are connected to which genes needs to start with PU followed by a tab, before the probes in the format PUxxxxx can start. The first probe needs to be named PU00001, the second PU00002, and so on. The gene name should appear five tabs after the PUxxxxx. For example:

    PU
    PU00001    onetab    twotab    threetab    fourtab    gene name    comments of variable length
    PU00002    onetab    twotab    threetab    fourtab    gene name
    and so on

Our file can be downloaded from http://www.populus.db.u
5. elarray = ();
        my @puprober;
        my $daniel = shift;
        my $proteinvarde = $protprobe{$daniel};
        @puprober = split(/\s+/, $proteinvarde);
        my $medel = 0;
        my $value1 = 0;
        my $divider = 0;
        my $value2 = 0;
        for ($j = 0; $j < @arrayquery; $j++) {
            for ($i = 0; $i < @puprober; $i++) {
                # The probe number times the row width gives the start of the probe's row
                my $index2 = @index;
                $value2 = substr($puprober[$i], 2, 7) * $index2;
                $value1 = $arraydata[$value2 + $arrayquery[$j]];
                # Skip outlying and missing values
                if ($value1 < 10 && $value1 > -10 && $value1 != 0) {
                    $medel = $medel + $value1;
                    $divider = $divider + 1;
                }
            }
            if ($divider != 0) {
                $medel = $medel / $divider;
                push(@medelarray, $medel);
            }
            $divider = 0;
            $medel = 0;
        }
        return @medelarray;
    }

Figure 8. This algorithm gathers the microarray values from all arrays that are to be examined and calculates the average of those that originate from the same gene.

Here a Spearman rank correlation test is made for every gene versus every other gene. The results are written to one file for positive correlations and a separate file for negative correlations. A percentage ticker shows how large a portion of the samples has been screened.

Output

    for ($f = 0; $f < @proteins; $f++) {
        @expvalue = getmedel($proteins[$f]);
        $done = $done + $tot2 - $f - 1;
        # Keep only genes that have a value in every selected array
        if (@expvalue == @arrayquery) {
            for ($l = $f + 1; $l < @proteins; $l++) {
                @expvalue2 = getmedel($proteins[$l]);
                if (@expvalue2 == @arrayquery) {
                    $c = Statistics::RankCorrelation->new(\@expvalue, \@expvalue2);
                    $n = $c->spearman;
                    if ($n > $tres) {
                        open(MYOUTFILE, ">>ou
6. earman rank correlation is used to assess whether two genes are coexpressed. Spearman rank correlation is a non-parametric method of measuring correlation that does not assume a linear relationship between the variables. It is the Pearson product-moment correlation calculated on ranks. Two sets are compared by ranking the values in each column; the squared differences between the paired ranks are then summed, and this sum is the d in Formula 1, where n is the number of paired values.

$$\rho = 1 - \frac{6d}{n(n^2 - 1)}$$

Formula 1. The Spearman rank correlation coefficient.

This produced a number of coexpressed genes, depending on the similarity of the analysed microarrays. The coexpressed genes were then visualised using the network drawing tool Cytoscape (Shannon et al., 2003).

Problem Description

The program should be able to read an input file containing microarray data and a file that connects probes to genes. Each microarray consists of some ten thousand probes; some probes are unique, while others are different probes from the same gene. Each probe has its own expression value. The probes that come from the same gene need to be averaged. One has to be able to select which arrays to compare, in order to create coexpression values from trees grown in different conditions. The next step is to create an array profile with the averaged expression value of a gene from each of the selected microarrays. The created profile is then compared, using Spearman's rank correlation, to the profiles of all other genes to determine their correl
7. is true, it is a probe, and five elements later (five tabs later in the original data) in the array is the gene name. The probe name is then pushed into a hash as key, with the gene name as value. If no gene name is present, the gene is denoted with the probe name.

    # Probes >>> protein
    my $ip;
    my $placep;
    my $counterp;
    for ($ip = 1; $ip < @datainputsamlingp; $ip++) {
        $placep = substr($datainputsamlingp[$ip], 0, 2);
        if ($placep eq "PU" and length($datainputsamlingp[$ip]) == 7) {
            $counterp = $ip + 5;
            my $protp = $datainputsamlingp[$counterp];
            my $peter = length($protp);
            # No gene name present: use the probe name followed by "x"
            if ($peter == 0 or $peter == 1) {
                $protp = join("", $datainputsamlingp[$ip], "x");
            }
            $tab1{$datainputsamlingp[$ip]} = $protp;
        } else {
        }
    }

Figure 2. The code for creating a hash table with probe name as key and gene name as value.

An array is created containing all the hash keys from the probe-to-gene hash. This array is then used to bring out each element of the hash. Each gene is then checked to determine whether it has been seen before. If it has, the old value is joined with the new one. If not, a new key is created with the gene as key and the probe name as value. The created hash is then printed.

    my $b = 0;
    my %protprobe;
    my @proteins;
    my @tab1keys = keys(%tab1);
    for (my $i = 0; $i < @tab1keys; $i++) {
        my $nyttprotein = $tab1{$tab1keys[$i]};
        for (my $j = 0; $j < @proteins; $j++) {
            my $oldprotein = $proteins[$j];
            if ($nyttprotein eq $o
8. ldprotein) {
                # Gene already seen: append this probe to its probe list
                my $oldvalue = $protprobe{$nyttprotein};
                my $newvalue = $tab1keys[$i];
                delete($protprobe{$nyttprotein});
                my $summa = join(" ", $oldvalue, $newvalue);
                $protprobe{$nyttprotein} = $summa;
                $b = 1;
            } else {
            }
        }
        if ($b == 0) {
            push(@proteins, $nyttprotein);
            $protprobe{$nyttprotein} = $tab1keys[$i];
        }
        $b = 0;
    }
    print %protprobe;

Figure 3. The algorithm to check whether the probe already exists.

Choosing microarrays

Which microarrays to examine is determined here, and the accepted input format is specified. The only critical requirement is that the array names are separated by one whitespace character.

    # Here you decide which microarray data should be entered
    print "\nWhat arrays do you want to use? Write their name in this way:\n XXXX XXX XXXX XXX XXXX XXX\n";
    my $z = 0;
    my @arrayval;
    my $infile1 = <STDIN>;
    chomp($infile1);
    print "infile: $infile1";
    my @peterdaniel = split(/\s+/, $infile1);
    chomp(@peterdaniel);
    push(@arrayval, @peterdaniel);
    chomp(@arrayval);
    print @arrayval;

Figure 4. Input of microarray names.

Here a file name is requested. The chosen microarray data file must be present in the same folder as your program.

    print "\nPlease insert your arraydata file name\n";
    my $infile = <STDIN>;
    open(DATA, $infile);
    my @datainput = <DATA>;
    my @arraydata;
    $z = 0;
    while ($z < @datainput) {
        my @datainputsplit = split(/\s+/, $datainput[$z]);
        push(@arraydata, @dat
9. mu.se/data/genespring.tab. All data is case sensitive.

Program description

The program is able to read different input files that supply the data for the probe-to-gene determination and the correlation calculations, as long as the input is in the right format. Different microarrays can then be chosen by the user, and correlation thresholds can be set. The code is described here.

Input of probe to gene data

First a request for data is printed. The user may then enter a file name, and that file is opened. The program creates an array where each element is a line from the input file. Each element is then split on tabs and pushed into a new data array.

    print "Welcome\n";
    print "Please enter the name of your input probe>>protein mapping file\n";
    my $infilep = <STDIN>;
    open(DATA, $infilep);
    my @datainputp = <DATA>;
    my @proteinsp;
    my $protp;
    my $zp = 0;
    my @datainputsamlingp;
    my @datainputsplitp;
    my %tab1;
    while ($zp < @datainputp) {
        my @datainputsplitp = split(/\t/, $datainputp[$zp]);
        push(@datainputsamlingp, @datainputsplitp);
        $zp = $zp + 1;
    }
    ########################

Figure 1. The algorithm for reading a text file and putting each tab-separated field into an element of an array.

Connecting probes to gene

Since the data file starts with PU exp, the second element in the array has to be used as the start value in the loop. A substring is created from the first two characters of each element to see if they are PU. If this
10. rrelated hits with a correlation over 0.95. Many of the genes are unknown, and without extensive examination it is hard to determine whether this is a correct network of the coexpressed genes.

Reference

Grönlund A, Bhalerao RP, Karlsson J (2009) Modular gene expression in poplar: a multilayer network approach. New Phytologist, vol. 181, p. 315-322.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, vol. 13(11), p. 2498-2504.

Sterky F, Bhalerao RR, Unneberg P, Segerman B, Nilsson P, Brunner AM, Campaa L, Jonsson Lindvall J, Tandre K, Strauss SH, Sundberg B, Gustafsson P, Uhlen M, Bhalerao RP, Nilsson O, Sandberg G, Karlsson J, Lundeberg J, Jansson S (2004) A Populus EST resource for plant functional genomics. Proc Natl Acad Sci, vol. 101(38), p. 13951-13956.
11. scape. The picture below shows how the different genes correlate to each other. Green edges (lines) indicate positive correlation, whereas negative correlation is shown by red edges. The genes are represented by nodes (dots). To simplify the picture, a gene can be chosen together with its closest correlated neighbours, as shown in Figure 12.

Figure 12. A few genes with the genes that are positively correlated to them.

Limitations

The file format needs to follow the examples exactly, or the data and results will be corrupted. The program will not print any error messages if the input is wrong and will run through all steps. All input files need to be placed in the same directory as the program. If PU in uppercase is present in positions other than in PU numbers, your values might be distorted. All genes must have values in all examined microarrays; if values are missing, the gene will be discarded.

Conclusions

The program is able to calculate the Spearman correlation between two vectors, as long as the data is in the right format. It is important to use several microarrays to minimize the number of false positives due to random correlation. As approximately 15000 genes are compared, this gives approximately 112 million comparisons; random positives are therefore likely to occur. In our test runs, seven different microarrays from poplar samples grown in long day were evaluated. This resulted in around 10000 positively co
12. tput.txt");
                        print MYOUTFILE "$proteins[$f] $proteins[$l] $n\n";
                        print "@expvalue and @expvalue2 work together\n\n";
                    }
                    if ($n < -$tres) {
                        open(MYOUTFILE1, ">>outputminus.txt");
                        print MYOUTFILE1 "$proteins[$f] $proteins[$l] $n\n";
                    }
                }
            }
        }
        $procent = $done / $tot * 100;
        print "$done of $tot, $procent % done\n";
    }
    print "Complete\n";
    print "All highly positive correlations are saved in output.txt\nAll highly negative correlations are saved in outputminus.txt\n";
    close(MYOUTFILE);

Figure 9. Here the Spearman correlation is determined between each pair of genes, and pairs with a correlation higher than the set threshold are saved in the output file.

The produced data is printed to a text file.

[Screenshot residue: an output file open in the gedit text editor. Each row lists two gene or probe identifiers followed by their correlation value, for example: 560050 553823 0.966666666666667]
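The correlation values seen in the output file can be checked by hand against Formula 1. The following is only a sketch under assumptions the report does not state: there are no tied ranks, and n = 9 paired values is chosen purely for illustration. As in Formula 1, d is the sum of squared rank differences and n is the number of paired values.

```latex
\begin{align*}
\rho &= 1 - \frac{6d}{n(n^2 - 1)} \\
% Assumption for illustration only: n = 9
n(n^2 - 1) &= 9 \cdot 80 = 720 \\
d = 2: \quad \rho &= 1 - \frac{12}{720} \approx 0.9833 \\
d = 4: \quad \rho &= 1 - \frac{24}{720} \approx 0.9667
\end{align*}
```

Under these assumptions, d = 2 arises when exactly two adjacent ranks are swapped between the two profiles, and d = 4 arises, for example, from two such swaps. With so few paired values only a handful of distinct coefficients are possible near 1, which would explain why the same values recur from row to row.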
