Analysis of Dinucleotide Bias and Genomic Signatures Across Cyanobacterial Genomes

Ratna Prabha and Dhananjaya P. Singh*
1 Mewar University, Gangrar, Chittorgarh, Rajasthan-312 901; ratnasinghbiotech30@gmail.com
2 National Bureau of Agriculturally Important Microorganisms, Kushmaur, Maunath Bhanjan 275103, INDIA; dpsfarm@rediffmail.com (*corresponding author)
July 18, 2014

Abstract


INTRODUCTION
The 'dinucleotide relative abundance' or 'frequency of dinucleotides' in any nucleotide sequence is regarded as a 'general design', and closely related organisms tend to have identical or similar general designs compared with distant ones [1].

ISSN 2348-6201
For these reasons, dinucleotide relative abundance has been described as a genomic signature that tends to be specific to a given DNA source. It has proved effective in explaining the variance among DNA sequences of prokaryotes and eukaryotes, including viruses and their hosts, and it also provides information on variation at codon sites [2][3]. Dinucleotide relative abundance values are consistent throughout a genome and reflect the contribution of genome-wide processes such as replication, recombination and repair. Environmental factors such as ecology (e.g. energy sources and systems), temperature extremes, γ-radiation damage and osmolarity gradients, together with direct or indirect transfer of genomic DNA between organisms, also influence the genomic signature [1]. Differences in dinucleotide relative abundance specifically reflect structural features of DNA such as duplex curvature and supercoiling [4]. Dinucleotide relative abundance underlies dissimilarity measures calculated from dinucleotide counts and is also used as an alignment-free approach for assessing evolutionary distances between homologous sequences. Phylogenetic analysis based on the dinucleotide relative abundance distance (or "delta-distance") is particularly useful for whole genomes and yields sound results [5][6][7].
The dinucleotide relative abundance profile is very stable and consistent throughout the genome, even when only 50 kb fragments are considered [2,4]. This stability results from several factors, such as constraints on dinucleotide stacking energy and DNA helicity, the mechanisms of replication and repair, and context-dependent mutation pressures [1,4,5,8,9].
Because the genome signature is characteristic of a given bacterial genome, it can also identify putative horizontally transferred DNA. Owing to its species-specific character, this genomic signature allows recognition of anomalous genomic regions [10,11,12].
Special attention has been given to prokaryotic genomes in analyses of biases in nucleotide composition and organization and of the short oligonucleotide combinations they contain [13][14][15]. Much emphasis has been placed on analyses of dinucleotide frequencies and codon usage [10][16][17][18][19]. In archaea and bacteria, oligonucleotide usage is related to several properties, such as DNA base-stacking energy, codon usage and DNA structural conformation. Further, prokaryotic DNA is reported to be correlated over short ranges, with information encoded in short oligonucleotides.
Oligonucleotide usage varies more in GC-rich and free-living genomes than in AT-rich and host-associated genomes [20].
Cyanobacteria (blue-green algae) represent one of the eleven major eubacterial phyla and are an extremely diverse group of prokaryotes in terms of their physiological, morphological and developmental characteristics. They are an ancient group of photosynthetic prokaryotes that differ greatly in habitat, cellular differentiation strategies and physiological capacities [21]. In recent decades, advances in DNA sequencing technology have facilitated the sequencing of a number of cyanobacterial genomes from different physiological groups and species.
Complete genome sequences of groups of microbes, including cyanobacteria, allow close inspection of genomic features and characteristics within and between species.
This study was carried out to analyse dinucleotide frequencies and average absolute dinucleotide relative abundance difference across different cyanobacterial genomes.

Calculation of dinucleotide relative abundance value
We determined the dinucleotide relative abundance value for each of the 41 cyanobacteria using the following equation:
\[ \rho_{XY} = \frac{f_{XY}}{f_X f_Y} \]
where $f_{XY}$ denotes the frequency of dinucleotide XY and $f_X$ and $f_Y$ denote the frequencies of X and Y, respectively.
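A minimal sketch of this calculation, assuming the usual odds-ratio form f_XY / (f_X f_Y) built from the frequencies defined above. The function name `relative_abundance` and the example sequence are hypothetical; overlapping dinucleotides are counted on a single strand, and any position containing a base other than A, C, G or T is skipped (an assumption, not a documented step of the study).

```python
from collections import Counter

BASES = set("ACGT")

def relative_abundance(seq):
    """Return {XY: f_XY / (f_X * f_Y)} for the dinucleotides observed in seq."""
    seq = seq.upper()
    # Mononucleotide counts, ignoring ambiguous symbols such as N.
    mono = Counter(b for b in seq if b in BASES)
    # Overlapping dinucleotide counts, keeping only unambiguous pairs.
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    return {
        xy: (count / n_di) / ((mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono))
        for xy, count in di.items()
    }

# Toy sequence for illustration only; not data from the study.
rho = relative_abundance("ACGTACGTTTAA")
for xy in sorted(rho):
    print(xy, round(rho[xy], 3))
```

Values near 1 indicate that a dinucleotide occurs about as often as expected from its base composition; values well below or above 1 indicate under- or overrepresentation.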

Calculation of average absolute dinucleotide relative abundance difference
The dissimilarities in relative abundance of dinucleotides between two sequences (f and g) were calculated with the Genome signature comparisons (δ*-differences) webserver (http://www.cmbl.uga.edu/software/delta-differences.html). This webserver computes δ*-differences using the following equation (Karlin et al., 1997):
\[ \delta^*(f, g) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(f) - \rho^*_{XY}(g) \right| \]
where the sum runs over all 16 dinucleotides XY and $\rho^*_{XY}$ is the dinucleotide relative abundance computed from a sequence pooled with its reverse complement. The program first divides each genome into non-overlapping segments of ~50,000 bp, then calculates the δ* value for each pair of segments from the two genomes, and reports the average of all comparisons between 50 kb segments, multiplied by 1000 for convenience.
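The segment-wise procedure described above can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the webserver's code: all function names are hypothetical, relative abundances are symmetrized by pooling each segment with its reverse complement, a trailing remainder shorter than one segment is dropped, and each genome is assumed to yield at least one full segment.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rho_star(seq):
    """Symmetrized relative abundances: seq is pooled with its reverse complement."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]
    mono, di = Counter(), Counter()
    for strand in (seq, rc):
        mono.update(b for b in strand if b in BASES)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[d] for d in DINUCS)  # ambiguous pairs are not counted
    rho = {}
    for d in DINUCS:
        expected = (mono[d[0]] / n_mono) * (mono[d[1]] / n_mono)
        rho[d] = (di[d] / n_di) / expected if expected > 0 else 0.0
    return rho

def delta_star(f, g):
    """Average absolute relative abundance difference over the 16 dinucleotides."""
    rf, rg = rho_star(f), rho_star(g)
    return sum(abs(rf[d] - rg[d]) for d in DINUCS) / 16

def segments(seq, size=50_000):
    """Non-overlapping windows of exactly `size` bases (remainder dropped)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def mean_delta_star(genome_f, genome_g, size=50_000):
    """Mean delta* over all segment pairs from the two genomes, times 1000."""
    values = [
        delta_star(a, b)
        for a in segments(genome_f, size)
        for b in segments(genome_g, size)
    ]
    return 1000 * sum(values) / len(values)
```

Because each segment is pooled with its reverse complement, a sequence and its reverse complement give a δ* of zero, so the result does not depend on which strand was deposited.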

Statistical analysis
Statistical analysis, i.e. calculation of means, standard deviations and correlations, was carried out with SPSS 16.0 software.
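For readers without SPSS, the same descriptive statistics and a Pearson correlation coefficient can be sketched with the Python standard library. This is an assumed equivalent, not the authors' procedure: the text does not state which correlation coefficient was used, the helper `pearson_r` is hypothetical, and the input values below are placeholders rather than data from the study. The two-tailed significance test is not covered.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Placeholder values for illustration only; NOT data from the study.
toy_gc_percent = [30.0, 40.0, 50.0, 60.0]
toy_rho_values = [0.90, 0.80, 0.70, 0.60]

print("mean:", statistics.mean(toy_rho_values))
print("sample SD:", statistics.stdev(toy_rho_values))  # n - 1 denominator
print("Pearson r:", pearson_r(toy_gc_percent, toy_rho_values))
```

`statistics.stdev` uses the n - 1 denominator (sample standard deviation); `statistics.pstdev` would give the population version if that is preferred.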

Comparison of dinucleotide relative abundance values across genomes
The distribution of the frequencies of the 16 dinucleotides, expressed as 10 symmetrized dinucleotides, in 41 species of cyanobacteria is shown in Table 1. Our study indicated that TA is broadly underrepresented, followed by CG and then AC+GT, as shown earlier [4,23,24]. Slight variation was observed in the distribution pattern of CC+GG, followed by TG+CA.
For these two sets of dinucleotides in particular, the distribution ranged from average to overrepresented across all cyanobacteria (Table 1). CC+GG occurred more frequently in members of the order Prochlorales and across Cyanothece species. AA+TT and GC accounted for a major portion of the dinucleotide distribution across all members of the dataset. Underrepresentation of TA appeared to be influenced by the GC content of the members, as the relative abundance of TA tended to decrease as GC content increased. Members with low GC content showed underrepresentation of CG (Table 1). TA showed the widest frequency range, followed by CG; the wide range of both dinucleotides is evident from Table 1. A wide range was also observed for CC+GG, TG+CA and GC. The remaining dinucleotides, which generally combine one strong and one weak nucleotide (AC+GT, AG+CT, TC+GA), AA+TT and AT being the exceptions, showed a narrow frequency range compared with the others (Figure 1).
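The reduction from 16 dinucleotides to 10 symmetrized classes follows from pairing each dinucleotide with its reverse complement. The sketch below shows the grouping; the function names are hypothetical, and the label order within a pair (e.g. CA+TG rather than TG+CA) is an arbitrary consequence of the iteration order.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return s.translate(COMPLEMENT)[::-1]

def symmetrized_classes():
    """Group the 16 dinucleotides into reverse-complement classes."""
    classes = []
    seen = set()
    for xy in ("".join(p) for p in product("ACGT", repeat=2)):
        if xy in seen:
            continue
        rc = reverse_complement(xy)
        seen.update({xy, rc})
        # Self-complementary dinucleotides (AT, CG, GC, TA) stand alone.
        classes.append(xy if xy == rc else f"{xy}+{rc}")
    return classes

print(symmetrized_classes())
```

Four dinucleotides (AT, CG, GC, TA) are their own reverse complements, and the remaining twelve pair up into six classes, giving ten in total.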

Relation between DRDA and GC content
The mean and standard deviation were computed for the GC content of the genomes and for each dinucleotide relative abundance value across all 41 cyanobacteria under consideration (Table 2). TA, CG, AC+GT and AT are the least represented, whereas the remaining dinucleotides have values close to their means and show little deviation in their distribution (Table 2).
We further carried out correlation analysis between the GC content of each genome and each dinucleotide combination to assess the nature of the relationship between them (Table 2). GC content was negatively correlated with TA, CC+GG and AG+CT, suggesting that these dinucleotides are depleted in GC-rich organisms, whereas it was positively correlated with AT, TG+CA and CG, suggesting their prevalence in organisms with high GC content (Table 3).

Table 3. Correlation between GC percentage and each of the dinucleotides in 41 cyanobacteria (**correlation is significant at the 0.01 level (2-tailed); *correlation is significant at the 0.05 level (2-tailed)).

Discussion
In our analysis, TA was broadly underrepresented, followed by CG and then AC+GT, whereas AA+TT and GC accounted for a major portion of the dinucleotide distribution across all cyanobacterial genomes. Underrepresentation of TA seems to be influenced by the GC content of the members, as the relative abundance of TA tends to decrease as GC content increases. Furthermore, GC content is negatively correlated with TA, CC+GG and AG+CT, suggesting that these dinucleotides are depleted in GC-rich organisms. GC content is positively correlated with AT, TG+CA and CG, which suggests their prevalence in organisms with high GC content. An interesting feature of this group of cyanobacteria is their GC content. Members with similar GC content have similar genomic signatures and group together in a single clade, even when they come from different taxonomic orders. Habitat also seems to influence the dinucleotide relative abundance values: marine organisms show very similar genome signature patterns and group together as a single clade in the clustering based on genomic signature differences.
The same holds for organisms from freshwater, land or multiple habitats. Average dinucleotide relative abundance distances are larger between genomes of different species than within genomes. This distinction indicates that the compositional variation of a genome is governed by factors specific to that genome. Furthermore, each of the 16 dinucleotides, or 10 symmetrized dinucleotides, has its own DNA structural preferences [4]. The dinucleotide TA remains mostly underrepresented [4,23,24]. This is most likely because TA has the lowest stacking energy of all the dinucleotides, which provides the flexibility needed for unwinding of the DNA double helix. TA is also part of many regulatory sequences (e.g. TATA box, polyadenylation signals), so restricted TA usage may help to avoid improper binding of regulatory factors [1,4]. Thus, the universal underrepresentation of TA in cyanobacterial genomes is an expected outcome of its extraordinarily low stacking energy.