Of the 20 standard amino acids, 18 are encoded by multiple synonymous codons. These synonymous codons are not functionally redundant: all of them contribute to protein expression, structure and function. It is therefore useful to understand the rules governing synonymous codon selection in a species, so that a heterologous gene can be designed for efficient expression in the host.
In the natural host, most synonymous codons in a gene have been shaped by evolutionary selection and are related to the expression and function of the protein. However, when a target gene is designed for expression, most existing codon optimization tools favor the high-frequency-usage codons of the host and neglect its low-frequency-usage codons.
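The high-frequency strategy described above can be illustrated with a minimal sketch. It is not the implementation of any particular tool: the function names (`codon_usage`, `most_frequent_codons`, `naive_back_translate`) and the toy gene are assumptions made for illustration only, and real tools derive usage tables from whole genomes.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(coding_sequences):
    """Count how often each codon is used for each amino acid."""
    counts = defaultdict(Counter)
    for seq in coding_sequences:
        seq = seq.upper()
        # Read in-frame codons, ignoring any trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            aa = GENETIC_CODE.get(codon)
            if aa is not None:
                counts[aa][codon] += 1
    return counts

def most_frequent_codons(counts):
    """Pick the single most used codon for each amino acid."""
    return {aa: usage.most_common(1)[0][0] for aa, usage in counts.items()}

def naive_back_translate(protein, best_codon):
    """Back-translate using only the most frequent codon per residue."""
    return "".join(best_codon[aa] for aa in protein)

# Toy "genome" of one gene (hypothetical data, written codon by codon).
toy_gene = "".join(["ATG", "AAA", "AAA", "AAG", "CTG", "CTG", "TTA", "TAA"])
best = most_frequent_codons(codon_usage([toy_gene]))
designed = naive_back_translate("MKLK", best)
```

In this toy example the rarer codons AAG and TTA never appear in the designed gene, which is the behavior the paragraph above describes: low-frequency-usage codons are discarded regardless of sequence context.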
In this study, we developed Presyncodon, a method with a web server, to predict the codons of a gene from its protein sequence using evolutionary information from the expression host. The synonymous codon usage patterns of peptides were learned from large sets of genomes (Escherichia coli, Bacillus subtilis and Saccharomyces cerevisiae). Machine-learning models were constructed to predict synonymous codon selection (low- or high-frequency-usage codon) in a gene. All possible synonymous codon selection tendencies of the middle residue in a fragment were predicted by the models and stored in a PostgreSQL database.
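The fragment-based lookup can be sketched as follows. This is a hedged illustration, not the Presyncodon implementation: the window length of 5, the `-` padding at the termini, the in-memory dictionary standing in for the PostgreSQL table, and the fallback codon table are all assumptions, since the text above does not specify them.

```python
WINDOW = 5   # assumed odd fragment length; not stated in the source
PAD = "-"    # assumed padding symbol for residues near the termini

def fragments(protein, window=WINDOW):
    """Return one fragment per residue, centered on that residue."""
    half = window // 2
    padded = PAD * half + protein + PAD * half
    return [padded[i:i + window] for i in range(len(protein))]

def design_gene(protein, fragment_table, fallback):
    """Choose a codon for each residue from the fragment centered on it.

    fragment_table: hypothetical stand-in for the stored predictions,
                    mapping a fragment to the codon of its middle residue.
    fallback:       hypothetical per-residue codon used when the fragment
                    has no stored prediction.
    """
    codons = []
    for residue, frag in zip(protein, fragments(protein)):
        codons.append(fragment_table.get(frag, fallback[residue]))
    return "".join(codons)

# Toy data (hypothetical): two fragments have stored predictions that
# favor a lower-frequency codon for their middle residue.
fragment_table = {"-MKLK": "AAG", "MKLK-": "TTA"}
fallback = {"M": "ATG", "K": "AAA", "L": "CTG"}
gene = design_gene("MKLK", fragment_table, fallback)
```

Because the codon depends on the surrounding fragment rather than on the residue alone, the two lysines in the toy protein receive different codons (AAG and AAA), which a purely frequency-based choice would not produce.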
The method can now be used easily and efficiently to design new genes from protein sequences for optimal expression in the three expression hosts (E. coli, B. subtilis and S. cerevisiae).
Presyncodon, a web server for gene design with evolutionary information of the expression hosts. Jian Tian, Qingbin Li, Xiaoyu Chu, Ningfeng Wu and Yunliu Fan. Submitted.