Help:Does VectorBase provide masked sequences
From VectorBase Help System
Masking
Part of the process of annotating a genome involves repeat masking the genomic sequence prior to gene prediction. We use repeat masking to reduce the search space for the main aim of genome annotation: finding genes.
- Vector masking
Vector sequences are clipped from the whole-genome shotgun (WGS) reads prior to assembly, so we expect a low level of vector contamination in the final consensus genome assembly. There is probably a correlation between the quality of the assembly, in terms of scaffold/contig N50, and the potential level of vector contamination. Vector contamination should not be a problem for most projects.
- Repeat masking
We collate known repeats and transposable elements (from GenBank or specific public databases such as TEfam) with de novo repeat finding data (using RECON and RepeatScout) to generate a RepeatMasker library file. This is then used to mask those regions of the assembly which are deemed to be repetitive.
- Low-complexity filtering
We run the Dust and TRF programs to identify low-complexity regions and tandem repeats.
We make the genome sequences available in so-called 'softmasked' format. This means that the sequence files are not hard-masked (i.e. repetitive nucleotides are not converted to N's); instead, those regions which are repetitive are converted to lower case (see the very simple example below, which hopefully makes this clearer).
>my genome
GATGCTAGCTGAGGAGGAGCGTCTAGGCTATGCTAGCT
>my genome (masked)
GATGCTAGCTNNNNNNNNNCGTCTAGGCTATGCTAGCT
>my genome (softmasked)
GATGCTAGCTgaggaggagCGTCTAGGCTATGCTAGCT
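Because softmasking only changes the case of the bases, the hard-masked form can always be recovered from a softmasked sequence. The following is a minimal sketch of that conversion, assuming (as in the example above) that lower case marks masked bases; the function names `hard_mask` and `masked_intervals` are illustrative and not part of any VectorBase or Ensembl tool.

```python
import re


def hard_mask(seq: str) -> str:
    """Replace every lower-case (softmasked) base with N."""
    return re.sub(r"[a-z]", "N", seq)


def masked_intervals(seq: str) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) spans of lower-case runs."""
    return [m.span() for m in re.finditer(r"[a-z]+", seq)]


# The softmasked sequence from the example above.
softmasked = "GATGCTAGCTgaggaggagCGTCTAGGCTATGCTAGCT"

print(hard_mask(softmasked))         # GATGCTAGCTNNNNNNNNNCGTCTAGGCTATGCTAGCT
print(masked_intervals(softmasked))  # [(10, 19)]
```

Going the other way is not possible: once bases have been replaced with N's the original sequence is lost, which is one reason softmasked files are a convenient starting point.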
You can download softmasked FASTA files from the VectorBase website and use them as the starting point for your analysis. Alternatively, the Ensembl API allows defined sequence regions to be extracted in both raw and masked formats.
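As a starting point for working with a downloaded file, the sketch below parses FASTA records and reports the fraction of each sequence that is softmasked. It is a simplified illustration rather than a VectorBase tool: `read_fasta` and `softmasked_fraction` are hypothetical helper names, and the parser accepts any iterable of lines (for example an open file handle) so that no particular file name is assumed.

```python
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def softmasked_fraction(seq: str) -> float:
    """Fraction of bases that are lower case (i.e. softmasked)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


# Toy input; with a real file, pass an open file handle instead of a list.
example = [
    ">my genome (softmasked)",
    "GATGCTAGCTgaggaggag",
    "CGTCTAGGCTATGCTAGCT",
]
for name, seq in read_fasta(example):
    print(name, len(seq), round(softmasked_fraction(seq), 3))
```

For whole genomes an established parser (for example the one in Biopython) is likely to be a better choice; the hand-written version is shown only to keep the example self-contained.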
As with most bioinformatic analyses, we cannot claim that *all* repeats have been identified and marked up. With this in mind, it is always worthwhile checking that any oligo sequences you design map uniquely to the genome, to reduce cross-hybridisation problems downstream in your analyses. There are bound to be sequences repeated at low copy number (perhaps members of multi-gene families) that are not picked up by the repeat-masking software. In fact, we do not want these regions to be masked, as they are genic and important from an annotation perspective.
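A very rough first pass at the uniqueness check can be sketched as follows. This is an illustrative assumption about how one might do it, not a VectorBase procedure: it counts exact, case-insensitive matches of an oligo on both strands, so it will miss near-identical sites that could still cross-hybridise. A sequence similarity search such as BLAST is the more usual way to find those. The names `reverse_complement` and `count_exact_hits` are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence, preserving case."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def count_exact_hits(oligo: str, genome: dict[str, str]) -> int:
    """Count exact matches of the oligo on both strands of every sequence.

    Matching ignores case, so softmasked (lower-case) regions are searched too.
    """
    forward = oligo.upper()
    # A set, so a palindromic oligo is not counted twice at the same site.
    queries = {forward, reverse_complement(forward)}
    hits = 0
    for seq in genome.values():
        target = seq.upper()
        for query in queries:
            start = target.find(query)
            while start != -1:
                hits += 1
                start = target.find(query, start + 1)  # allow overlapping hits
    return hits


# Toy "genome" built from the softmasked example sequence.
genome = {"my genome": "GATGCTAGCTgaggaggagCGTCTAGGCTATGCTAGCT"}

print(count_exact_hits("CGTCTAGG", genome))  # 1 (unique)
print(count_exact_hits("GCTAGCT", genome))   # 2 (not unique)
```

An oligo with more than one hit is a candidate for redesign, but a single exact hit does not by itself guarantee specificity.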
