RepeatMasker Installation and Use – Genomic Repeat Sequence Annotation

RepeatMasker is specialized software for identifying, annotating, and classifying repetitive sequences in genomes, and it supports almost all species. It is essential for research on genomes, non-coding RNAs, transposons, mitotic collaterals, and related fields. Many small RNAs and lncRNAs are closely associated with repeat regions.

Software Installation and Configuration

The installation environment is Ubuntu 16.04.2 x64, and all software and databases are the latest versions available when this article was published. This article installs with root privileges so that every user on the machine can use the software. If you do not have root permissions, download and install the software in your own directory, then point the RepeatMasker configure step at the location of each dependency. If you do not set environment variables, you can always run RepeatMasker by its full path.

  1. RMBlast sequence search engine

http://www.repeatmasker.org/RMBlast.html

2.6.0 ver 2 2017-3-29

# Download the RMBlast source package and compile it
cd ~/bin/
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.6.0/ncbi-blast-2.6.0+-src.tar.gz
wget http://www.repeatmasker.org/isb-2.6.0+-changes-vers2.patch.gz
tar zxvf ncbi-blast-2.6.0+-src.tar.gz
gunzip isb-2.6.0+-changes-vers2.patch.gz
cd ncbi-blast-2.6.0+-src
patch -p1 < ../isb-2.6.0+-changes-vers2.patch
cd c++
./configure --with-mt --prefix=/usr/local/rmblast --without-debug
make
# Install the program and libraries into the system directory. make install reports errors, but rmblastn, the program we need, is installed and works
sudo make install # Makefile:40: recipe for target 'install-toolkit' failed
# Test if the program installed successfully
/usr/local/rmblast/bin/rmblastn -h

2.TRF (Tandem Repeats Finder): search for tandem repeats

http://tandem.bu.edu/trf/trf.download.html

4.09 2016-2-22
The latest version is 4.09. On this OS, the 64-bit legacy build is the one that runs.

cd ~/bin/
wget http://tandem.bu.edu/trf/downloads/trf409.legacylinux64
chmod +x trf409.legacylinux64
sudo cp trf409.legacylinux64 /usr/local/bin/trf
# Test: running trf without arguments prints the help message
trf

3.RepeatMasker program

http://www.repeatmasker.org/RMDownload.html

4.0.7 2017-2-1

# Download the 267 MB package; the download is slow, so run it in the background
nohup wget -c http://www.repeatmasker.org/RepeatMasker-open-4-0-7.tar.gz &
# Once the download has finished, unpack it (this creates the RepeatMasker/ directory)
tar xvzf RepeatMasker-open-4-0-7.tar.gz

4.Repbase database

http://www.girinst.org/server/RepBase/index.php

Registration is required to download; accounts are approved manually, which may take two days.
RepBaseRepeatMaskerEdition-20170127.tar.gz (48.84 MB)
You can also download it from my Baidu Pan share. Upload it to the server, into the same directory as the RepeatMasker download.
http://pan.baidu.com/s/1c2zSMKo

mv RepBaseRepeatMaskerEdition-20170127.tar.gz RepeatMasker/
cd RepeatMasker/
tar xvzf RepBaseRepeatMaskerEdition-20170127.tar.gz

5.Configure RepeatMasker dependencies

# Run the interactive configuration. If the default locations of perl, RepeatMasker and trf are correct, press Enter to accept each one.
# For the search engine, choose 2 (RMBlast) and enter the RMBlast installation directory, then choose 5 (Done) to finish.
# For the latest version compiled above, the directory is /usr/local/rmblast/bin.
# If the new version failed to install and you used the pre-compiled old version instead, it is /usr/local/rmblast-2.2.28/bin.
./configure
# Make RepeatMasker available to all users through a symlink on the PATH
sudo ln -s `pwd`/RepeatMasker /usr/local/bin/RepeatMasker
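
If the configure step cannot find a dependency, a quick check of each path can narrow the problem down. The following is a minimal sketch: check_dep is a hypothetical helper (not part of RepeatMasker), and the paths are the ones used in this article, so adjust them if you installed elsewhere.

```shell
#!/usr/bin/env bash
# check_dep: hypothetical helper, not part of RepeatMasker.
# Reports whether the given path is a regular file that is executable.
check_dep() {
    local path="$1"
    if [ -f "$path" ] && [ -x "$path" ]; then
        echo "OK       $path"
    else
        echo "MISSING  $path"
        return 1
    fi
}

# Paths as installed in this article (assumption: you used the same locations)
status=0
for dep in /usr/local/rmblast/bin/rmblastn /usr/local/bin/trf; do
    check_dep "$dep" || status=1
done

if [ "$status" -eq 0 ]; then
    echo "All dependencies found"
else
    echo "Fix the missing paths, then run ./configure again"
fi
```

The check only confirms that the files exist and are executable. It is still worth running rmblastn -h and trf by hand, because a binary can be present and fail at run time.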

Examples of software use

  1. Arabidopsis thaliana and Brachypodium distachyon genomes as examples
# Show basic usage, parameters and description of the program
RepeatMasker 
# Displays the program's detailed help manual
RepeatMasker -help

# Example of an Arabidopsis analysis
# Go to the directory where I store the Arabidopsis genome
cd ~/ref/phytozome/Athaliana/TAIR10/assembly
# Create the results output directory
mkdir repeat
# Run the program:
#   -parallel    number of threads to use
#   -species     species name; see the help for common species. If yours is not listed, give the lowercase Latin genus name, or the full name in quotes
#   -html, -gff  also write the results in HTML and GFF format, convenient for viewing and downstream analysis
#   -dir         output directory for the results
# The genome FASTA file must come last, after all options. Runtime here was about 8 min
time RepeatMasker -parallel 30 -species arabidopsis -html -gff -dir repeat Athaliana_167_TAIR9.fa

# Example of a Brachypodium analysis: the 274 MB genome took 13 min with 30 threads
cd ~/ref/phytozome/Bdistachyon/v3.1/assembly
mkdir repeat
time RepeatMasker -parallel 30 -species brachypodium -html -gff -dir repeat Bdistachyon_314_v3.0.fa

At the start of a run, RepeatMasker prints the release version of the database and the species-specific library information; it is worth recording these.
( Full database: DC20170127-RB20170127 )
In /mnt/bai/public/bin/RepeatMasker/Libraries/dc20170127-rb20170127/brachypodium

  • 201 Ancestral and universal sequences for Brachypodium
  • 282 Lineage-specific sequences for Brachypodium
  2. Description of the result files

* stands for the name of your genome file

1.*.out.gff: genome annotation file for the repeat sequences, similar to a gene annotation file; the most important result

    # Results Preview
    Chr1 RepeatMasker similarity 1 107 13.2 - .       Target "Motif:ATREP18" 561 649
    Chr1 RepeatMasker similarity 1066 1097 10.0 + .       Target "Motif:(C)n" 1 32
    Chr1 RepeatMasker similarity 1155 1187 17.1 + .       Target "Motif:(TTTCTT)n" 1 33

2.*.tbl: summary table of the repeat annotation results
3.*.out.html: web version of the detailed results, the same as the report from the online RepeatMasker annotation service
4.*.masked: the genome sequence with the annotated repeat regions replaced by N
5.*.out: RepeatMasker's default output format; it contains essentially the same information as the gff
6.*.cat.gz: alignments of the genome sequence against the repeat sequences
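
For a quick overview of a *.out.gff file, the hits can be tallied per repeat motif. The following is a minimal sketch, assuming the column layout shown in the results preview (start and end in columns 4 and 5, the quoted Motif name as the tenth whitespace-separated field). summarize_gff is a hypothetical helper, not a RepeatMasker tool, and the demo file simply reuses the three preview lines.

```shell
#!/usr/bin/env bash
# summarize_gff: hypothetical helper, not shipped with RepeatMasker.
# Prints one line per motif: name, number of hits, total annotated bases.
summarize_gff() {
    awk '
        /^#/ { next }                      # skip GFF header/comment lines
        NF >= 10 {
            motif = $10
            gsub(/"|Motif:/, "", motif)    # "Motif:ATREP18" becomes ATREP18
            hits[motif]++
            bases[motif] += $5 - $4 + 1    # GFF coordinates are 1-based and inclusive
        }
        END {
            for (m in hits) printf "%s\t%d\t%d\n", m, hits[m], bases[m]
        }
    ' "$1" | LC_ALL=C sort
}

# Demo input: the three lines from the results preview
demo=$(mktemp)
cat > "$demo" <<'EOF'
##gff-version 2
Chr1 RepeatMasker similarity 1 107 13.2 - . Target "Motif:ATREP18" 561 649
Chr1 RepeatMasker similarity 1066 1097 10.0 + . Target "Motif:(C)n" 1 32
Chr1 RepeatMasker similarity 1155 1187 17.1 + . Target "Motif:(TTTCTT)n" 1 33
EOF
summarize_gff "$demo"
rm -f "$demo"
```

On a real run you would call it on the file in the output directory, for example summarize_gff repeat/Athaliana_167_TAIR9.fa.out.gff. Splitting on whitespace is a shortcut that breaks if sequence names contain spaces.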
Frequently Asked Questions

    1. RMBlast installation problems
      NCBI has not updated rmblast (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/rmblast/LATEST) since version 2.2.28 in 2013. I tried to build that source package on Ubuntu 16.04, but it could not be installed.
      The RepeatMasker site provides the more recent ncbi-blast-2.6.0+-src source code and a patch. Installing them as instructed (the procedure in this article), make compiles successfully but make install reports an error; the key program rmblastn is nevertheless built and works normally.
      If the new version fails to install, you can try the pre-compiled version 2.2.28:
    cd /usr/local
    wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.28/ncbi-blast-2.2.28+-x64-linux.tar.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/rmblast/2.2.28/ncbi-rmblastn-2.2.28-x64-linux.tar.gz
    tar zxvf ncbi-blast-2.2.28+-x64-linux.tar.gz
    tar zxvf ncbi-rmblastn-2.2.28-x64-linux.tar.gz
    cp -R ncbi-rmblastn-2.2.28/* ncbi-blast-2.2.28+/
    rm -rf ncbi-rmblastn-2.2.28
    mv ncbi-blast-2.2.28+ rmblast-2.2.28
    /usr/local/rmblast-2.2.28/bin/rmblastn -h

    If 2.2.28 installs successfully, set the RMBlast location in the RepeatMasker configure step to /usr/local/rmblast-2.2.28/bin, the directory that contains rmblastn.

    2. trf runtime error
      This is caused by a GLIBC version compatibility problem on Linux, which is why the TRF author provides two builds (legacy and standard). If the one you downloaded does not run, try the other:
    wget http://tandem.bu.edu/trf/downloads/trf409.linux64
    chmod +x trf409.linux64
    ./trf409.linux64
    3. RepeatMasker can't find dependencies when running
      This happens when the ./configure step in the RepeatMasker directory was set up incorrectly. Run ./configure again and double-check the location of each dependency; RepeatMasker should then run normally.
      The prerequisite is that each dependency runs on its own, so test them individually first.
    4. No result directory and no results
      I added -dir to specify the output directory, but no results appear.
    time RepeatMasker -parallel 30 -species arabidopsis -html -gff -dir repeat Bdistachyon_314_v3.0.fa  

    You most likely forgot to create the results folder. The program does not create the directory itself, so mkdir repeat is required. You have two options: create the folder in advance, or omit the -dir option so that the results are written next to the genome file (the current directory in these examples).
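
To avoid forgetting the directory, the mkdir and the run can be wrapped together. The following is a minimal sketch that uses only the options shown in this article. run_masker, RM_BIN and RM_THREADS are hypothetical names introduced here for illustration; they are not RepeatMasker features.

```shell
#!/usr/bin/env bash
# run_masker: hypothetical wrapper, not part of RepeatMasker.
# Usage: run_masker <output_dir> <species> <genome.fa>
# RM_BIN and RM_THREADS are illustrative environment variables defined by this sketch.
run_masker() {
    local outdir="$1" species="$2" genome="$3"
    # Create the output directory first; -p makes this a no-op if it already exists
    mkdir -p "$outdir" || return 1
    "${RM_BIN:-RepeatMasker}" -parallel "${RM_THREADS:-4}" -species "$species" \
        -html -gff -dir "$outdir" "$genome"
}
```

With this in place the Brachypodium example becomes RM_THREADS=30 run_masker repeat brachypodium Bdistachyon_314_v3.0.fa, and the repeat directory is created if it is missing.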
