Analyzing gene sequences with HMMER

HMMER is a bioinformatics software used to analyze gene sequences to quickly determine the degree of similarity between two sets of sequences, and the latest version is 3.1. However, the official website of HMMER only provides the Linux version, and the Windows version is inactive. Maybe the sequence analysis software can only work well on Linux? However, a quick glance at the official documentation says: “We have been working on HMMER4 since 2011, but it has been in a slow state of development”, which may be the real reason for the discontinuation of HMMER for Windows.

Installing HMMER

  1. Download

Since I can’t open a virtual machine just to use HMMER for Linux (it doesn’t seem to be impossible), I looked for a historical version. Using the known download links for Linux, I found that HMMER keeps all of its historical software backed up in this place, so I know that the last Windows HMMER version was 3.0.

After downloading and unzipping, there are a bunch of things that I can’t understand, there is no .exe file that can be opened directly by double-clicking, so I probably need to knock the command to start the program. I searched for the installation method, and the following is a detailed demonstration.

  1. Installation

For Windows 10 users, open the Control Panel, search for Environment Variables, click Edit System Environment Variables, and select Environment Variables.

Windows 11 users can open the Start Menu and type “Environment Variables” directly into the search bar at the top.

Just click on the first match.

Find Path in System Variables and click Edit.

Click the New button on the right and fill in the path to the HMMER.

If you don’t know what a path is, go to the place where you put the HMMER, copy what’s in the address bar according to the picture below, paste it into the variable above, and when you’re done adding it, just OK it all the way.

  1. Testing

The next step is to test it to see if it works. You need to use something that you can type commands into. The most common ones are CMD and Windows PowerShell, just choose one of them.

To open CMD: WINDOWS + R, type cmd and click OK to open the command prompt.

Open Windows PowerShell: Right-click on the Start menu and select Windows PowerShell.

Then type

hmmscan -h

We recommend copying and pasting directly to avoid errors. If something like the following appears, it’s installed.

图示

Using HMMER

I’ve taken a look at the tutorials on the web to demonstrate the functions I’m currently using: hmmbuild and hmmsearch.
hmmbuild: create hmm models (roughly)
hmmsearch: analyze the similarity (should be correct)

  1. Get pfam ID

We need to use the Hidden Markov Model of the gene to do sequence comparison, and to get the Hidden Markov Model, we need to get the pfam ID of the conserved protein structural domain of the gene, we can either get the pfam id by referring to other people’s work in the literature or by searching in NCBI by ourselves, the latter is described in the following.

Go to the NCBI protein database and enter a keyword, the species keyword needs to be either its Latin name or the official English name, for example, the MADS-box gene of Jatropha curcas, e.g. “MADS-box Jatropha curcas”.

Select any of the search results to view the details, and then click “Identify Conserved Domains” on the right to search and analyze the conserved domains of the protein.

As you can see from the results, the target protein is hit in the K-box family and the pfam id is already given in the list

  1. Download protein conserved domain comparison sequence

Click directly on the pfam id above to jump to the details of the conserved structural domains of the protein family, click on the source pfam

On the pfam details page, click Alignments, select Stockholm format, and click generate to download the multiple comparison sequence in .txt file format.

  1. Download species comparison sequences

Download the Jatropha genome protein data. Go to NCBI’s FTP site, find genomes, and use Ctrl+F to search for its Latin name. The Latin name of Jatropha is Jatropha curcas, so we can try to search with Jatropha.

Select protein in the next directory, download protein.fa.gz, unzip it and get the protein.fa file

  1. Comparison and Analysis

Copy the two files you downloaded above into the HMMER folder, or any other folder, and open the command prompt.

By default, the command prompt operates in the directory C:\Users\Users>, and we need to switch to the directory where HMMER is located to continue the operation.

If HMMER is not on disk C, you need to switch the disk drive, for example, disk D, then type in

D:

Press Enter to enter the D drive, and similarly for other disks. Then switch to the location of HMMER

cd HMMER (installation location)

for example

cd D:\hmmer

If the error is reported, it may be because the path contains Chinese or other non-English characters, you need to use English single quotes to wrap the path, such as this

cd 'D:\ pretend to have Chinese\hmmer'

Next, use the hmmbuild command to convert the obtained protein conserved domain comparison sequences into an hmm model, I downloaded a file named PF01486_seed.txt, then enter

hmmbuild hmm (files Files to be converted)

for example

hmmbuild PF01486.hmm PF01486_seed.txt

The converted hmm is then compared to the jatropha protein sequence using the hmmsearch command by typing

hmmsearch PF01486.hmm protein.fa > PF01486.out
全程参考

At the end of the run, the PF01486.out file is generated, right-click to open it with Notepad

You will be able to see the results of the comparison.

Leave a Reply

Your email address will not be published. Required fields are marked *