The MOCAT pipeline processes raw output files in FastQ format by first quality trimming and filtering the reads. In a second step, you can screen the reads for contaminants; for example, if you're analyzing bacterial sequences, you can easily remove any eukaryotic reads.
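
To make the first step concrete, here is a minimal sketch of quality trimming and filtering. It is not MOCAT's own implementation: it assumes Phred+33 quality encoding, trims only the 3' end, and uses hypothetical function names and cut-offs (`min_q`, `min_len`) chosen purely for illustration.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(c) - 33 for c in qual]


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases until the last remaining base has quality >= min_q."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_and_filter(records, min_q=20, min_len=30):
    """Yield trimmed (name, seq, qual) records that are still at least min_len long."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual
```

A read whose low-quality tail is removed survives only if enough sequence remains; reads that fall below the length cut-off are discarded.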

The second step can also be used to estimate taxonomic abundance using a reference database. Provided with MOCAT are the RefMG.v1 database (which contains single copy marker genes from 1,753 bacterial reference genomes) and a second database (marker genes extracted from 263 metagenomes and 3,496 bacterial genomes). Either database can be used to generate taxonomic profiles of metagenomes, the first using bacterial references and the second using single copy marker genes.
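
One way to picture abundance estimation from mapped reads is sketched below. It is an assumption-laden illustration rather than MOCAT's actual profiling method: it assumes you already have per-gene read counts from mapping against a marker gene database, and it normalizes each count by gene length before summing per taxon. The function name and all three input dictionaries are hypothetical.

```python
from collections import defaultdict


def relative_abundance(read_counts, gene_lengths, gene_to_taxon):
    """Return per-taxon relative abundance from length-normalized read counts.

    read_counts:   {gene_id: number of reads mapped to that gene}
    gene_lengths:  {gene_id: gene length in bases}
    gene_to_taxon: {gene_id: taxon label}
    """
    coverage = defaultdict(float)
    for gene, count in read_counts.items():
        # Normalize by length so long genes do not dominate the profile.
        coverage[gene_to_taxon[gene]] += count / gene_lengths[gene]

    total = sum(coverage.values())
    if total == 0:
        return {}
    return {taxon: cov / total for taxon, cov in coverage.items()}
```

Because the values are divided by their sum, the resulting profile adds up to 1 across all taxa that received at least one read.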

The third step is the assembly of reads into scaftigs. After assembly, an additional assembly revision step can be performed to improve the initial assembly.
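
For readers unfamiliar with the term, the sketch below illustrates one common reading of "scaftig": a gap-free stretch of a scaffold, obtained by splitting the scaffold wherever it contains a run of `N` bases. This definition and the helper name `scaffold_to_scaftigs` are assumptions for illustration; the assembler used by MOCAT may derive scaftigs differently.

```python
import re


def scaffold_to_scaftigs(scaffold, min_len=1):
    """Split a scaffold at runs of N (gaps) and keep pieces of at least min_len bases."""
    pieces = re.split(r"[Nn]+", scaffold)
    return [piece for piece in pieces if len(piece) >= min_len]
```

For example, `scaffold_to_scaftigs("ACGTNNNNGGCCNTT", min_len=3)` keeps the two stretches `ACGT` and `GGCC` and drops the short fragment `TT`.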

After assembly and/or assembly revision, genes can be predicted on the assembled sequences, and single copy marker genes can then be extracted from the predicted genes.