The whole proteome assessment estimates the proportion of the proteome made up of gene models consistent with known homologs, and the proportion that may be erroneous. It is based on a comparison with the gene content of all extant species in the same lineage as the target species.
Consistent: Proportion of genes whose closest gene families are from the selected lineage.
Contamination: Proportion of genes whose closest gene family is from another lineage and which likely come from contamination or horizontal gene transfer (multiple genes are linked to representatives of that other lineage).
Inconsistent: Proportion of genes whose closest gene family is from another lineage, but which likely result from noise (probably dubious gene models or spurious annotation of non-coding regions).
Unknown: Proportion of genes for which no close homolog was found. These could be spurious genes or species-specific orphan genes.
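As an illustration of how the four proportions relate to each other, here is a minimal sketch that assumes each gene has already been assigned exactly one of the four categories. The function name `category_proportions` and the input format (a plain list of labels) are hypothetical and not part of any tool's interface.

```python
from collections import Counter

# The four mutually exclusive categories described above.
CATEGORIES = ("Consistent", "Contamination", "Inconsistent", "Unknown")


def category_proportions(labels):
    """Return the proportion of genes in each category.

    `labels` is a hypothetical input: one category label per gene.
    """
    if not labels:
        raise ValueError("at least one gene label is required")
    unexpected = set(labels) - set(CATEGORIES)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    counts = Counter(labels)
    total = len(labels)
    # Categories with no genes are reported with a proportion of 0.0.
    return {category: counts[category] / total for category in CATEGORIES}


# Example: 10 genes, 8 consistent, 1 contaminant, 1 unknown.
example = ["Consistent"] * 8 + ["Contamination"] + ["Unknown"]
proportions = category_proportions(example)
```

Because every gene falls into exactly one category, the four proportions sum to 1.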
Genes from the first three categories can also be labeled as:
- Partial mapping
Genes for which less than 80% of the sequence shares k-mer content with their closest gene family.
- Fragments
Genes whose length is less than half the median gene length of their closest gene family.
The gene family is not represented in the proteome.
A high proportion of either of these can indicate dubious gene models or spurious genes.
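The two thresholds above (80% of the sequence, half the median length) can be sketched as simple predicates. This is only an illustration under stated assumptions: the function names, the `shared_fraction` input (the fraction of a gene's sequence sharing k-mer content with its closest gene family) and the list of family gene lengths are hypothetical, and how those values are actually computed is not covered here.

```python
from statistics import median


def is_partial_mapping(shared_fraction, threshold=0.8):
    """True if less than 80% of the sequence shares k-mer content
    with the closest gene family (shared_fraction is assumed given)."""
    return shared_fraction < threshold


def is_fragment(gene_length, family_lengths, ratio=0.5):
    """True if the gene is shorter than half the median length
    of the genes in its closest gene family."""
    return gene_length < ratio * median(family_lengths)


# Example: a family whose median gene length is 300.
family = [280, 300, 320]
short_gene_flag = is_fragment(100, family)        # 100 < 150
borderline_gene_flag = is_fragment(150, family)   # 150 is not < 150
```

Both comparisons are strict, so a gene sitting exactly on a threshold is not flagged.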
A high-quality proteome typically has a high proportion of consistent genes, no contamination, and low proportions of partial mappings and fragments.