Horticulture Research


Article|29 Dec 2022|OPEN

RAfilter: an algorithm for detecting and filtering false-positive alignments in repetitive genomic regions

Jinbao Yang^1,2,†, Xianjia Zhao^2,3,†, Heling Jiang^2,†, Yingxue Yang^2,†, Yuze Hou^2 and Weihua Pan^1,2,*

¹College of Informatics, Huazhong Agricultural University, Wuhan 430070, China.
²Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China.
³Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China.
*Corresponding author. E-mail: panweihua@caas.cn
^†Jinbao Yang, Xianjia Zhao, Heling Jiang and Yingxue Yang contributed equally to the study.

Horticulture Research 10, Article number: uhac288 (2023)
doi: https://doi.org/10.1093/hr/uhac288

Received: 10 Oct 2022
Accepted: 16 Dec 2022
Published online: 29 Dec 2022

Abstract

Telomere-to-telomere (T2T) assembly relies on the correctness of sequence alignments. However, existing aligners tend to generate a high proportion of false-positive alignments in repetitive genomic regions, which impedes the generation of T2T-level reference genomes for more important species. In this paper, we present an automatic algorithm called RAfilter for removing the false positives in the outputs of existing aligners. RAfilter takes advantage of rare k-mers, which represent copy-specific features, to differentiate false-positive alignments from correct ones. Because large eukaryotic genomes contain huge numbers of rare k-mers, a series of high-performance computing techniques such as multi-threading and bit operations are used to improve time and space efficiency. The experimental results on tandem repeats and interspersed repeats show that RAfilter was able to filter 60%–90% of false-positive HiFi alignments while removing almost no correct ones, whereas on ONT datasets the sensitivity and precision were about 80% and 50%, respectively.
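
The idea of using rare k-mers as copy-specific markers can be illustrated with a toy sketch. The abstract does not give RAfilter's actual scoring rule or data structures, so everything below is an assumption made for illustration: the function names (`encode_kmers`, `rare_kmers`, `rare_kmer_support`), the `max_count` threshold, the use of an error-free read, and the omission of reverse complements. K-mers are packed two bits per base with shift and mask operations, which is one common way bit operations are used to keep k-mer tables compact.

```python
from collections import Counter

# 2-bit code per base; any other character (e.g. N) breaks the current k-mer.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k):
    """Yield (start, packed_int) for every k-mer of seq using a rolling 2-bit encoding."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    code = 0
    valid = 0  # consecutive valid bases ending at the current position
    for i, base in enumerate(seq):
        bits = BASE_BITS.get(base)
        if bits is None:
            code, valid = 0, 0
            continue
        code = ((code << 2) | bits) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, code


def rare_kmers(reference, k, max_count=1):
    """Return the set of packed k-mers occurring at most max_count times (hypothetical threshold)."""
    counts = Counter(code for _, code in encode_kmers(reference, k))
    return {code for code, n in counts.items() if n <= max_count}


def rare_kmer_support(reference, read, ref_start, k, rare):
    """Fraction of rare k-mers in the aligned reference window that also occur in the read.

    Returns None when the window holds no rare k-mers, i.e. no copy-specific evidence.
    """
    window = reference[ref_start:ref_start + len(read)]
    expected = {code for _, code in encode_kmers(window, k) if code in rare}
    if not expected:
        return None
    observed = {code for _, code in encode_kmers(read, k)}
    return len(expected & observed) / len(expected)


# Two repeat copies that differ at a single base, separated by an N gap.
copy1 = "ACGTTGCAAGGCTTAC"
copy2 = "ACGTTGCATGGCTTAC"
reference = copy1 + "NNNN" + copy2
read = copy2  # an error-free read sampled from the second copy

rare = rare_kmers(reference, k=5)
print(rare_kmer_support(reference, read, 0, 5, rare))   # alignment to copy 1 -> 0.0
print(rare_kmer_support(reference, read, 20, 5, rare))  # alignment to copy 2 -> 1.0
```

In this toy case the alignment to the wrong copy shares none of the window's rare k-mers with the read, while the alignment to the true copy shares all of them, so a threshold on the support value would separate the two. Real reads contain sequencing errors, so a practical filter would need a tolerance that this sketch does not model.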