A chromosome-level haplotype-resolved genome assembly of oriental tobacco budworm (Helicoverpa assulta)

Xu, Yalong; Wang, Chen; Li, Zefeng; Zheng, Xueao; Kang, Zhengzhong; Lu, Peng; Zhang, Jianfeng; Cao, Peijian; Chen, Qiansi; Liu, Xiaoguang

doi:10.1038/s41597-024-03264-6

Data Descriptor
Open access
Published: 06 May 2024

Yalong Xu ORCID: orcid.org/0000-0002-9562-6158^1,2,
Chen Wang^1,2,
Zefeng Li^1,2,
Xueao Zheng^1,2,
Zhengzhong Kang^1,2,
Peng Lu^1,2,
Jianfeng Zhang^1,2,
Peijian Cao ORCID: orcid.org/0000-0001-9991-423X^1,2,
Qiansi Chen^1,2 &
Xiaoguang Liu ORCID: orcid.org/0000-0002-4506-3238³

Scientific Data volume 11, Article number: 461 (2024)

Abstract

Oriental tobacco budworm (Helicoverpa assulta) and cotton bollworm (Helicoverpa armigera) are two closely related species within the genus Helicoverpa. They have similar appearances and consistent damage patterns, often leading to confusion. However, the cotton bollworm is a typical polyphagous insect, while the oriental tobacco budworm belongs to the oligophagous insects. In this study, we used Nanopore, PacBio, and Illumina platforms to sequence the genome of H. assulta and used Hifiasm to create a haplotype-resolved draft genome. The Hi-C technique helped anchor 33 primary contigs to 32 chromosomes, including two sex chromosomes, Z and W. The final primary haploid genome assembly was approximately 415.19 Mb in length. BUSCO analysis revealed a high degree of completeness, with 99.0% gene coverage in this genome assembly. The repeat sequences constituted 38.39% of the genome assembly, and we annotated 17093 protein-coding genes. The high-quality genome assembly of the oriental tobacco budworm serves as a valuable genetic resource that enhances our comprehension of how they select hosts in a complex odour environment. It will also aid in developing an effective control policy.

Background & Summary

The oriental tobacco budworm Helicoverpa assulta (Guenée) and cotton bollworm H. armigera (Hübner), commonly regarded as sibling species, belong to the order Lepidoptera and the family Noctuidae. They are widely distributed across Africa, Oceania, and Southeast Asia¹, and both species are significant pests in agricultural systems. Moreover, they are commonly used as research materials in entomology and are supported by a substantial body of scientific studies. Morphologically, the two species are nearly indistinguishable at the egg, larval, and pupal stages and can be told apart only at the adult stage by certain characteristics^2,3. Physiologically, they share the same major sex pheromone components, (Z)-9-hexadecenal and (Z)-11-hexadecenal⁴. Despite sharing some characteristics, they display marked variations in host range, resistance to pesticides, ratios of pheromone components, and reproductive capacity. The cotton bollworm is a typical polyphagous insect, able to feed on over 180 plant species, including cotton, maize, soy, wheat, and rice⁵.

Meanwhile, the oriental tobacco budworm primarily infests plants from the Solanaceae family, such as tobacco, tomato, and peppers^6,7. A noteworthy phenomenon is observed in the relationship between cotton bollworm and oriental tobacco budworm, where despite being distinct species, they exhibit significant genetic similarity, enabling them to interbreed and generate diverse progeny. Specifically, the successful crossing of female H. assulta with male H. armigera resulted in viable and fertile F1 hybrids. Conversely, the reverse cross of female H. armigera with male H. assulta produced F1 hybrids, which included fertile males and abnormal individuals but lacked fertile females⁷. Additionally, both species can successfully consume spicy pepper fruits; however, research findings revealed that H. assulta demonstrates a higher tolerance to capsaicin derived from Capsicum annuum compared to H. armigera⁶. Therefore, H. assulta is an exemplary model for investigating evolutionary patterns in insect feeding habits and elucidating the underlying mechanisms governing interactions with host plants.

This study presents a high-quality haplotype-resolved genome assembly of H. assulta at the chromosome level, achieved through the use of PacBio long reads, Nanopore ultra-long reads, and high-throughput chromosome conformation capture (Hi-C) data. Using Hifiasm⁸, we created three haplotype-resolved draft genomes (primary, paternal, and maternal) with genome sizes of 441.6 Mb, 395.38 Mb, and 404.67 Mb, respectively. After correction of sequence errors and removal of haplotigs, the primary genome was 415.19 Mb in size, with a contig N50 length of 13.99 Mb. Notably, all 33 primary contigs were successfully anchored onto 32 chromosomes, including both the Z and W sex chromosomes.

Furthermore, the genome assembly exhibited a high degree of completeness, as evidenced by the BUSCO analysis, which revealed 99.0% gene coverage. Repeat sequences constituted 38.39% of the genome assembly. A total of 17,093 protein-coding genes were identified, with 16,889 being functionally annotated. Transcriptome analysis indicated that 14,681 genes were expressed in at least one sample.

Methods

Sample collection

The larvae of H. assulta were collected from tobacco fields in the Xu Chang campus of Henan Agricultural University (113.80° E, 34.13° N) and reared continuously for more than ten generations in the laboratory. The insects were reared on an artificial diet under controlled conditions at 26 ± 1 °C, with a 14:10 (L:D) photoperiod cycle and 85% ± 5% relative humidity. Pupae and newly molted adults were selected for sequencing, and the adult insects that were used for sequencing had their wings removed before the process.

Genome sequencing and size estimation

The genomic DNA for PacBio HiFi sequencing was extracted from a newly molted female adult using the QIAamp DNA Mini Kit (QIAGEN). DNA integrity was assessed using the Agilent 4200 Bioanalyzer (Agilent Technologies, Palo Alto, California). Subsequently, 15 μg of genomic DNA was sheared using g-Tubes (Covaris) and concentrated with AMPure PB magnetic beads. Each SMRTbell library was prepared using the Pacific Biosciences SMRTbell Express Template Prep Kit 2.0. The constructed libraries underwent size selection on a BluePippin™ system for molecules ≥ 15 kb, followed by primer annealing and the binding of SMRTbell templates to polymerases using the DNA/Polymerase Binding Kit. Sequencing was performed on the Pacific Biosciences Sequel II platform for 30 hours at Annoroad Gene Technology. In total, 1,933,848 high-quality HiFi reads were generated with a combined length of 38,425,265,244 bp; detailed information about the HiFi reads is listed in Table 1.
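The read count and total yield reported above imply a mean read length and a sequencing depth. The sketch below works through that arithmetic. It is illustrative only: the function name `read_stats` is hypothetical, and using the final 415.19 Mb primary assembly as the denominator for depth is an assumption, since the text does not state which genome size the authors would use.

```python
def read_stats(n_reads, total_bp, genome_bp):
    """Return (mean read length in bp, fold coverage) for a read set.

    genome_bp is an assumed reference size; here the final primary
    assembly length is used purely for illustration.
    """
    if n_reads <= 0 or genome_bp <= 0:
        raise ValueError("n_reads and genome_bp must be positive")
    mean_len = total_bp / n_reads
    coverage = total_bp / genome_bp
    return mean_len, coverage


# Figures quoted in the text for the PacBio HiFi data set.
hifi_mean, hifi_cov = read_stats(1_933_848, 38_425_265_244, 415_190_000)
print(f"mean HiFi read length: {hifi_mean:.0f} bp, depth: {hifi_cov:.1f}x")
```

The same function can be applied to the Oxford Nanopore totals given later in this section.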

Table 1 Summary statistics of the PacBio HiFi reads.

To perform Illumina second-generation DNA sequencing, one newly molted adult female and its parents were collected, rinsed with pre-cooled 0.9% saline to remove contamination, and frozen in liquid nitrogen. Genomic DNA was extracted from the collected samples using the sodium dodecyl sulfate (SDS) extraction method. After testing the DNA quality and integrity, the DNA was randomly sheared with a Covaris ultrasonic disruptor. Illumina paired-end sequencing libraries were prepared using the Nextera DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA). Sequencing was performed using the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). Raw reads were filtered using Fastp⁹ software (version 0.21.0) with the following criteria: removal of reads with adapter contamination, removal of reads with an N proportion greater than 5%, and removal of reads in which 50% or more of the bases are low quality, defined as a quality value less than or equal to 19 (Table 2).
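The N-content and low-quality criteria listed above can be written as a simple per-read predicate. The following is a minimal sketch of those two rules only, not a reproduction of how Fastp implements them. The function name `passes_filter`, its parameter names, and the Phred+33 quality encoding are assumptions made for illustration, and adapter detection is left out.

```python
def passes_filter(seq, qual, max_n_frac=0.05, low_q=19,
                  max_low_frac=0.5, phred_offset=33):
    """Apply the two read-level criteria described in the text.

    seq  : read bases
    qual : FASTQ quality string (Phred+33 is assumed here)
    A read is rejected if more than max_n_frac of its bases are N, or if
    max_low_frac or more of its bases have quality <= low_q.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if not seq:
        return False
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac > max_n_frac:
        return False
    low = sum(1 for ch in qual if ord(ch) - phred_offset <= low_q)
    return low / len(qual) < max_low_frac


# A 100 bp read with uniformly high quality (Q40, encoded as "I") is kept.
print(passes_filter("ACGT" * 25, "I" * 100))
```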

Table 2 Summary statistics of the Illumina genomic DNA short reads.

The Hi-C libraries were constructed using standard protocols as previously described¹⁰, with one newly molted female used as the input. The Hi-C sequencing library was then amplified by PCR (12–14 cycles) and sequenced on the Illumina HiSeq instrument, generating 154,726,330 paired clean reads with 2 × 150-bp reads.

We collected female pupae specifically for the construction of Oxford Nanopore libraries. The libraries were prepared using the standard protocol for Oxford Nanopore sequencing, specifically the Ultra-Long DNA Sequencing Kit protocol (SQK-ULK001). The purified library was loaded onto primed R9.4 Spot-On Flow Cells and sequenced using a PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK) with 72-hour runs at Novogene Corporation Inc., Tianjin, China. Basecalling of raw fast5 format data was performed using Oxford Nanopore GUPPY¹¹ software, removing low-quality reads with a sequencing quality value (Q) less than seven and retaining high-quality pass reads. The quality assessment report was generated using NanoPlot¹² v1.38.1. Finally, 254,496 Oxford Nanopore raw reads were generated with a combined length of 25,843,417,397 bp, and the detailed information is listed in Table 3.
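The quality cut-off above (discarding reads with Q below seven) depends on how a per-read quality is derived from per-base scores. The sketch below uses the common convention of averaging error probabilities rather than Phred values. Whether the basecaller used here computes the read score in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def mean_read_q(quals):
    """Mean read quality from per-base Phred scores.

    Phred scores are converted to error probabilities, averaged, and
    converted back, so a few very poor bases weigh heavily.
    """
    if not quals:
        raise ValueError("empty quality list")
    mean_err = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)


def is_pass_read(quals, min_q=7.0):
    """True if the read meets the minimum mean quality (threshold from the text)."""
    return mean_read_q(quals) >= min_q


# Arithmetic mean of [3, 3, 20] is about 8.7, but the error-based mean is lower.
print(round(mean_read_q([3, 3, 20]), 2), is_pass_read([3, 3, 20]))
```

Averaging in probability space is the reason a read can fail the filter even when the plain average of its Phred scores is above the threshold.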

Table 3 Summary statistics of the Oxford Nanopore raw reads.

Genomic characteristics, such as genome size, repeat content, and heterozygosity rate, were estimated based on k-mer frequencies. Using k-mer analysis (k = 21) of Illumina short reads and PacBio HiFi long reads with Jellyfish¹³ v2.3.0 and GenomeScope 2.0¹⁴, we estimated the overall genome size of H. assulta to be approximately 350 Mb. For the Illumina short reads, the genome size was estimated to be 354.98 Mb, with a heterozygosity rate of 2.08%; for the PacBio HiFi reads, the genome size was estimated to be 348.38 Mb, with a heterozygosity rate of 2.04% (Fig. 1).
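The idea behind k-mer based size estimation can be illustrated in a few lines: count canonical k-mers, build a depth histogram, and divide the total number of k-mers by the depth of the main peak. This is a deliberately simplified sketch and not what GenomeScope does, which fits a mixture model to the histogram. In particular, the naive peak search below assumes a single dominant peak, which may not hold for a genome with roughly 2% heterozygosity. All function names are hypothetical.

```python
from collections import Counter

_COMP = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)


def count_kmers(reads, k=21):
    """Count canonical k-mers across reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return counts


def histogram(counts):
    """Map each depth to the number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


def estimate_genome_size(histo, min_depth=2):
    """Naive estimate: total k-mer occurrences divided by the peak depth.

    Depths below min_depth are ignored as likely sequencing errors
    (an assumed cut-off, chosen for illustration).
    """
    usable = {d: n for d, n in histo.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak = max(usable, key=usable.get)
    total = sum(d * n for d, n in usable.items())
    return total / peak


toy_histo = {1: 100, 10: 50, 20: 1000, 21: 10}
print(estimate_genome_size(toy_histo))
```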

Genome assembly

The chromosome-level haplotype-resolved genome assembly with trio binning was achieved using Hifiasm⁸ v0.19.5 software; this involved incorporating Illumina short paired-end reads from the parents, Illumina Hi-C paired-end reads, ultra-long ONT reads, and PacBio HiFi reads. The primary contigs and the contigs of the two other haplotypes (paternal and maternal) assembled by Hifiasm were further refined using NextPolish2¹⁵ v0.1.0 software. This refinement used PacBio HiFi long reads and Illumina short reads and produced three draft genome assemblies.

Certain regions in a genome with high genetic diversity result in separate primary contigs for each haplotype instead of a single contig with an associated haplotig¹⁶. This can be an issue for downstream analysis of both haploid and phased-diploid assemblies. Hifiasm⁸ is a powerful assembler that can generate high-quality chromosome-level assemblies. Compared to other assemblers, it produces longer contigs and can resolve more segmental duplications. Using Hifiasm, we created three haplotype-resolved draft genomes (primary, paternal, and maternal) with genome sizes of 441.6 Mb, 395.38 Mb, and 404.67 Mb, respectively (Table 4). Although Hifiasm can eliminate most duplications between haplotigs, it may incorrectly identify or fail to distinguish some heterozygous sequences. To address this issue, we used the Purge Haplotigs¹⁷ v1.1.2 software with long HiFi reads to remove haplotigs remaining in the three draft assembled genomes.

Table 4 Summary statistics of three draft Hifiasm assemblies for H. assulta.

Assembly completeness was estimated by BUSCO¹⁸ v5.4.7 analysis and Illumina short-read mapping; the lineage dataset used in BUSCO was insecta_odb10, and bowtie2¹⁹ v2.5.1 was used to align the short reads to the purged genome assembly. The analysis identified 99.0% (single-copy genes: 98.7%, duplicated genes: 0.3%), 0.5%, and 0.5% of the 1,367 BUSCO genes as complete, fragmented, and missing, respectively. These results suggest that the assembled genome is highly complete.

Genome scaffolding

The high-quality Hi-C clean reads were mapped to the trimmed draft genome using BWA²⁰ v0.7.17, and unmapped and multi-mapped reads were filtered out using Samtools²¹ v1.16. The unique, high-quality paired-end reads mapped close to the restriction sites were retained for downstream analysis in the juicer²² v1.6 and 3d-dna²³ v180922 pipeline. Juicebox²³ was used to cluster the contigs into groups, and the order of the contigs was confirmed based on the strength of interactions between read pairs. When grouping contigs based on Hi-C data, we observed that 33 contigs were grouped into 32 clusters (Fig. 2), with only one cluster (Chr16) containing two contigs. To ensure the accuracy of the connection between these two contigs, we used paired-end information from the sequencing data and, where telomere repeat sequences were present, confirmed that they were located at the ends of the sequence. After correcting sequence errors and removing haplotigs, the final genome stands at 415.19 Mb, with an average length of 12.58 Mb after scaffolding (Table 5).
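Summary statistics such as N50 and mean sequence length are straightforward to compute from a list of sequence lengths. The sketch below shows one conventional definition of N50 (the length of the shortest sequence in the smallest set of longest sequences covering at least half of the total). The function names are hypothetical and the toy lengths are not taken from the assembly. As a side note on the arithmetic, dividing the reported 415.19 Mb by 33 gives 12.58 Mb, which matches the average quoted above.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths supplied")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def mean_length(total, count):
    """Mean sequence length given a total length and a sequence count."""
    return total / count


# Toy example with made-up contig lengths.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
# Arithmetic check against the totals reported in the text (Mb).
print(round(mean_length(415.19, 33), 2))
```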

Table 5 Summary statistics of the final H. assulta genome assembly.

RNA sequencing and analysis

We collected fourth and fifth instar larvae, female and male pupae, and newly emerged male and female adult moths for transcriptome sequencing and gene expression analysis. Before preparation and sequencing, we removed the midguts of the larvae and the wings of the adults. Subsequently, total RNA was extracted from the aforementioned samples using Trizol reagent (Invitrogen, USA) following the manufacturer’s protocol. Illumina RNA sequencing libraries were prepared by Annoroad Gene Technology Company. We performed RNA sequencing on 18 samples and obtained RNA-seq data with a total length of about 1209 gigabytes (Table 6). The total number of sequences is around 807 million, with an average proportion of bases having a quality greater than Q30 at 93.9% and an average proportion of clean reads at 97.25%. Clean data was obtained by removing adapters, low-quality reads, and reads with a high content of unknown bases. All RNA-seq data sequenced in this project have been deposited into the European Nucleotide Archive (ENA) with accession number PRJEB70911⁵³. In addition to our sequencing data, we downloaded 39 transcriptome datasets from the NCBI SRA database, which were merged with our dataset. Each sample’s data was aligned to the genome using HISAT2²⁴ to assess gene transcription levels. The analysis shows that more than half of the transcriptome samples have a genome alignment rate of over 85%, and the genome alignment rate of the samples in this project (group a) is consistently around 90% (Fig. 3).

Table 6 Summary statistics of the Illumina RNA-seq short reads.

Genome annotation

We first aligned the RNA-seq data mentioned above to the final genome using HISAT2²⁴ v2.2.1 and then predicted the transcripts with StringTie²⁵ v2.2.1. TACO²⁶ v0.73 was employed to merge the transcripts, retaining the high-quality ones. Next, we utilized TransDecoder v5.7.1 (https://github.com/TransDecoder/TransDecoder) to predict the protein-coding sequence. We initially built a de novo transposable elements (TE) library using the EDTA²⁷ v2.1.0 pipeline for repeat sequence annotation with the CDS file obtained from the TransDecoder results. Subsequently, we masked repeat sequences across the H. assulta genome using RepeatMasker²⁸ v4.1.2 against the de novo species-specific TE library generated by EDTA and the insect data from Dfam²⁹ 3.6.

Following the masking of these TE sequences, we integrated ab initio prediction, homology searching, and transcriptome-based approaches to predict protein-coding genes using the BRAKER3³⁰ pipeline with the parameters “--bam RNAseqs.bam --prot_seq=Arthropoda.10.pep.fa --min_contig=10000 --addUTR=on --gff3 --threads=48”. The annotated proteins of all arthropods were downloaded from OrthoDB³¹ v10, and RNA-Seq alignment bam files were generated by HISAT2. We used eggNOG-mapper³² v2.1.12 for functional annotation. Additionally, we searched the Uniprot³³ database using Blastp³⁴ v2.14.1+ and the Pfam³⁵ and KOfam³⁶ databases using HMMER³⁷ v3.4.

In the H. assulta genome, a total of 159.38 Mb of sequence (38.39%) was identified as repetitive elements, as shown in Table 7. A total of 17,093 protein-coding genes were identified, of which 16,889 were functionally annotated, and expression analysis indicated that 14,681 genes were expressed in at least one sample (Table 8). In addition, we identified 86 rRNAs and 62 tRNAs. The Circos plot of the functional elements we identified is shown in Fig. 4. All annotation files have been deposited in figshare³⁸.

Table 7 Repeat element statistics of the H. assulta genome.

Table 8 The statistical data on chromosome length, total gene count, and number of expressed genes.

Sex chromosomes analysis

To identify the sex chromosomes (Z and W chromosomes) in H. assulta, we resequenced one female pupa and one male pupa using Illumina HiSeq platforms to obtain approximately 50× coverage. In males, the normalized coverage of sequence reads from the Z chromosome should be twice that of females. On the other hand, males ideally have no DNA contribution from the W chromosome, while the autosomes should have equal coverage between males and females. Therefore, a difference in sequencing coverage ratio is expected for both Z and W chromosomes between sexes but not for autosomes. This difference can be used to identify sex-linked chromosomes. Using salmon³⁹, we computed the normalized coverage levels of chromosomes by mapping the resequencing reads to the final H. assulta genome with default parameters. To analyze and visualize the log2 of the male:female (M:F) coverage ratio, we used the R package changepoint v2.2.4 (https://github.com/rkillick/changepoint/). Among all the chromosomes, the sequencing depth of the longest chromosome (Chr1) was twice as high in males as in females, leading to the conclusion that Chr1 is the Z chromosome (Fig. 5). Ideally, the length of the W chromosome should be similar to that of the Z chromosome, and the W chromosome should show very low sequencing depth in males. Only the second-longest chromosome (Chr2) meets both criteria, leading to the conclusion that Chr2 is the W chromosome.
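The expectation described above translates into a log2 M:F ratio near 1 for the Z chromosome, near 0 for autosomes, and strongly negative for the W chromosome. The sketch below classifies a chromosome from a pair of normalized coverage values. It is illustrative only: the study used changepoint analysis in R, whereas this example uses fixed thresholds, and the thresholds, the pseudocount, and the function names are all assumptions.

```python
import math


def log2_mf_ratio(male_cov, female_cov, pseudo=0.01):
    """log2 of male:female normalized coverage; the pseudocount avoids log(0)."""
    return math.log2((male_cov + pseudo) / (female_cov + pseudo))


def classify_chromosome(male_cov, female_cov, z_min=0.5, w_max=-2.0):
    """Label a chromosome from its M:F coverage ratio.

    z_min and w_max are illustrative cut-offs, not values from the study.
    """
    ratio = log2_mf_ratio(male_cov, female_cov)
    if ratio >= z_min:
        return "Z-like"
    if ratio <= w_max:
        return "W-like"
    return "autosome-like"


# Coverage normalized so that an autosome in either sex is about 1.0.
for name, m, f in [("autosome", 1.0, 1.0), ("Z", 1.0, 0.5), ("W", 0.02, 0.5)]:
    print(name, round(log2_mf_ratio(m, f), 2), classify_chromosome(m, f))
```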

Synteny analysis

To compare the genomic arrangement of H. assulta with that of its closely related species, the cotton bollworm (H. armigera), we used annotated protein sequences anchored on chromosomes to perform synteny analysis with MCScanX⁴⁰ under default parameters. From the NCBI genome database, we obtained the cotton bollworm reference genome HaSCD2 (accession number: GCF_023701775.1). The analysis showed that most of the chromosomes of the two moths exhibited good collinearity, with only a few chromosome fragments undergoing fission and fusion events. For example, although most of Chr6 of H. assulta was syntenic to Chr4 of H. armigera, a small part was syntenic to Chr29. We visualized the results using TBtools⁴¹. Because the cotton bollworm reference genome lacks a W chromosome, we did not observe any collinearity between the W chromosome of H. assulta and any chromosome in the cotton bollworm genome (Fig. 6).

Phylogenetic reconstruction

To establish the evolutionary relationship between the oriental tobacco budworm and other closely related species, we retrieved protein sequences of six species belonging to the Noctuidae family and one coleopteran insect (T. castaneum) from the NCBI genome database; only the longest transcript for each gene was considered. OrthoFinder⁴² v2.5.4, with DIAMOND⁴³ v2.1.8, was used to identify orthologs and homologs. OrthoFinder assigned 125,918 genes (96.9%) to 14,619 orthogroups. At least 50% of all genes belonged to orthogroups with eight or more genes (G50 was 8) and were contained in the largest 5,245 orthogroups (O50 was 5,245). There were 6,498 orthogroups with all species present, and 2,822 of these consisted entirely of single-copy genes.

For the phylogenetic analysis, we constructed a maximum likelihood phylogenetic species tree using the STAG method in the OrthoFinder⁴² program, rooted using STRIDE⁴⁴. Multiple sequence alignments of single-copy gene families were performed using MAFFT⁴⁵ v7.520 with the “-auto” parameter, and the alignment results were trimmed using trimAL⁴⁶ v1.4.rev15 with the “-automated1” setting. The alignments of all single-copy orthologs were concatenated to form a supergene.
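Concatenating per-gene alignments into a single supergene amounts to joining, for each species, its aligned sequences in a fixed gene order. The sketch below shows that step on in-memory data. It is a minimal illustration under stated assumptions: each alignment is represented as a dictionary from species name to aligned sequence, every alignment contains the same species, and the function name `concatenate_alignments` is hypothetical rather than part of any tool named in the text.

```python
def concatenate_alignments(alignments):
    """Join per-gene alignments into one supergene sequence per species.

    alignments : list of dicts mapping species name -> aligned sequence.
    Every dict must contain the same species, and all sequences within
    one dict must have equal length (as trimmed alignments do).
    """
    if not alignments:
        raise ValueError("no alignments supplied")
    species = sorted(alignments[0])
    parts = {sp: [] for sp in species}
    for index, aln in enumerate(alignments):
        if set(aln) != set(species):
            raise ValueError(f"alignment {index} has a different species set")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError(f"alignment {index} has unequal sequence lengths")
        for sp in species:
            parts[sp].append(aln[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Two toy single-copy gene alignments for two species (placeholder names).
gene1 = {"sp_a": "MK-L", "sp_b": "MKAL"}
gene2 = {"sp_a": "GTP", "sp_b": "G-P"}
print(concatenate_alignments([gene1, gene2]))
```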

We then used mcmctree from the PAML⁴⁷ package to estimate the divergence times of the species in the tree. Divergence information obtained from the TimeTree⁴⁸ database (S. frugiperda vs S. litura 16.9–19.1 MYA, N. ni 70–80 MYA, and T. castaneum 195–361.6 MYA) was used in mcmctree to constrain the divergence estimates. Subsequently, we visualized the time tree using FigTree (https://github.com/rambaut/figtree). The divergence between H. assulta and H. armigera was estimated to have occurred around 6.2 million years ago.

To analyze the expansion and contraction of gene families, we utilized the matrix tables of gene family orthologs obtained from OrthoFinder results. We applied these tables as inputs in CAFE⁴⁹ v5.0.0 and set a cut-off p-value of <0.05, allowing us to examine each gene family’s expansion and contraction (Fig. 7).

Data Records

The Nanopore, Hi-C, and Illumina sequencing data used for the genome assembly and annotation have been submitted to the European Nucleotide Archive (ENA) with accession number PRJEB70911⁵⁰. The final chromosome assembly has been submitted to the National Genomic Data Center (NGDC) under the accession GCA_963856015.1⁵¹. The H. armigera genome was downloaded from the NCBI genome database⁵². All public RNA-seq datasets used in the gene expression analysis were downloaded from the NCBI SRA database, and the corresponding project IDs were PRJEB6594⁵³, PRJNA587871⁵⁴, PRJNA590047⁵⁵, PRJNA592822⁵⁶, and PRJNA261645⁵⁷.

Technical Validation

The chromosome-level primary genome assembly was 415.19 Mb. For quantitative assessment of genome assembly, BUSCO¹⁸ analysis results showed that 99.0% of BUSCO genes (insecta_odb10) were successfully identified in the genome assembly, suggesting a remarkably complete assembly of the H. assulta genome. In addition, the genome alignment rate of HiFi reads is as high as 99.98%. The Hi-C heatmap revealed a well-organized interaction contact pattern along the diagonals within/around the chromosome inversion region, which indirectly confirmed the accuracy of the chromosome assembly.

To verify the completeness of our chromosome assembly, we analyzed telomere repeat sequences on each chromosome based on the genome repeat annotation results. First, we examined the telomere repeat motifs of Lepidoptera in TeloBase⁵⁸ and found that most repeat motifs ranged from 5 to 9 bp in length, with (TTAGG)n/(CCTAA)n as the main telomeric motif. We then identified regions within 15 kb of both chromosome ends where the repeat sequences exceeded 1 kb in length and the repeat motif ranged from 5 to 9 bp. We found that 21 chromosomes contain the typical telomeric motif (TTAGG)n/(CCTAA)n or a variant of the motif within 15 kb of both ends, while the remaining 11 chromosomes have the typical telomeric motif or a variant of the motif on at least one end (Table 9).
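A search for tandem telomeric repeats near chromosome ends can be sketched with a regular expression. The example below looks only for exact tandem copies of TTAGG or its reverse complement CCTAA within a window at each end. It is a simplified illustration, not the procedure used in the study, which started from the repeat annotation and also considered motif variants of 5 to 9 bp. The function name, the `min_copies` parameter, and the toy sequence are assumptions.

```python
import re

_COMP = str.maketrans("ACGT", "TGCA")


def telomere_runs(seq, motif="TTAGG", window=15_000, min_copies=3):
    """Find tandem runs of a motif (either strand) near both sequence ends.

    Returns {"start": [...], "end": [...]}, each a list of
    (run_start, run_end, motif_as_found) in 0-based, end-exclusive
    coordinates of the full sequence.
    """
    seq = seq.upper()
    rc = motif.translate(_COMP)[::-1]
    pattern = re.compile(
        f"(?:{motif}){{{min_copies},}}|(?:{rc}){{{min_copies},}}"
    )
    n = len(seq)
    regions = {"start": (0, min(window, n)), "end": (max(n - window, 0), n)}
    hits = {}
    for name, (lo, hi) in regions.items():
        hits[name] = [
            (lo + m.start(), lo + m.end(), m.group(0)[:len(motif)])
            for m in pattern.finditer(seq[lo:hi])
        ]
    return hits


# Toy chromosome: CCTAA repeats at the start, TTAGG repeats at the end.
toy = "CCTAA" * 10 + "ACGT" * 100 + "TTAGG" * 8
print(telomere_runs(toy, window=100))
```

A length threshold such as the 1 kb cut-off mentioned above could be applied afterwards by filtering on `run_end - run_start`.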

Table 9 Information of the telomere repeat sequence motif within 15 kb from both ends of the chromosomes with a length of over 1 kb and closest to the ends.

To further validate the sex chromosome assignment, we used minimap2 to align the contigs of the two haplotypes (paternal and maternal) generated by Hifiasm to the final primary genome. The alignment revealed that contigs from the paternal haplotype could be aligned with all chromosomes except Chr2, while those from the maternal haplotype could be aligned with all chromosomes except Chr1 (Fig. 8). It is well established that sex determination in the oriental tobacco budworm relies on two sex chromosomes, Z and W, where females have a Z-W genotype and males have Z-Z. For this study, we used a single female insect as the material for genome sequencing. The analysis above reaffirmed our conclusion that Chr1 is the Z sex chromosome and Chr2 is the W sex chromosome.

Code availability

All bioinformatic tools were executed following their respective protocols and manuals. Software versions are given in Methods. Detailed parameters for some of the tools are listed below.

Genome size estimation

jellyfish count -C -m 21 -s 50000000000 -t 32 reads_R*.fq -o reads.jf

jellyfish histo -t 32 reads.jf >reads.histo

genomescope.R -i reads.histo -o output_dir -k 21

Genome assembly

hifiasm -o hass --primary -t 48 --h1 hic_read1.fq.gz --h2 hic_read2.fq.gz \
--ul ont.reads.fq.gz hifi_reads.fastq.gz 2> asm.log

yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz

yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz

hifiasm -o hass -t 48 -1 pat.yak -2 mat.yak /dev/null 2> asm.trio.log

Purge haplotigs

minimap2 -t 48 -ax map-hifi hass.p_ctg.fa hifi_reads.fastq.gz --secondary=no | samtools sort -@ 48 -m 1G -o hifi_read.aln.bam -T tmp.align

purge_haplotigs hist -b hifi_read.aln.bam -g hass.p_ctg.fa -t 48

purge_haplotigs cov -i hifi_read.aln.bam.gencov -l 15 -m 68 -h 140

purge_haplotigs purge -g hass.p_ctg.fa -c coverage_stats.csv -t 48

Genome sequences correction

yak count -t 48 -k 21 -b 37 -o k21.yak femal.illumina.reads.gz

yak count -t 48 -k 31 -b 37 -o k31.yak femal.illumina.reads.gz

nextPolish2 -t 48 -o curated.np2.fasta hifi_read.aln.bam curated.fasta k21.yak k31.yak

Hi-C data analysis

juicer.sh -s DpnII -g hass -z curated.np2.fasta -t 60 -p chrom.sizes

Busco analysis

busco -m genome -i genome.fasta -l insecta_odb10 -o busco_out --cpu 45 --offline

HiFi reads mapping

minimap2 -t 48 -ax map-hifi genome.fasta hifi_reads.fastq.gz > hifi_read.aln.sam

Transcript assembling

hisat2 -p 48 -q -x genome.index -1 $j.1.fq.gz -2 $j.2.fq.gz -S $j.sam

samtools view -bS -@ 10 -o $j.bam $j.sam

samtools sort -@ 10 -o $j.sorted.bam $j.bam

stringtie $j.sorted.bam -p 16 -o $j.gtf

ls *.gtf > gtf.list

taco_run -p 16 gtf.list

Repeat annotation

EDTA.pl --genome genome.fa --cds transcript.cds --sensitive 1 --threads 45 --anno 1 --overwrite 1 --species others --force 1

RepeatMasker -lib repeat.lib -pa 48 -html -xsmall -gff genome.fa > repeatmasker.log

Gene prediction

braker.pl --species=hass \
--genome=genome.fa.mod.MAKER.masked \
--bam rna.aln.bam \
--prot_seq=Arthropoda.10.pep.fa \
--gff3 --threads=48 --workingdir=braker3_out --min_contig=10000 --overwrite --addUTR=on

Genome annotation

emapper.py -i pep.fa -o pep.fa --itype proteins --cpu 32 --excel --evalue 1.0e-5

pfam_scan.pl -fasta pep.fa -dir PfamScan/data/35.0 -outfile pfam_out.tbl -e_seq 1.0e-5 -e_dom 1.0e-5 -cpu 8

blastp -query pep.fa -db tremble_invertebrates -evalue 1.0e-5 -num_threads 16 -out blastp.tremble.out -max_target_seqs 1 -outfmt 6 -subject_besthit

References

Fitt, G. P. The Ecology of Heliothis Species in Relation to Agroecosystems. Annu. Rev. Entomol 34, 17–53 (1989).
Zhang, J. C. Y.-C. W. X. C. Y.-J. J. D.-X. A simple and reliable method for discriminating between Helicoverpa armigera and Helicoverpa assulta (Lepidoptera: Noctuidae). Insect Science 18, 629–634 (2011).
Li, H., Zhang, H., Guan, R. & Miao, X. Identification of differential expression genes associated with host selection and adaptation between two sibling insect species by transcriptional profile analysis. BMC Genomics 14, 582 (2013).
Zhao, X. C., Yan, Y. H. & Wang, C. Z. Behavioral and electrophysiological responses of Helicoverpa assulta, H. armigera (Lepidoptera: Noctuidae), their F1 hybrids and backcross progenies to sex pheromone component blends. J Comp Physiol A Neuroethol Sens Neural Behav Physiol 192, 1037–47 (2006).
Wu, K. M. & Guo, Y. Y. The evolution of cotton pest management practices in China. Annu Rev Entomol 50, 31–52 (2005).
Ahn, S. J., Badenes-Perez, F. R. & Heckel, D. G. A host-plant specialist, Helicoverpa assulta, is more tolerant to capsaicin from Capsicum annuum than other noctuid species. J Insect Physiol 57, 1212–9 (2011).
Zhao, X. C. et al. Hybridization between Helicoverpa armigera and Helicoverpa assulta (Lepidoptera: Noctuidae): development and morphological characterization of F1 hybrids. Bull Entomol Res 95, 409–16 (2005).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Dryden, N. H. et al. Unbiased analysis of potential targets of breast cancer susceptibility loci by Capture Hi-C. Genome Res 24, 1854–68 (2014).
Sherathiya, V. N., Schaid, M. D., Seiler, J. L., Lopez, G. C. & Lerner, T. N. GuPPy, a Python toolbox for the analysis of fiber photometry data. Sci Rep 11, 24212 (2021).
De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34, 2666–2669 (2018).
Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–70 (2011).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 11, 1432 (2020).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
Pryszcz, L. P., Nemeth, T., Gacser, A. & Gabaldon, T. Genome comparison of Candida orthopsilosis clinical strains reveals the existence of hybrids between two distinct subspecies. Genome Biol Evol 6, 1069–78 (2014).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19, 460 (2018).
Manni, M., Berkeley, M. R., Seppey, M. & Zdobnov, E. M. BUSCO: Assessing Genomic Data Quality and Beyond. Curr Protoc 1, e323 (2021).
Langdon, W. B. Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks. BioData Min 8, 1 (2015).
Jo, H. & Koh, G. Faster single-end alignment generation utilizing multi-thread for BWA. Biomed Mater Eng 26(Suppl 1), S1791–6 (2015).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–9 (2009).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst 3, 99–101 (2016).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput Biol 18, e1009730 (2022).
Niknafs, Y. S., Pandian, B., Iyer, H. K., Chinnaiyan, A. M. & Iyer, M. K. TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat Methods 14, 68–70 (2017).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol 20, 275 (2019).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, 4.10.1–4.10.14 (2009).
Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 12, 2 (2021).
Bruna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform 3, lqaa108 (2021).
Zdobnov, E. M. et al. OrthoDB in 2020: evolutionary and functional annotations of orthologs. Nucleic Acids Res 49, D389–D393 (2021).
Cantalapiedra, C. P., Hernandez-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol 38, 5825–5829 (2021).
UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489 (2021).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res 49, D412–D419 (2021).
Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).
Potter, S. C. et al. HMMER web server: 2018 update. Nucleic Acids Res 46, W200–W204 (2018).
Xu, Y. Gene function annotation of Helicoverpa assulta. figshare. Dataset. https://doi.org/10.6084/m9.figshare.24899421 (2023).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14, 417–419 (2017).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res 40, e49 (2012).
Chen, C. et al. TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data. Mol Plant 13, 1194–1202 (2020).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019).
Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368 (2021).
Emms, D. M. & Kelly, S. STRIDE: Species Tree Root Inference from Gene Duplication Events. Mol Biol Evol 34, 3267–3278 (2017).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30, 772–80 (2013).
Capella-Gutierrez, S., Silla-Martinez, J. M. & Gabaldon, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–3 (2009).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24, 1586–91 (2007).
Kumar, S. et al. TimeTree 5: An Expanded Resource for Species Divergence Times. Mol Biol Evol 39 (2022).
Mendes, F. K., Vanderpool, D., Fulton, B. & Hahn, M. W. CAFE 5 models variation in evolutionary rates among gene families. Bioinformatics 36, 5516–5518 (2021).
European Nucleotide Archive https://identifiers.org/ena.embl:PRJEB70911 (2023).
European Nucleotide Archive https://www.ebi.ac.uk/ena/browser/view/GCA_963856015 (2023).
NCBI genome database https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_023701775.1 (2023).
European Nucleotide Archive https://identifiers.org/ena.embl:PRJEB6594 (2024).
European Nucleotide Archive https://identifiers.org/ena.embl:PRJNA587871 (2023).
European Nucleotide Archive https://identifiers.org/ena.embl:PRJNA590047 (2023).
European Nucleotide Archive https://identifiers.org/ena.embl:PRJNA592822 (2024).
European Nucleotide Archive https://identifiers.org/ena.embl:PRJNA261645 (2024).
Lycka, M. et al. TeloBase: a community-curated database of telomere sequences across the tree of life. Nucleic Acids Res 52, D311–D321 (2024).
Xu, Y. RNA-seq analysis of oriental tobacco budworm (Helicoverpa assulta). figshare. Dataset. https://doi.org/10.6084/m9.figshare.24884526 (2023).
Xu, Y. The two haplotype draft genome sequences of Helicoverpa assulta assembled by hifiasm. figshare. Dataset. https://doi.org/10.6084/m9.figshare.24899049 (2023).

Acknowledgements

This project is supported by the CNTC Research Program [No. 110202201004 (JY-04)].

Author information

Authors and Affiliations

China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, 450001, China
Yalong Xu, Chen Wang, Zefeng Li, Xueao Zheng, Zhengzhong Kang, Peng Lu, Jianfeng Zhang, Peijian Cao & Qiansi Chen
Beijing Life Science Academy (BLSA), Beijing, 102209, China
Yalong Xu, Chen Wang, Zefeng Li, Xueao Zheng, Zhengzhong Kang, Peng Lu, Jianfeng Zhang, Peijian Cao & Qiansi Chen
Institution Henan International Laboratory for Green Pest Control, Henan Engineering Laboratory of Pest Biological Control, College of Plant Protection, Henan Agricultural University, Zhengzhou, 450000, China
Xiaoguang Liu

Authors

Yalong Xu
Chen Wang
Zefeng Li
Xueao Zheng
Zhengzhong Kang
Peng Lu
Jianfeng Zhang
Peijian Cao
Qiansi Chen
Xiaoguang Liu

Contributions

Y.X. and X.L. conceived the research project. Z.K., X.Z. and J.Z. collected the samples. P.C. and P.L. downloaded the RNA-seq data and bioinformatics tools from NCBI and public sites. Y.X., C.W. and Z.L. performed the analyses. Y.X. wrote the draft manuscript. Q.C. and X.L. revised the manuscript.

Corresponding authors

Correspondence to Qiansi Chen or Xiaoguang Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xu, Y., Wang, C., Li, Z. et al. A chromosome-level haplotype-resolved genome assembly of oriental tobacco budworm (Helicoverpa assulta). Sci Data 11, 461 (2024). https://doi.org/10.1038/s41597-024-03264-6

Received: 25 December 2023
Accepted: 15 April 2024
Published: 06 May 2024
DOI: https://doi.org/10.1038/s41597-024-03264-6