I recently bought the Full Genomes "Long Read" product (10xGenomics' "Chromium") for myself. This uses 10xGenomics' "Chromium" machine to cut DNA into pieces of 50 or so kilobases. These are then separated, chopped into smaller pieces appropriate for Illumina sequencing, and each smaller piece is barcoded with an identifier for the long piece it came from. These are then submitted for the usual Illumina sequencing. The special 10xGenomics software then aligns the pieces to Build 38, using the special bar codes to guide the alignment. The BAM file (it's one file for the whole person, 109 GIGAbytes) contains a tag (called "BX") giving the bar code sequence.

The idea I had was to use 10xGenomics long reads to understand the nature of the strange mutation that is at the heart of the Clan Donald DNA project for our R1a people who are in the male line of the clan Chiefs. This mutation occurred in John, First Lord of the Isles (d. 1386) or possibly his father, Angus Og (Og means "junior"), per impeccable paper trails. This mutation is none of the usual SNPs, STRs, or even small indels that are the bread and butter of this email list. In fact it showed up as THREE mutations, called CLD56, CLD57, and CLD82 (you can guess what "CLD" means). These are inside the very notorious DYZ19 region of 126 (or so) base pair repeats. They appeared, in BigY and Full Genomes' YElite, to be deletions 1 to 2 kilobases long. (The end positions are somewhat diffuse.) They are completely reliably detected in any Illumina sequencing product by comparing the number of "reads" in those areas with the number in adjacent areas. This is complicated by the fact that there are numerous areas of roughly one kilobase within DYZ19 which are empty of reads in everybody. But it should be obvious that THREE very unusual mutations arising in one person would be exceedingly improbable. So I undertook the project with the long reads. The results are very gratifying, though the long reads alone were not a panacea.
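The read-count comparison mentioned above (reads inside a suspected deletion versus reads in the adjacent areas) can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the function names, the choice of read start positions as input, and the flank width are all assumptions made for the example.

```python
from bisect import bisect_left


def reads_in(sorted_starts, lo, hi):
    """Count read start positions p with lo <= p < hi (input must be sorted)."""
    return bisect_left(sorted_starts, hi) - bisect_left(sorted_starts, lo)


def depth_ratio(read_starts, region, flank):
    """Reads per base inside `region` divided by reads per base in the two
    flanking windows of width `flank` on either side of it.

    A ratio near 0 suggests a deletion; a ratio near 1 suggests none.
    Returns None when the flanks contain no reads at all.
    (Hypothetical helper; the flank width is an arbitrary choice.)
    """
    starts = sorted(read_starts)
    lo, hi = region
    inside = reads_in(starts, lo, hi) / (hi - lo)
    flank_reads = reads_in(starts, lo - flank, lo) + reads_in(starts, hi, hi + flank)
    outside = flank_reads / (2 * flank)
    if outside == 0:
        return None
    return inside / outside


if __name__ == "__main__":
    # Toy data: one read every 10 bases, with nothing between 2000 and 3000.
    toy = [p for p in range(0, 5000, 10) if not 2000 <= p < 3000]
    print(depth_ratio(toy, (2000, 3000), 1000))  # candidate deletion
    print(depth_ratio(toy, (500, 1500), 500))    # ordinary region
```

Because some stretches of DYZ19 are empty of reads in everybody, a low ratio only means something when the same region gives a normal ratio in a person who does not carry the mutation.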
Here's what I learned, though not necessarily in the order I learned it.

First I examined the length of the long reads for various regions of the Y chromosome. The distribution of the length of the read sets (2 to 30 reads per long molecule) peaks near 0 bases, i.e. a single paired-end read, and the probability drops off smoothly as the barcoded groups get longer. The 1/2 length is 1250, the 1/10 length is 8200, and the 1/100 point is about 27000. However, in the DYZ19 region the typical length is over 10,000, and across the region exactly bracketing the three CLD markers, almost all of them are over 50,000. In addition, almost all the long reads bracketing the three mutations start exactly at the start of one and end at the end of the third. These are the smoking guns that Build 38 is simply wrong in that area. But the long reads alone cannot tell us what is right.

Pacbio's long read product is too expensive. It chops out pieces of DNA strongly clustered around 10,000 bases long and sequences the whole piece. These are then assembled de novo, albeit with some difficulty because each read has about a 5-10% error rate. If you generate enough coverage the consensus will be right. However, I discovered that Pacbio had generated two whole genomes of males (they seem to prefer females) that have already been assembled. (There is an even better one in progress.) So what I did was regenerate the FASTQ files from my Chromium BAM file (I'm R1a) as well as from an R1b one I got from Full Genomes. Note that the software that generated the regular Chromium BAM files aligned against Build 38 also generated a small (~20,000 bases) auxiliary file for the Y chromosome. I ascertained, by BLASTing it against both the NCBI database and the Pacbio reference, that it was likely from the DYZ19 area, so I added it to the FASTQ files.
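For anyone wanting to repeat the barcode-group length measurement described earlier in this message, here is a minimal sketch of one way it might be done. It assumes the reads have already been pulled out of the BAM file as (barcode, start, end) tuples; with a library such as pysam these would come from the "BX" tag and the alignment coordinates. The function names, the `max_gap` cutoff for splitting one barcode into separate molecules, and the bin width are assumptions for the example, not the settings behind the numbers above.

```python
from collections import Counter, defaultdict


def molecule_spans(reads, max_gap=50_000):
    """Group reads by barcode and return the span (in bases) of each group.

    `reads` is an iterable of (barcode, start, end) tuples on one chromosome.
    Reads sharing a barcode but separated by more than `max_gap` bases are
    treated as different molecules (an assumed cutoff, since one barcode can
    label more than one long piece).
    """
    by_barcode = defaultdict(list)
    for barcode, start, end in reads:
        by_barcode[barcode].append((start, end))

    spans = []
    for intervals in by_barcode.values():
        intervals.sort()
        mol_start, mol_end = intervals[0]
        for start, end in intervals[1:]:
            if start - mol_end > max_gap:
                # Too far away: close this molecule and start a new one.
                spans.append(mol_end - mol_start)
                mol_start, mol_end = start, end
            else:
                mol_end = max(mol_end, end)
        spans.append(mol_end - mol_start)
    return spans


def span_histogram(spans, bin_width=1000):
    """Count spans per bin; key k covers [k*bin_width, (k+1)*bin_width)."""
    return Counter(span // bin_width for span in spans)


if __name__ == "__main__":
    toy = [
        ("AAAC", 100, 250), ("AAAC", 9_000, 9_150),  # one molecule
        ("AAAC", 400_000, 400_150),                   # same barcode, far away
        ("GGTT", 500, 650),                           # a lone read pair
    ]
    print(sorted(molecule_spans(toy)))
    print(span_histogram(molecule_spans(toy)))
```

Running this separately on reads from an ordinary part of the Y and on reads from DYZ19 would give the two histograms to compare.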
There are two FASTQ files, one for each end of each paired read. I then aligned these to the Pacbio reference using BWA. Comparing the two new BAM files in IGV, it was clear that what had been three different deletions in the plain Illumina alignment against Build 38 had become one single deletion. This can be anywhere from 10100 to 13500 bases long. It's inexact because one end is in an area where there are exact 126 base repeats, so aligning Illumina reads to the Pacbio reference can't help; only looking at real Pacbio reads of me can help. The little piece that Build 38 has unassigned was assigned with no trouble against the Pacbio reference.

Thus the bottom line is that these new methods can indeed help with strange large variants, but Pacbio is best ... and still too expensive.

Doug McDonald