Class to downsample a BAM file while respecting that we should either get rid of both ends of a pair or neither end of the pair. In addition, this program uses the read-name and extracts the position within the tile whence the read came from. The downsampling is based on this position. Results with the exact same input will produce the same results. Note 1: This is technology and read-name dependent. If your read-names do not have coordinate information, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines) this will not work properly. This has been designed with Illumina MiSeq/HiSeq in mind. Note 2: The downsampling is not random. It is deterministically dependent on the position of the read within its tile. Note 3: Downsampling twice with this program is not supported. Note 4: You should call MarkDuplicates after downsampling. Finally, the code has been designed to simulate sequencing less as accurately as possible, not for getting an exact downsample fraction. In particular, since the reads may be distributed non-evenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input argument FRACTION.
No Tags found
No Biostars posts found