bcftbx.FASTQFile
A set of classes for reading through FASTQ files and manipulating the data within them:
- FastqIterator: enables looping through all read records in FASTQ file
- FastqRead: provides access to a single FASTQ read record
- SequenceIdentifier: provides access to sequence identifier info in a read
- FastqAttributes: provides access to gross attributes of FASTQ file
Additionally there are a few utility functions:
- get_fastq_file_handle: return a file handle opened for reading a FASTQ file
- nreads: return the number of reads in a FASTQ file
- fastqs_are_pair: check whether two FASTQs form an R1/R2 pair
Information on the FASTQ file format: http://en.wikipedia.org/wiki/FASTQ_format
class bcftbx.FASTQFile.FastqAttributes(fastq_file=None, fp=None)
Class to provide access to gross attributes of a FASTQ file
Given a FASTQ file (can be uncompressed or gzipped), enables various attributes to be queried via the following properties:
- nreads: number of reads in the FASTQ file
- fsize: size of the file (in bytes)
fsize
Return size of the FASTQ file (bytes)
nreads
Return number of reads in the FASTQ file
class bcftbx.FASTQFile.FastqIterator(fastq_file=None, fp=None, bufsize=102400)
Class to loop over all records in a FASTQ file, returning a FastqRead object for each record.
Example looping over all reads:
>>> for read in FastqIterator(fastq_file):
...     print(read)
Input FASTQ can be in gzipped format; FASTQ data can also be supplied as a file-like object opened for reading, for example:
>>> fp = io.open(fastq_file,'rt')
>>> for read in FastqIterator(fp=fp):
...     print(read)
>>> fp.close()
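The record-by-record looping described above can be illustrated with a minimal standard-library sketch. It is not the FastqIterator implementation: the helper name `iter_fastq_records` is hypothetical, it yields plain tuples rather than FastqRead objects, and it does no buffering of its own (so it has no equivalent of `bufsize`). It only shows the idea of consuming a FASTQ stream four lines at a time.

```python
import io
from itertools import islice

def iter_fastq_records(fp):
    """Yield (seqid, sequence, optid, quality) tuples from an open text stream.

    Hypothetical helper for illustration; not part of bcftbx.
    """
    while True:
        # A FASTQ record occupies four consecutive lines
        lines = [line.rstrip("\n") for line in islice(fp, 4)]
        if not lines:
            return  # end of input
        if len(lines) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(lines)

# Two made-up records held in memory instead of a file on disk
fastq_text = (
    "@SEQ_1\nGATTACA\n+\nIIIIIII\n"
    "@SEQ_2\nCCGGTTA\n+\n!!!!!!!\n"
)
records = list(iter_fastq_records(io.StringIO(fastq_text)))
for seqid, sequence, optid, quality in records:
    print(seqid, sequence, quality)
```

Any file-like object opened in text mode could be passed in place of the `io.StringIO` instance.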
class bcftbx.FASTQFile.FastqRead(seqid_line=None, seq_line=None, optid_line=None, quality_line=None)
Class to store a FASTQ record with information about a read
Provides the following properties for accessing the read data:
- seqid: the “sequence identifier” information (first line of the read record) as a SequenceIdentifier object
- sequence: the raw sequence (second line of the record)
- optid: the optional sequence identifier line (third line of the record)
- quality: the quality values (fourth line of the record)
Additional properties:
- raw_seqid: the original sequence identifier string supplied when the object was created
- seqlen: length of the sequence
- maxquality: maximum quality value (in character representation)
- minquality: minimum quality value (in character representation)
- is_colorspace: returns True if the read looks like a colorspace read, False otherwise
Note
Quality scores can only be obtained from character representations once the encoding scheme is known.
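To make the note concrete, here is a small sketch of how quality characters map to numeric scores once an encoding is chosen. It assumes the common Phred+33 offset by default (older Illumina pipelines used Phred+64); the function name `phred_scores` is hypothetical and not part of bcftbx.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores.

    The offset is an assumption supplied by the caller: 33 for
    Sanger/Illumina 1.8+ style encoding, 64 for older Illumina data.
    """
    return [ord(ch) - offset for ch in quality]

quality = "!5I"

# Maximum and minimum in character representation: plain character
# comparison is enough, no encoding knowledge needed
max_char = max(quality)
min_char = min(quality)

# Numeric scores require committing to an encoding scheme
scores = phred_scores(quality)
print(min_char, max_char, scores)
```

The same characters give different scores under a different offset, which is why the character representation alone is not enough to recover numeric qualities.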
class bcftbx.FASTQFile.SequenceIdentifier(seqid)
Class to store/manipulate sequence identifier information from a FASTQ record
Provides access to the data items in the sequence identifier line of a FASTQ record.
format
Identify the format of the sequence identifier
Returns: ‘illumina18’, ‘illumina’ or None
Return type: String
is_pair_of(seqid)
Check if this forms a pair with another SequenceIdentifier
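As a rough illustration of format detection and pair checking, the sketch below parses identifier lines with regular expressions. Everything in it is an assumption rather than the SequenceIdentifier implementation: the field layouts follow the Illumina conventions described on the Wikipedia FASTQ page, the names `parse_seqid` and `is_pair` are hypothetical, and the pairing rule (identical fields apart from the read number, which must be 1 and 2) is one plausible choice.

```python
import re

# Assumed layout for Illumina 1.8+ identifiers:
# @instrument:run:flowcell:lane:tile:x:y pair:filtered:control:index
ILLUMINA18 = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<run>[^:]+):(?P<flowcell>[^:]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+) "
    r"(?P<pair>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)
# Assumed layout for older Illumina identifiers:
# @instrument:lane:tile:x:y#index/pair
ILLUMINA = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>\d+)$"
)

def parse_seqid(seqid):
    """Return (format_name, fields); format_name is None if unrecognised."""
    for name, pattern in (("illumina18", ILLUMINA18), ("illumina", ILLUMINA)):
        match = pattern.match(seqid)
        if match:
            return name, match.groupdict()
    return None, {}

def is_pair(seqid1, seqid2):
    """True if seqid1 looks like read 1 and seqid2 like the matching read 2."""
    format1, fields1 = parse_seqid(seqid1)
    format2, fields2 = parse_seqid(seqid2)
    if format1 is None or format1 != format2:
        return False
    # The read numbers must be 1 and 2, in that order
    if (fields1.pop("pair"), fields2.pop("pair")) != ("1", "2"):
        return False
    # All remaining fields must agree
    return fields1 == fields2

r1 = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
r2 = "@EAS139:136:FC706VJ:2:2104:15343:197393 2:Y:18:ATCACG"
print(parse_seqid(r1)[0], is_pair(r1, r2))
```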
bcftbx.FASTQFile.fastqs_are_pair(fastq1=None, fastq2=None, verbose=True, fp1=None, fp2=None)
Check that two FASTQs form an R1/R2 pair
Parameters:
- fastq1 – first FASTQ
- fastq2 – second FASTQ
Returns: True if each read in fastq1 forms an R1/R2 pair with the equivalent read (i.e. in the same position) in fastq2; False if any reads do not form an R1/R2 pair, or if one file contains more reads than the other.
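The position-by-position comparison described in the return value can be sketched as follows. This is not the fastqs_are_pair implementation: `seqids_pair_up` and `simple_is_pair` are hypothetical names, the function works on lists of identifier strings rather than files, and the pairing rule assumes old-style identifiers ending in `/1` and `/2`.

```python
from itertools import zip_longest

def simple_is_pair(id1, id2):
    """Assumed pairing rule: same identifier apart from '/1' and '/2' suffixes."""
    return id1.endswith("/1") and id2.endswith("/2") and id1[:-2] == id2[:-2]

def seqids_pair_up(seqids1, seqids2, is_pair=simple_is_pair):
    """True if identifiers pair up position by position, with no leftovers."""
    for id1, id2 in zip_longest(seqids1, seqids2):
        # zip_longest pads the shorter input with None, which exposes
        # a mismatch in the number of reads
        if id1 is None or id2 is None:
            return False
        if not is_pair(id1, id2):
            return False
    return True

r1_ids = ["@read_a/1", "@read_b/1"]
r2_ids = ["@read_a/2", "@read_b/2"]
paired = seqids_pair_up(r1_ids, r2_ids)
unequal = seqids_pair_up(r1_ids, r2_ids[:1])
swapped = seqids_pair_up(r1_ids, list(reversed(r2_ids)))
print(paired, unequal, swapped)
```

Stopping at the first failure avoids reading the rest of both inputs once the answer is known.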
bcftbx.FASTQFile.get_fastq_file_handle(fastq, mode='rt')
Return a file handle opened for reading for a FASTQ file
Deals with both compressed (gzipped) and uncompressed FASTQ files.
Parameters:
- fastq – name (including path, if required) of FASTQ file. The file can be gzipped (must have ‘.gz’ extension)
- mode – optional mode for file opening (defaults to ‘rt’)
Returns: File handle that can be used for read operations.
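Choosing an opener from the file extension could look like the sketch below. It uses only the standard `gzip` and `io` modules; the function name `open_fastq` is hypothetical and the code is an approximation of the described behaviour, not the library's own implementation.

```python
import gzip
import io
import os
import tempfile

def open_fastq(path, mode="rt"):
    """Open a FASTQ file for reading, using gzip if the name ends in '.gz'."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return io.open(path, mode)

record = "@SEQ_1\nGATTACA\n+\nIIIIIII\n"

# Write the same record to a plain and a gzipped temporary file,
# then read both back through the same function
with tempfile.TemporaryDirectory() as tmpdir:
    plain = os.path.join(tmpdir, "reads.fastq")
    zipped = os.path.join(tmpdir, "reads.fastq.gz")
    with io.open(plain, "wt") as fp:
        fp.write(record)
    with gzip.open(zipped, "wt") as fp:
        fp.write(record)
    contents = []
    for path in (plain, zipped):
        with open_fastq(path) as fp:
            contents.append(fp.read())

print(contents[0] == contents[1])
```

Relying on the extension keeps the check cheap, at the cost of misreading a gzipped file that lacks the ‘.gz’ suffix.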
bcftbx.FASTQFile.nreads(fastq=None, fp=None)
Return number of reads in a FASTQ file
Performs a simple-minded read count, by counting the number of lines in the file and dividing by 4.
The FASTQ file can be specified either as a file name (using the ‘fastq’ argument) or as a file-like object opened for line reading (using the ‘fp’ argument).
This function can handle gzipped FASTQ files supplied via the ‘fastq’ argument.
Line counting uses a variant of the “buf count” method outlined here: http://stackoverflow.com/a/850962/579925
Parameters:
- fastq – fastq(.gz) file
- fp – file-like object for the FASTQ file, opened for line reading
Returns: Number of reads
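The counting approach described above (count lines in fixed-size chunks, then divide by 4) can be sketched as follows. This is an approximation for text-mode streams, not the library's code: the name `count_reads` and its `bufsize` default are assumptions made for illustration.

```python
import io

def count_reads(fp, bufsize=102400):
    """Count reads in a text stream by counting newlines chunk by chunk."""
    nlines = 0
    last_char = "\n"
    while True:
        buf = fp.read(bufsize)
        if not buf:
            break
        nlines += buf.count("\n")
        last_char = buf[-1]
    # A final line without a trailing newline still counts as a line
    if last_char != "\n":
        nlines += 1
    # Four lines per FASTQ record
    return nlines // 4

fastq_text = "@SEQ_1\nGATTACA\n+\nIIIIIII\n" * 3
n_reads = count_reads(io.StringIO(fastq_text))
# A tiny buffer forces several chunks and gives the same answer
n_reads_small_buffer = count_reads(io.StringIO(fastq_text), bufsize=5)
print(n_reads, n_reads_small_buffer)
```

Reading in chunks avoids holding the whole file in memory, but the result is only as reliable as the assumption that every record spans exactly four lines.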