BitMagic-C++
xsample07.cpp File Reference

Example: Use of bvector<> for k-mer fingerprints. K should be short; no minimizers are used here.

#include <assert.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <map>
#include <algorithm>
#include <utility>
#include <future>
#include <thread>
#include <mutex>
#include <atomic>
#include "bm64.h"
#include "bmalgo.h"
#include "bmserial.h"
#include "bmaggregator.h"
#include "bmsparsevec_compr.h"
#include "bmsparsevec_algo.h"
#include "bmundef.h"
#include "bmdbg.h"
#include "bmtimer.h"
#include "dna_finger.h"
#include "cmd_args.h"

Data Structures

class  SortCounting_JobFunctor< BV >
 Functor to process job batch (task).
class  Counting_JobFunctor< DNA_Scan >
 k-mer counting job functor class using bm::aggregator<>

Typedefs

typedef std::vector< char > vector_char_type
typedef DNA_FingerprintScanner< bm::bvector<> > dna_scanner_type
typedef bm::sparse_vector< unsigned, bm::bvector<> > sparse_vector_u32
typedef bm::rsc_sparse_vector< unsigned, sparse_vector_u32 > rsc_sparse_vector_u32
typedef std::map< unsigned, unsigned > histogram_map_u32

Functions

std::atomic_ullong k_mer_progress_count (0)
static int load_FASTA (const std::string &fname, vector_char_type &seq_vect)
 really simple FASTA parser (one entry per file)
bool get_DNA_code (char bp, bm::id64_t &dna_code)
bool get_kmer_code (const char *dna, size_t pos, unsigned k_size, bm::id64_t &k_mer)
 Calculate k-mer as an unsigned long integer.
char int2DNA (unsigned code)
 Translate integer code to DNA letter.
void translate_kmer (std::string &dna, bm::id64_t kmer_code, unsigned k_size)
 Translate k-mer code into ATGC DNA string.
void validate_k_mer (const char *dna, size_t pos, unsigned k_size, bm::id64_t k_mer)
 QA function to validate if reverse k-mer decode gives the same string.
template<typename VECT>
void sort_unique (VECT &vect)
 Auxiliary function to do sort+unique on a vector of ints (removes duplicate elements).
template<typename VECT, typename COUNT_VECT>
void sort_count (VECT &vect, COUNT_VECT &cvect)
 Auxiliary function to do sort+unique on a vector of ints and save the results in a counts vector.
template<typename BV>
void generate_k_mer_bvector (BV &bv, const vector_char_type &seq_vect, unsigned k_size, bool check)
 This function turns each k-mer into an integer number and encodes it in a bit-vector (presence vector). The natural limitation here is that the integer has to be less than 48 bits (a limitation of bm::bvector<>). This method builds a presence k-mer fingerprint vector which can be used for Jaccard distance comparison.
void count_kmers (const vector_char_type &seq_vect, unsigned k_size, rsc_sparse_vector_u32 &kmer_counts)
 k-mer counting algorithm using the reference sequence: regenerates k-mer codes, sorts them, and counts.
template<typename BV>
void count_kmers_parallel (const BV &bv_kmers, const vector_char_type &seq_vect, rsc_sparse_vector_u32 &kmer_counts, unsigned k_size, unsigned concurrency)
 Multi-threaded (MT) k-mer counting.
template<typename BV>
void count_kmers (const BV &bv_kmers, rsc_sparse_vector_u32 &kmer_counts)
 k-mer counting method using the Bitap algorithm for occurrence search. This method is significantly slower than direct regeneration of k-mer codes followed by sort-and-count.
template<typename BV>
void count_kmers_parallel (const BV &bv_kmers, rsc_sparse_vector_u32 &kmer_counts, unsigned concurrency)
 Runs k-mer counting in parallel.
static void compute_kmer_histogram (histogram_map_u32 &hmap, const rsc_sparse_vector_u32 &kmer_counts)
 Compute a map of how often each k-mer frequency is observed in the k-mer counts vector.
static void report_hmap (const string &fname, const histogram_map_u32 &hmap)
 Save TSV report of k-mer frequencies (reverse sorted, most frequent k-mers first).
template<typename BV>
void compute_frequent_kmers (BV &frequent_bv, const histogram_map_u32 &hmap, const rsc_sparse_vector_u32 &kmer_counts, unsigned percent, unsigned k_size)
 Create a vector representing the subset of high-frequency k-mers.
int main (int argc, char *argv[])

Variables

std::string ifa_name
std::string ikd_name
std::string ikd_counts_name
std::string kh_name
std::string ikd_rep_name
std::string ikd_freq_name
bool is_diag = false
bool is_timing = false
bool is_bench = false
unsigned ik_size = 8
unsigned parallel_jobs = 4
unsigned f_percent = 5
bm::chrono_taker ::duration_map_type timing_map
dna_scanner_type dna_scanner

Detailed Description

Example: Use of bvector<> for k-mer fingerprints. K should be short; no minimizers are used here.

Definition in file xsample07.cpp.

Typedef Documentation

◆ dna_scanner_type

typedef DNA_FingerprintScanner<bm::bvector<> > dna_scanner_type
Examples
xsample07.cpp.

Definition at line 100 of file xsample07.cpp.

◆ histogram_map_u32

typedef std::map<unsigned, unsigned> histogram_map_u32
Examples
xsample07.cpp.

Definition at line 103 of file xsample07.cpp.

◆ rsc_sparse_vector_u32

typedef bm::rsc_sparse_vector<unsigned, sparse_vector_u32> rsc_sparse_vector_u32
Examples
xsample07.cpp.

Definition at line 102 of file xsample07.cpp.

◆ sparse_vector_u32

typedef bm::sparse_vector<unsigned, bm::bvector<> > sparse_vector_u32

Definition at line 101 of file xsample07.cpp.

◆ vector_char_type

typedef std::vector<char> vector_char_type
Examples
xsample07.cpp.

Definition at line 99 of file xsample07.cpp.

Function Documentation

◆ compute_frequent_kmers()

template<typename BV>
void compute_frequent_kmers ( BV & frequent_bv,
const histogram_map_u32 & hmap,
const rsc_sparse_vector_u32 & kmer_counts,
unsigned percent,
unsigned k_size )

Create a vector representing the subset of high-frequency k-mers.

Parameters
frequent_bv- [out] bit-vector of frequent k-mers (subset of all k-mers)
hmap- histogram map of all k-mers
kmer_counts- k-mer frequency (counts) vector
percent- percent of frequent k-mers to include in the subset (e.g. 5%); this is a percent of the total number of k-mers, not a percent of all occurrences
k_size- k-mer size
Examples
xsample07.cpp.

Definition at line 905 of file xsample07.cpp.

References bm::bvector< Alloc >::count(), bm::sparse_vector_scanner< SV, S_FACTOR >::find_eq(), bm::rsc_sparse_vector< Val, SV >::get(), bm::rsc_sparse_vector< Val, SV >::get_null_bvector(), and bm::bvector< Alloc >::iterator_base::valid().

Referenced by main().
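
The selection described above can be illustrated with a minimal sketch in plain standard C++. It is not the code from xsample07.cpp: the name frequent_kmers_sketch is hypothetical, std::map and std::vector stand in for rsc_sparse_vector_u32 and the output bit-vector, a linear scan stands in for the find_eq() search, the k_size parameter is omitted, and the stopping rule (add whole frequency classes, most frequent first, until the target share of distinct k-mers is reached) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for compute_frequent_kmers().
// hmap:        frequency -> number of distinct k-mers with that frequency
// kmer_counts: k-mer code -> occurrence count
inline void frequent_kmers_sketch(std::vector<std::uint64_t>& frequent,
                                  const std::map<unsigned, unsigned>& hmap,
                                  const std::map<std::uint64_t, unsigned>& kmer_counts,
                                  unsigned percent)
{
    assert(percent <= 100);
    frequent.clear();
    // Target is a share of distinct k-mers, not of all occurrences.
    const std::size_t target = kmer_counts.size() * percent / 100;

    // Walk the histogram from the highest frequency down.
    for (auto it = hmap.rbegin();
         it != hmap.rend() && frequent.size() < target; ++it)
    {
        const unsigned freq = it->first;
        // Linear scan: collect every k-mer whose count equals freq.
        for (const auto& kv : kmer_counts)
            if (kv.second == freq)
                frequent.push_back(kv.first);
    }
    std::sort(frequent.begin(), frequent.end());
}
```

Because a whole frequency class is added at once, the result may slightly exceed the requested percent in this sketch.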

◆ compute_kmer_histogram()

void compute_kmer_histogram ( histogram_map_u32 & hmap,
const rsc_sparse_vector_u32 & kmer_counts )
static

Compute a map of how often each k-mer frequency is observed in the k-mer counts vector.

Parameters
hmap- [out] histogram map
kmer_counts- [in] kmer counts vector
Examples
xsample07.cpp.

Definition at line 859 of file xsample07.cpp.

References bm::bvector< Alloc >::first(), bm::rsc_sparse_vector< Val, SV >::get(), and bm::rsc_sparse_vector< Val, SV >::get_null_bvector().

Referenced by main().
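
As an illustration of what such a histogram holds, here is a minimal sketch in standard C++. The names kmer_counts_sketch and kmer_histogram_sketch are hypothetical, and std::map is only a stand-in for rsc_sparse_vector_u32; the real function iterates the non-NULL elements of the counts vector instead.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using histogram_map_u32 = std::map<unsigned, unsigned>;
// Stand-in for rsc_sparse_vector_u32: k-mer code -> occurrence count.
using kmer_counts_sketch = std::map<std::uint64_t, unsigned>;

// For every observed count value, record how many distinct k-mers have it.
inline void kmer_histogram_sketch(histogram_map_u32& hmap,
                                  const kmer_counts_sketch& kmer_counts)
{
    for (const auto& kv : kmer_counts)
    {
        assert(kv.second > 0); // only k-mers that were actually seen
        ++hmap[kv.second];
    }
}
```

For example, if two k-mers occur 3 times each and one occurs once, the histogram is {1: 1, 3: 2}.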

◆ count_kmers() [1/2]

template<typename BV>
void count_kmers ( const BV & bv_kmers,
rsc_sparse_vector_u32 & kmer_counts )

k-mer counting method using the Bitap algorithm for occurrence search. This method is significantly slower than direct regeneration of k-mer codes followed by sort-and-count.

Definition at line 653 of file xsample07.cpp.

References dna_scanner, ik_size, bm::rsc_sparse_vector< Val, SV >::set(), and translate_kmer().

◆ count_kmers() [2/2]

void count_kmers ( const vector_char_type & seq_vect,
unsigned k_size,
rsc_sparse_vector_u32 & kmer_counts )
inline

k-mer counting algorithm using the reference sequence: regenerates k-mer codes, sorts them, and counts.

Examples
xsample07.cpp.

Definition at line 408 of file xsample07.cpp.

References get_DNA_code(), get_kmer_code(), and sort_count().

Referenced by count_kmers_parallel(), and count_kmers_parallel().

◆ count_kmers_parallel() [1/2]

template<typename BV>
void count_kmers_parallel ( const BV & bv_kmers,
const vector_char_type & seq_vect,
rsc_sparse_vector_u32 & kmer_counts,
unsigned k_size,
unsigned concurrency )

Multi-threaded (MT) k-mer counting.

Examples
xsample07.cpp.

Definition at line 594 of file xsample07.cpp.

References count_kmers(), ik_size, and bm::rank_range_split().

Referenced by main().

◆ count_kmers_parallel() [2/2]

template<typename BV>
void count_kmers_parallel ( const BV & bv_kmers,
rsc_sparse_vector_u32 & kmer_counts,
unsigned concurrency )

Runs k-mer counting in parallel.

Definition at line 781 of file xsample07.cpp.

References count_kmers(), dna_scanner, k_mer_progress_count(), and bm::rank_range_split().

◆ generate_k_mer_bvector()

template<typename BV>
void generate_k_mer_bvector ( BV & bv,
const vector_char_type & seq_vect,
unsigned k_size,
bool check )

This function turns each k-mer into an integer number and encodes it in a bit-vector (presence vector). The natural limitation here is that the integer has to be less than 48 bits (a limitation of bm::bvector<>). This method builds a presence k-mer fingerprint vector which can be used for Jaccard distance comparison.

Parameters
bv- [out] target bit-vector
seq_vect- [in] DNA sequence vector
k_size- dimension for k-mer generation
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 306 of file xsample07.cpp.

References bm::BM_SORTED, get_DNA_code(), get_kmer_code(), sort_unique(), timing_map, and validate_k_mer().

Referenced by main().
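
To make the Jaccard comparison concrete, the following sketch computes the distance between two presence fingerprints. It is an illustration under assumptions, not the example's code: the name jaccard_distance_sketch is hypothetical, and sorted, duplicate-free vectors of k-mer codes stand in for the bit-vectors. With real bit-vectors the same quantity would come from the population counts of their intersection and union.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Jaccard distance = 1 - |A AND B| / |A OR B|.
// Inputs must be sorted and contain no duplicates.
inline double jaccard_distance_sketch(const std::vector<std::uint64_t>& a,
                                      const std::vector<std::uint64_t>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<std::uint64_t> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));

    // |A OR B| = |A| + |B| - |A AND B|
    const std::size_t union_size = a.size() + b.size() - common.size();
    if (union_size == 0)
        return 0.0; // two empty fingerprints are treated as identical
    return 1.0 - double(common.size()) / double(union_size);
}
```

For fingerprints {1, 2, 3} and {2, 3, 4} the intersection has 2 elements and the union has 4, so the distance is 0.5.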

◆ get_DNA_code()

bool get_DNA_code ( char bp,
bm::id64_t & dna_code )
inline

◆ get_kmer_code()

bool get_kmer_code ( const char * dna,
size_t pos,
unsigned k_size,
bm::id64_t & k_mer )
inline

Calculate k-mer as an unsigned long integer.

Returns
true - if the k-mer is valid (not 'NNNNNN')
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 165 of file xsample07.cpp.

References get_DNA_code().

Referenced by count_kmers(), generate_k_mer_bvector(), and SortCounting_JobFunctor< BV >::operator()().
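
A minimal sketch of how a k-mer can be packed into an integer, 2 bits per base. The function names, the letter-to-code mapping (A=0, C=1, G=2, T=3) and the bit order (first base in the lowest bits) are assumptions made for illustration; the mapping and layout used in xsample07.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed 2-bit alphabet: A=0, C=1, G=2, T=3.
inline bool dna_code_sketch(char bp, std::uint64_t& code)
{
    switch (bp)
    {
    case 'A': code = 0; return true;
    case 'C': code = 1; return true;
    case 'G': code = 2; return true;
    case 'T': code = 3; return true;
    default:  return false; // 'N' or any other symbol
    }
}

// Pack k_size bases starting at dna[pos] into one integer.
// The caller must guarantee that pos + k_size does not run past the buffer.
inline bool kmer_code_sketch(const char* dna, std::size_t pos,
                             unsigned k_size, std::uint64_t& k_mer)
{
    assert(k_size > 0 && k_size <= 32); // 2 bits per base in 64 bits
    std::uint64_t acc = 0;
    for (unsigned i = 0; i < k_size; ++i)
    {
        std::uint64_t code;
        if (!dna_code_sketch(dna[pos + i], code))
            return false; // reject k-mers with non-ACGT letters
        acc |= code << (2 * i);
    }
    k_mer = acc;
    return true;
}
```

Under these assumptions "ACGT" encodes as 0 + (1 << 2) + (2 << 4) + (3 << 6) = 228, and any window containing 'N' is rejected.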

◆ int2DNA()

char int2DNA ( unsigned code)
inline

Translate integer code to DNA letter.

Definition at line 192 of file xsample07.cpp.

Referenced by translate_kmer(), and validate_k_mer().

◆ k_mer_progress_count()

◆ load_FASTA()

int load_FASTA ( const std::string & fname,
vector_char_type & seq_vect )
static

really simple FASTA parser (one entry per file)

Definition at line 116 of file xsample07.cpp.

References timing_map.

Referenced by main().
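
A single-entry FASTA reader of this kind can be sketched as follows. This is a hypothetical stand-in, not the example's parser: it reads from a std::istream rather than a file name so that it can be exercised without a file, and the return convention (0 on success, non-zero on failure) is an assumption.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Append all sequence lines to seq_vect, skipping '>' header lines.
inline int load_fasta_sketch(std::istream& in, std::vector<char>& seq_vect)
{
    if (!in.good())
        return 1; // stream could not be opened or is already exhausted
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty() || line[0] == '>')
            continue; // header or blank line
        for (char ch : line)
            if (ch != '\r') // tolerate Windows line endings
                seq_vect.push_back(ch);
    }
    return 0;
}
```

With a file, the same function could be called on a std::ifstream opened on the FASTA path.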

◆ main()

◆ report_hmap()

void report_hmap ( const string & fname,
const histogram_map_u32 & hmap )
static

Save TSV report of k-mer frequencies (reverse sorted, most frequent k-mers first).

Examples
xsample07.cpp.

Definition at line 881 of file xsample07.cpp.

Referenced by main().
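
Since std::map keeps its keys in ascending order, a reverse-sorted report only needs a reverse iteration. The sketch below is illustrative: the name report_hmap_sketch is hypothetical, it writes to any std::ostream instead of opening fname, and the column layout (frequency, then number of k-mers) is an assumption.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Write one "frequency<TAB>number_of_kmers" line per histogram entry,
// highest frequency first.
inline void report_hmap_sketch(std::ostream& out,
                               const std::map<unsigned, unsigned>& hmap)
{
    assert(out.good());
    for (auto it = hmap.rbegin(); it != hmap.rend(); ++it)
        out << it->first << '\t' << it->second << '\n';
}
```

Passing a std::ofstream opened on the report file name would produce the TSV file on disk.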

◆ sort_count()

template<typename VECT, typename COUNT_VECT>
void sort_count ( VECT & vect,
COUNT_VECT & cvect )

Auxiliary function to do sort+unique on a vector of ints and save the results in a counts vector.

Examples
xsample07.cpp.

Definition at line 268 of file xsample07.cpp.

Referenced by count_kmers(), and SortCounting_JobFunctor< BV >::operator()().
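
The sort-then-count idea can be sketched as below: after sorting, equal values form contiguous runs, so one pass is enough to count them. The name sort_count_sketch is hypothetical, and a std::map is used as a stand-in for the counts vector; how the real COUNT_VECT is updated is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Sort vect, then add the length of each run of equal values to counts.
template<typename VECT, typename COUNT_MAP>
void sort_count_sketch(VECT& vect, COUNT_MAP& counts)
{
    if (vect.empty())
        return;
    std::sort(vect.begin(), vect.end());
    auto prev = vect[0];
    unsigned cnt = 1;
    for (std::size_t i = 1; i < vect.size(); ++i)
    {
        if (vect[i] == prev)
        {
            ++cnt;          // still inside the current run
            continue;
        }
        counts[prev] += cnt; // run ended: flush it
        prev = vect[i];
        cnt = 1;
    }
    counts[prev] += cnt;     // flush the last run
    assert(!counts.empty());
}
```

Using += rather than = lets several batches accumulate into the same counts container.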

◆ sort_unique()

template<typename VECT>
void sort_unique ( VECT & vect)

Auxiliary function to do sort+unique on a vector of ints (removes duplicate elements).

Examples
xsample07.cpp.

Definition at line 256 of file xsample07.cpp.

Referenced by generate_k_mer_bvector().
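
The usual standard-library idiom for sort+unique is shown in the sketch below. The name sort_unique_sketch is hypothetical, and whether the example's own function is written exactly this way is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort the vector and erase adjacent duplicates, leaving unique values.
template<typename VECT>
void sort_unique_sketch(VECT& vect)
{
    std::sort(vect.begin(), vect.end());
    // std::unique moves the unique values to the front and returns
    // the new logical end; erase() drops the leftover tail.
    vect.erase(std::unique(vect.begin(), vect.end()), vect.end());
    assert(std::is_sorted(vect.begin(), vect.end()));
}
```

A sorted, duplicate-free batch of k-mer codes is a convenient input for bulk loading into a bit-vector.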

◆ translate_kmer()

void translate_kmer ( std::string & dna,
bm::id64_t kmer_code,
unsigned k_size )
inline

Translate k-mer code into ATGC DNA string.

Parameters
dna- target string
kmer_code- k-mer code
k_size- k-mer size
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 207 of file xsample07.cpp.

References int2DNA().

Referenced by count_kmers(), and Counting_JobFunctor< DNA_Scan >::operator()().
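
Decoding reverses the 2-bit packing: take the low 2 bits, map them to a letter, shift right, repeat. The sketch below assumes the mapping 0=A, 1=C, 2=G, 3=T and that the first base sits in the lowest bits; both are illustrative assumptions, as are the function names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed mapping of 2-bit codes to letters.
inline char int_to_dna_sketch(unsigned code)
{
    static const char lut[] = { 'A', 'C', 'G', 'T' };
    assert(code < 4);
    return lut[code];
}

// Rebuild the k_size-letter string from a packed k-mer code.
inline void translate_kmer_sketch(std::string& dna,
                                  std::uint64_t kmer_code,
                                  unsigned k_size)
{
    dna.resize(k_size);
    for (unsigned i = 0; i < k_size; ++i)
    {
        dna[i] = int_to_dna_sketch(unsigned(kmer_code & 3u));
        kmer_code >>= 2; // move on to the next base
    }
}
```

Under these assumptions the code 228 with k_size 4 decodes to "ACGT".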

◆ validate_k_mer()

void validate_k_mer ( const char * dna,
size_t pos,
unsigned k_size,
bm::id64_t k_mer )
inline

QA function to validate if reverse k-mer decode gives the same string.

Examples
xsample07.cpp.

Definition at line 224 of file xsample07.cpp.

References int2DNA().

Referenced by generate_k_mer_bvector().

Variable Documentation

◆ dna_scanner

dna_scanner_type dna_scanner
Examples
xsample07.cpp.

Definition at line 109 of file xsample07.cpp.

Referenced by count_kmers(), count_kmers_parallel(), and main().

◆ f_percent

unsigned f_percent = 5
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 91 of file xsample07.cpp.

Referenced by main().

◆ ifa_name

std::string ifa_name

Definition at line 80 of file xsample07.cpp.

◆ ik_size

unsigned ik_size = 8

◆ ikd_counts_name

std::string ikd_counts_name
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 82 of file xsample07.cpp.

Referenced by main().

◆ ikd_freq_name

std::string ikd_freq_name
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 85 of file xsample07.cpp.

Referenced by main().

◆ ikd_name

std::string ikd_name
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 81 of file xsample07.cpp.

Referenced by main().

◆ ikd_rep_name

std::string ikd_rep_name
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 84 of file xsample07.cpp.

Referenced by main().

◆ is_bench

bool is_bench = false

Definition at line 88 of file xsample07.cpp.

◆ is_diag

bool is_diag = false

Definition at line 86 of file xsample07.cpp.

◆ is_timing

bool is_timing = false

Definition at line 87 of file xsample07.cpp.

◆ kh_name

std::string kh_name
Examples
xsample07.cpp, and xsample07a.cpp.

Definition at line 83 of file xsample07.cpp.

Referenced by main().

◆ parallel_jobs

unsigned parallel_jobs = 4

Definition at line 90 of file xsample07.cpp.

◆ timing_map

bm::chrono_taker ::duration_map_type timing_map

Definition at line 108 of file xsample07.cpp.