java-genomics-toolkit

about.

This is a collection of applications for genomics data processing, primarily high-throughput next-generation sequencing. There is a focus on processing data in Wiggle format, since many other tools are available for SAM/BAM, Bed, Fastq, etc. Wiggle/BigWig formats provide a compact way to store numerical data resulting from ChIP-seq, MNase-seq, FAIRE-seq, and DNase-seq experiments. This toolkit provides applications for adding, subtracting, dividing, multiplying, log-transforming, averaging, Z-scoring, and smoothing Wig files. There are also tools for performing analysis of MNase-seq (nucleosome mapping) data, creating heatmaps, and averaging the values from aligned loci.

Tools may be run from the terminal or from Galaxy.

All tools are designed to process data in chunks so that memory requirements never exceed ~1GB, regardless of genome size. Tools are intended to be modular, so that multiple tools can easily be strung together into ad hoc pipelines or workflows in Galaxy. For example, a common pipeline for our ChIP-seq experiments is: 1) map reads with bowtie, 2) calculate coverage of sequencing reads, 3) normalize by subtracting input, 4) Z-score the normalized coverage, 5) correlate replicates, 6) average multiple replicates, and 7) make a heatmap of the final result.
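
Steps 3 and 4 of the pipeline above, processed chunk by chunk, might look roughly like the sketch below. It is an illustration only, not the toolkit's actual implementation: the class name ChunkedZScore, the chunkSize parameter, and the use of in-memory arrays in place of Wig files are all assumptions made for the example. The code avoids Java 8 features so that it compiles under Java 7.

```java
public class ChunkedZScore {

    /** Subtracts input from signal one chunk at a time (arrays stand in for Wig files). */
    public static double[] subtract(double[] signal, double[] input, int chunkSize) {
        if (signal.length != input.length || chunkSize <= 0) {
            throw new IllegalArgumentException("Arrays must match and chunkSize must be positive");
        }
        double[] out = new double[signal.length];
        for (int start = 0; start < signal.length; start += chunkSize) {
            int stop = Math.min(start + chunkSize, signal.length);
            for (int i = start; i < stop; i++) {
                out[i] = signal[i] - input[i];
            }
        }
        return out;
    }

    /** Z-scores values in two chunked passes: accumulate statistics, then rescale. */
    public static double[] zscore(double[] values, int chunkSize) {
        if (values.length == 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("Need at least one value and a positive chunkSize");
        }
        // Pass 1: accumulate the sum and sum of squares chunk by chunk
        double sum = 0, sumSq = 0;
        for (int start = 0; start < values.length; start += chunkSize) {
            int stop = Math.min(start + chunkSize, values.length);
            for (int i = start; i < stop; i++) {
                sum += values[i];
                sumSq += values[i] * values[i];
            }
        }
        double mean = sum / values.length;
        // Population variance; clamp at zero to guard against rounding error
        double stdev = Math.sqrt(Math.max(0, sumSq / values.length - mean * mean));
        if (stdev == 0) {
            throw new IllegalArgumentException("Cannot Z-score constant data");
        }
        // Pass 2: rescale each chunk with the global mean and standard deviation
        double[] out = new double[values.length];
        for (int start = 0; start < values.length; start += chunkSize) {
            int stop = Math.min(start + chunkSize, values.length);
            for (int i = start; i < stop; i++) {
                out[i] = (values[i] - mean) / stdev;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] chip = {5, 7, 9, 11};
        double[] input = {1, 1, 1, 1};
        double[] z = zscore(subtract(chip, input, 2), 2);
        System.out.println(java.util.Arrays.toString(z));
    }
}
```

Only one chunk needs to be held in memory during each pass when the data come from disk, which is the idea behind keeping memory use bounded regardless of genome size.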

tools.

For an up-to-date list of available tools, search for java-genomics-toolkit in the Galaxy Tool Shed.

converters.

IntervalToWig: convert valued Bed/BedGraph/GFF interval data (e.g. from microarrays) to Wiggle format
IntervalToBed: convert valued BigBed/BedGraph/GFF/VCF/SAM/BAM reads or intervals to Bed format
RomanNumeralize: convert chromosome names to Roman numerals (e.g. chr12 to chrXII)
GeneTrackToWig: convert GeneTrack formatted data to Wiggle format
GeneTrackToBedGraph: convert GeneTrack formatted data to BedGraph format
InterpolateWig: interpolate missing values in a (Big)Wig file (useful for making continuous tracks from microarray data)
FastqIlluminaToSanger: quickly convert Illumina-encoded Phred scores in a FASTQ file to Sanger format
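
As an illustration of what a converter such as FastqIlluminaToSanger has to do, the sketch below re-encodes a FASTQ quality string. It assumes the input uses the Phred+64 encoding of Illumina 1.3 to 1.7 pipelines (older Solexa-scaled scores would need a different transformation), and the class and method names are hypothetical rather than taken from the toolkit.

```java
public class PhredConverter {
    private static final int ILLUMINA_OFFSET = 64; // assumed Phred+64 input
    private static final int SANGER_OFFSET = 33;   // Sanger Phred+33 output

    /** Re-encodes a Phred+64 quality string as Phred+33. */
    public static String illuminaToSanger(String quality) {
        StringBuilder sb = new StringBuilder(quality.length());
        for (int i = 0; i < quality.length(); i++) {
            // Recover the numeric Phred score from the Illumina character
            int phred = quality.charAt(i) - ILLUMINA_OFFSET;
            if (phred < 0) {
                throw new IllegalArgumentException(
                        "Not a Phred+64 quality character: " + quality.charAt(i));
            }
            // Write the same score back with the Sanger offset
            sb.append((char) (phred + SANGER_OFFSET));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(illuminaToSanger("hhhhB@"));
    }
}
```

Only the quality line of each FASTQ record changes; the header and sequence lines pass through unmodified.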

dna.

DNAPropertyCalculator: calculate sequence-specific DNA properties, such as GC content, Roll, Twist, etc.
FindNMers: find matches to a short sequence in a reference genome

ngs.

Autocorrelation: compute the autocorrelation function of Wiggle data intervals
BaseAlignCounts: generate occupancy tracks in Wiggle format from mapped reads in SAM/BAM/Bed format
FindAbsoluteMaxima: find absolute maxima of Wiggle data in intervals
FindOutlierRegions: find potential CNVs or deletions from sequencing coverage data
IntervalLengthDistribution: compute a histogram of interval/read lengths
IntervalStats: compute mean/min/max values of Wiggle data over a set of intervals
PowerSpectrum: compute the power spectrum of Wiggle data over a set of intervals
RollingReadLength: generate a Wiggle track with the mean read length over each base pair
ReadLengthDistributionMatrix: examine paired-end read length distribution across an interval of the genome
WaveletTransform: compute a Wavelet scaleogram across an interval of the genome
ExtractDataFromRegion: get data for a region from (Big)Wig file(s)

nucleosomes.

FindBoundaryNucleosomes: find the 5' and 3' nucleosomes for a set of intervals
GreedyCaller: call stereotypic nucleosome positions with a greedy algorithm
MapDyads: map read centers / approximated nucleosome positions into a Wiggle track
Phasogram: compute the distribution of read phases (distances between reads)
PercusDecomposition: compute a Percus energy decomposition from nucleosome data
PredictFAIRESignal: attempt to model FAIRE experiment data from nucleosome occupancy
DynaPro: solve single-particle statistical mechanics equilibria using a dynamic programming algorithm
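
One common way to compute a distribution of distances between reads, as Phasogram is described above, is to count read pairs separated by each distance. The sketch below does this on a per-base array of read start counts. It is an assumption about the general technique, not the toolkit's code: the class name, the counts representation, and the maxPhase parameter are all hypothetical.

```java
public class PhasogramSketch {

    /**
     * counts[x] is the number of reads starting at position x.
     * Returns hist, where hist[d] is the number of read pairs whose starts
     * are exactly d bp apart, for d = 1..maxPhase (hist[0] is left at zero).
     */
    public static long[] phasogram(int[] counts, int maxPhase) {
        if (maxPhase < 1) {
            throw new IllegalArgumentException("maxPhase must be at least 1");
        }
        long[] hist = new long[maxPhase + 1];
        for (int x = 0; x < counts.length; x++) {
            if (counts[x] == 0) {
                continue; // no reads start here, so no pairs to count
            }
            for (int d = 1; d <= maxPhase && x + d < counts.length; d++) {
                // Every read at x pairs with every read at x + d
                hist[d] += (long) counts[x] * counts[x + d];
            }
        }
        return hist;
    }

    public static void main(String[] args) {
        int[] counts = {1, 0, 0, 2, 0, 0, 1};
        System.out.println(java.util.Arrays.toString(phasogram(counts, 6)));
    }
}
```

For well-positioned nucleosomes, a histogram like this would be expected to show peaks at multiples of the nucleosome repeat length.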

visualization.

IntervalAverager: compute the average profile for a set of aligned intervals
KMeans: k-means clustering of a matrix of data
MatrixAligner: align intervals of Wiggle data into a matrix for visualization with matrix2png
StripMatrix: strip the headers off a matrix2png-format matrix for import into Matlab/R

wigmath.

Add: add multiple (Big)Wig files
Subtract: subtract two (Big)Wig files
Multiply: multiply (Big)Wig files together
Divide: divide two (Big)Wig files
Average: average multiple (Big)Wig files
ZScore: Z-score a (Big)Wig file
LogTransform: log-transform a (Big)Wig file
MovingAverageSmooth: smooth a (Big)Wig file with a box (moving average) kernel
GaussianSmooth: smooth a (Big)Wig file with a Gaussian kernel
Scale: scale a (Big)Wig file by a normalization constant
MeanShift: shift a (Big)Wig file to have a specified mean
Correlate: correlate multiple (Big)Wig files (Pearson/Spearman)
ValueDistribution: compute a histogram of values in a (Big)Wig file
Summary: generate summary descriptive statistics about a (Big)Wig file
Downsample: downsample a (Big)Wig file into windows
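
To make the smoothing operations concrete, here is a minimal sketch of a box (moving average) kernel of the kind MovingAverageSmooth is described as applying. The class name, the halfWidth parameter, and the choice to shrink the window at the ends of the data are assumptions for illustration; the toolkit's own edge handling may differ.

```java
public class BoxSmoother {

    /**
     * Replaces each value with the mean of a centered window of up to
     * 2 * halfWidth + 1 values. The window is truncated at the array edges.
     */
    public static double[] smooth(double[] values, int halfWidth) {
        if (halfWidth < 0) {
            throw new IllegalArgumentException("halfWidth must be non-negative");
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            int start = Math.max(0, i - halfWidth);
            int stop = Math.min(values.length - 1, i + halfWidth);
            double sum = 0;
            for (int j = start; j <= stop; j++) {
                sum += values[j];
            }
            // Divide by the number of values actually inside the window
            out[i] = sum / (stop - start + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] wig = {0, 3, 6, 3, 0};
        System.out.println(java.util.Arrays.toString(smooth(wig, 1)));
    }
}
```

A Gaussian kernel follows the same pattern, with each value in the window weighted by its distance from the center instead of counted equally.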

usage.

galaxy.

One-click installation is available for your local Galaxy instance through the Galaxy Tool Shed.

If you run a production Galaxy server, configuration files are provided for loading the applications into Galaxy manually. Unzip or check out the java-genomics-toolkit distribution into Galaxy's "tools" folder, and add the supplied tool_conf entries to your tool_conf.xml file.

shell.

Tools can also be run on the terminal, and helper scripts are provided for convenience. For more information and usage examples, see the GitHub page.

requirements.

java-genomics-toolkit requires Java 7, available at oracle.com.

download.

The recommended way to obtain the toolkit is to check out the source code from GitHub and build it using the provided Ant build script (simply call "ant").

In addition, precompiled, ready-to-use packages that include the JRE v7 are available for Linux platforms in 32-bit and 64-bit builds. If you want to try out the toolkit, this may be the quickest option.

to.do.

Parallelize: explore potential performance benefits of parallelizing the computation across multiple processes / cluster nodes

java-genomics-io

Those wishing to write their own scripts may be interested in java-genomics-io, the library upon which these applications are built. This library supports iterating or querying for data from Bed, BedGraph, GeneTrack, GFF, SAM, BAM, Wiggle, BigWig, and BigBed files with a consistent interface. ASCII files are indexed with Tabix as needed to perform queries efficiently. Writers are also available for writing Bed, BedGraph, GFF, SAM, BAM, and Wig files.

licensing.info.

java-genomics-toolkit is distributed under the GNU General Public License v3. See the included license.txt for more details.

java-genomics-toolkit was created by Timothy Palpant for work in the Lieb laboratory at UNC Chapel Hill.

java-genomics-toolkit utilizes multiple external libraries including:

java-genomics-io to parse genomic data files
BigWig/BigBed Reader (by the Broad Institute) to read BigWig/BigBed files
Picard / SAM-JDK for reading SAM/BAM files
JCommander for processing command-line arguments
JTransforms for FFT
log4j for logging
commons-lang for String and Array manipulation
commons-math for mathematics

contact.me.

Please contact me at tim [at] palpant.us with bugs, questions, comments, or suggestions.