Difference between revisions of "Preparing and loading data"

From GenomeView Manual
 
(43 intermediate revisions by 3 users not shown)
Line 1: Line 1:
  
 +
{{TOC|align=right}}
  
=== Data preparation ===
 
  
[http://genomeview.org/content/load-data Load data tutorial]
 
  
[http://genomeview.org/content/data-formats Supported data formats]
+
There are several easy ways to load up data into GenomeView. Before you load your data, you may want to make sure you're using a supported format from the list below. Generally, GenomeView will notify you if it doesn't understand your data.
[http://genomeview.org/content/preparing-fasta-files Fasta files]
 
  
[http://genomeview.org/content/preparing-feature-files Feature files]
+
==Loading data ==
 +
You can load your data files ...
 +
* ... by selecting "work with my data" in the [[Genome Explorer]]
 +
* ... by dragging them onto GenomeView
 +
* ... by selecting the 'File' menu and then 'Load data...' ([http://genomeview.org/content/load-data tutorial])
 +
* ... by pressing CTRL+O
 +
* ... by specifying them as argument on the [[command-line use|command-line]]
 +
* ... by loading a [[session file]]
  
[http://genomeview.org/content/preparing-short-read-alignments Read data]
 
  
[http://genomeview.org/content/preparing-pileup Coverage plots]
+
You can load preloaded data ...
 +
* ... by selecting a genome from the [[Genome Explorer]]
 +
* ... following a link from a [[Integration|GenomeView enabled website]]
  
==Preparing data recipe==
+
==Data preparation recipe==
  
 
# Match identifiers: GenomeView uses the identifiers to link different sources, so make sure that the identifiers match (case-sensitive).
 
# Match identifiers: GenomeView uses the identifiers to link different sources, so make sure that the identifiers match (case-sensitive).
 
# Create indices for data files that need it (check table below)
 
# Create indices for data files that need it (check table below)
 
# Convert file formats to get desired visuals (check table below)
 
# Convert file formats to get desired visuals (check table below)
# Load data
+
# Load data (see above)
  
 +
'''Why index files?''' Indexing creates a look-up table that lets GenomeView load data on-the-fly. This speeds up browsing and loading, and significantly reduces the amount of memory you need. For some file formats we recommend that you create indices, for others we do not. See the table below for more details and links to instructions.
  
<h2>Index your files!</h2>
 
<strong>Make sure you index your large files: <a href="/content/preparing-fasta-files">reference genome</a>, <a href="/content/preparing-short-read-alignments">NGS data sets (SAM / BAM)</a>, <a href="/content/preparing-feature-files">annotation</a> and <a href="/content/preparing-pileup">read coverage plots</a></strong> While this is not required, it will speed up the process of browsing and loading data, as well as significantly reduce the amount of memory you need.
 
  
 +
==Recommended file formats ==
 +
This is a list of file formats that are recommended for different data types. See the full list of data types in the section below.
 +
 +
 +
 +
<table border=1>
 +
<tr><th>Data type</th><th>Recommended file format</th><th>Instructions</th></tr>
 +
<tr><td>Reference sequence</td><td>fasta</td><td>[[Preparing reference sequence]]</td></tr>
 +
<tr><td>Annotation</td><td>GFF3</td><td>[[Preparing annotation]]</td></tr>
 +
<tr><td>Read alignments</td><td>BAM</td><td>[[Preparing read data]]</td></tr>
 +
<tr><td>Variation</td><td>VCF</td><td>[[Preparing VCF data]]</td></tr>
 +
<tr><td>Coverage summary - continuous values</td><td>TDF</td><td>[[Preparing value data]]</td></tr>
 +
<tr><td>Whole genome alignments</td><td>MAF</td><td>[[Preparing whole genome alignments]]</td></tr>
 +
<tr><td>Genome synteny</td><td>experimental</td><td>[[Preparing genome synteny]]</td></tr>
 +
</table>
 +
 +
== Supported data formats ==
  
<h2>Input formats</h2>
+
=== Reference sequence ===
 
<table border=1>
 
<table border=1>
 
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
Line 32: Line 54:
 
<tr><th></th><th></th><th></th><th>unindexed***</th><th>indexed</th><th></th></tr>
 
<tr><th></th><th></th><th></th><th>unindexed***</th><th>indexed</th><th></th></tr>
  
<tr><td valign="top" rowspan=2>Reference sequence</td><td><b>fasta</b> <sup>¤</sup></td><td>Recommended<br/>[[Prepare FASTA]]</td><td>50 Mb</td><td>unlimited</td><td>GenomeView will automagically create index for you if you don't have one.</td></tr>
+
<tr><td valign="top" rowspan=2>Reference sequence</td><td><b>fasta</b> <sup>¤</sup></td><td>Recommended<br/>[[Index FASTA]]</td><td>50 Mb</td><td>unlimited</td><td>GenomeView will ask whether to create an index for you if you don't have one and the file is very large.</td></tr>
  
 
<tr><td>embl, genbank</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>EMBL and genbank are mixed file formats that can contain both annotation and reference sequence at the same time.</td></tr>
 
<tr><td>embl, genbank</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>EMBL and genbank are mixed file formats that can contain both annotation and reference sequence at the same time.</td></tr>
  
<tr><td valign="top" rowspan=4>Annotation</td><td><b><a href="/content/preparing-feature-files">gff</a></b> <sup>&#164;</sup></td><td>Not recommended</td><td>50 Mb</td><td>unlimited</td><td></td></tr>
+
</table>
 +
=== Annotation ===
 +
 
 +
<table border=1>
 +
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 +
<tr><td valign="top" rowspan=4>Annotation</td><td><b>gff</b> <sup>&#164;</sup></td><td>Not recommended
 +
[[Index GFF]]</td><td>50 Mb</td><td>unlimited</td><td></td></tr>
  
 
<tr><td>embl, genbank</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>EMBL and genbank are mixed file formats that can contain both annotation and reference sequence at the same time.</td></tr>
 
<tr><td>embl, genbank</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>EMBL and genbank are mixed file formats that can contain both annotation and reference sequence at the same time.</td></tr>
  
<tr><td>bed</td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>By default data from a bed file is added to the CDS track, if you want it in a different track, you have to add a line a the top of the file 'track name=Track_name'. No white-space is allowed in the track name.</td></tr>
+
<tr><td>bed</td><td>Not recommended [[Index BED]]</td><td>50 Mb or less</td><td>unlimited</td><td>By default, data from a bed file is added to the CDS track. If you want it in a different track, add the line 'track name=Track_name' at the top of the file. No white-space is allowed in the track name.</td></tr>
 
<tr><td>ptt, tbl </td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>Other standard annotation formats GenomeView understands</td></tr>
 
<tr><td>ptt, tbl </td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>Other standard annotation formats GenomeView understands</td></tr>
<tr><td>various formats</td><td></td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>GenomeView can directly parse the output of the following programs: Blast, GeneMark, TransTermHP, FindPeaks, MaqSNP, tRNA-scan</td></tr>
+
<tr><td></td><td>various formats</td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>GenomeView can directly parse the output of the following programs: Blast, GeneMark, TransTermHP, FindPeaks, MaqSNP, tRNA-scan</td></tr>
 +
</table>
  
 +
=== Whole genome alignments ===
 +
<table border=1>
 +
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 
<tr><td valign="top" rowspan=3>Multiple genome alignment</td><td><b>maf</b> <sup>&#164;</sup></td><td>Recommended</td><td>100 Mb</td><td>unlimited</td><td>GenomeView will prompt you to create a compressed maf file and index it for you if you try to load an unindexed maf file.<br/>MAF is the recommended file format for whole genome alignment of large/complex genomes</td></tr>
 
<tr><td valign="top" rowspan=3>Multiple genome alignment</td><td><b>maf</b> <sup>&#164;</sup></td><td>Recommended</td><td>100 Mb</td><td>unlimited</td><td>GenomeView will prompt you to create a compressed maf file and index it for you if you try to load an unindexed maf file.<br/>MAF is the recommended file format for whole genome alignment of large/complex genomes</td></tr>
  
Line 49: Line 81:
  
 
<tr><td>aln, ClustalW</td><td>Not possible</td><td>100 Mb</td><td>--</td></tr>
 
<tr><td>aln, ClustalW</td><td>Not possible</td><td>100 Mb</td><td>--</td></tr>
 +
</table>
  
<tr><td valign="top" rowspan=2>Sequence read alignment</td><td><b><a href="/content/preparing-short-read-alignments">bam</a></b> <sup>&#164;</sup></td><td>Required</td><td>--</td><td>unlimited</td><td>GenomeView will prompt you if there is no index and will create one for you.</td></tr>
+
=== Read alignments ===
 +
<table border=1>
 +
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 +
<tr><td valign="top" rowspan=2>Sequence read alignment</td><td><b>bam</b> <sup>&#164;</sup><br>[[Preparing read data]]</td><td>Required</td><td>--</td><td>unlimited</td><td>GenomeView will prompt you if there is no index and will create one for you. GenomeView cannot automatically sort BAM files.</td></tr>
  
 
<tr><td>MAQ, MapView, BroadSolexa</td><td>Not possible</td><td>100 Mb</td><td>--</td></tr>
 
<tr><td>MAQ, MapView, BroadSolexa</td><td>Not possible</td><td>100 Mb</td><td>--</td></tr>
 +
</table>
  
<tr><td valign="top" rowspan=4>Read coverage summary</td><td><b> <a href="/content/preparing-pileup#tdf">tdf</a></b> <sup>&#164;</sup></td><td>Native</td><td>unlimited</td><td>unlimited</td><td>TDF files <a href="/content/preparing-pileup#tdf">can be created with the tdformat tool</a> that is available for <a href="https://sourceforge.net/projects/genomeview/files/TDformat/">download.</a></td></tr>
+
=== Read coverage summary - continuous value data ===
 +
<table border=1>
 +
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 +
<tr><td valign="top" rowspan=4>Read coverage summary</td><td><b> [[tdf]]</b> <sup>&#164;</sup></td><td>Native</td><td>unlimited</td><td>unlimited</td><td>[[TDF]] files can be created with the [[bam2tdf]] tool that is available for [https://sourceforge.net/projects/genomeview/files/TDformat/ download.]</td></tr>
  
 
<tr><td>bigwig</td><td>Native</td><td>unlimited</td><td>unlimited</td><td>This format can be used for any wig file, not just read coverage</td></tr>
 
<tr><td>bigwig</td><td>Native</td><td>unlimited</td><td>unlimited</td><td>This format can be used for any wig file, not just read coverage</td></tr>
  
<tr><td> <a href="/content/preparing-pileup#pileup">pileup</a></td><td>Required</td><td>--</td><td>unlimited</td><td>The pileup format becomes slow when you have extreme read depth (>5000 x coverage)</td></tr>
+
<tr><td>[[pileup]]</td><td>Required</td><td>--</td><td>unlimited</td><td>The pileup format becomes slow when you have extreme read depth (>5000 x coverage)</td></tr>
  
<tr><td>wig</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>We strongly recommend to convert your wig files to bigwig or TDF. GenomeView can automatically convert wig files to TDF. Caveats: 'track' information should all be on a single line, 'browser' lines will be ignored as the are specific to the UCSC Genome Browser. WIG files need to be sorted by chromosome and by genomic coordinate within the chromosome. BedGraph as well as Wiggle_0 format is supported. For the wiggle_0 type, both variableStep and fixedStep should work.</td></tr>
+
<tr><td>wig</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>We strongly recommend to [[wig2tdf|convert your wig files to TDF]].  
 +
GenomeView can automatically convert wig files to TDF. Caveats: 'track' information should all be on a single line; 'browser' lines will be ignored as they are specific to the UCSC Genome Browser. WIG files need to be sorted by chromosome and by genomic coordinate within the chromosome. BedGraph as well as Wiggle_0 format is supported. For the wiggle_0 type, both variableStep and fixedStep should work.</td></tr>
 +
</table>
  
<tr><td>Allele diversity summary</td><td><b> <a href="/content/preparing-pileup#pileup">pileup</a></b> <sup>&#164;</sup></td><td>Required</td><td>--</td><td>unlimited</td><td>The pileup format becomes slow when you have extreme read depth (>5000 x coverage)</td></tr>
+
=== Genome variation and diversity ===
 +
<table border=1>
 +
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 +
<tr><td>Genome variation</td><td><b> [[vcf]]</b> <sup>&#164;</sup></td><td>Not recommended</td><td>--</td><td>unlimited</td><td>It is recommended to run [[reducevcf]] on VCF files before loading them; this speeds up loading significantly.</td></tr>
  
 
</table>
 
</table>
* Indicates whether this file format can/should be indexed. <br/>
 
  
** Recommended maximum file size. First value is without index, the second with index. This values are only guidelines. When loading multiple data sets, you should add the sizes.<br/>
 
  
*** Unindexed data files can be gzip compressed.
+
=== Allele diversity ===
 +
<table border=1>
 +
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
 +
<tr><td>Allele diversity summary</td><td><b> [[pileup]]</b> <sup>&#164;</sup></td><td>Required</td><td>--</td><td>unlimited</td><td>The pileup format becomes slow when you have extreme read depth (>5000 x coverage)</td></tr>
 +
 
 +
</table>
 +
* Indicates whether this file format can/should be indexed. <br/>
 +
** Recommended maximum file size. The first value is without an index, the second with an index. These values are only guidelines. When loading multiple data sets, you should add the sizes.<br/>
 +
*** Unindexed data files can be gzip compressed.
  
 
<sup>&#164;</sup> Recommended file format for this data type.
 
<sup>&#164;</sup> Recommended file format for this data type.
Line 83: Line 134:
 
<h2>Converting formats</h2>
 
<h2>Converting formats</h2>
 
<a href="http://genomeview.org/loki/">We offer a few tools to convert files between formats.</a>
 
<a href="http://genomeview.org/loki/">We offer a few tools to convert files between formats.</a>
 +
 +
== Previous documentation pages ==
 +
 +
 +
 +
[http://genomeview.org/content/data-formats Supported data formats]
 +
[http://genomeview.org/content/preparing-fasta-files Fasta files]
 +
 +
[http://genomeview.org/content/preparing-feature-files Feature files]
 +
 +
[http://genomeview.org/content/preparing-short-read-alignments Read data]
 +
 +
[http://genomeview.org/content/preparing-pileup Coverage plots]

Latest revision as of 17:21, 9 January 2014


There are several easy ways to load up data into GenomeView. Before you load your data, you may want to make sure you're using a supported format from the list below. Generally, GenomeView will notify you if it doesn't understand your data.

Loading data

You can load your data files ...

  • ... by selecting "work with my data" in the Genome Explorer
  • ... by dragging them onto GenomeView
  • ... by selecting the 'File' menu and then 'Load data...' (tutorial)
  • ... by pressing CTRL+O
  • ... by specifying them as arguments on the command-line
  • ... by loading a session file


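You can load preloaded data ...

  • ... by selecting a genome from the Genome Explorer
  • ... by following a link from a GenomeView enabled website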

Data preparation recipe

  1. Match identifiers: GenomeView uses the identifiers to link different sources, so make sure that the identifiers match (case-sensitive).
  2. Create indices for data files that need it (check table below)
  3. Convert file formats to get desired visuals (check table below)
  4. Load data (see above)
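
Step 1 of the recipe can be checked before loading anything. The following is a minimal sketch, assuming the reference is a FASTA file and the annotation is a GFF file; the helper names (`fasta_ids`, `gff_ids`, `unmatched_ids`) are hypothetical and not part of GenomeView. It reports identifiers that appear in the annotation but not in the reference, using a case-sensitive comparison.

```python
import io

def fasta_ids(handle):
    """Return the set of sequence identifiers found in FASTA header lines."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids

def gff_ids(handle):
    """Return the set of identifiers used in the first column of a GFF file."""
    ids = set()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines follow
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        ids.add(line.split("\t", 1)[0])
    return ids

def unmatched_ids(reference_ids, other_ids):
    """Identifiers present in another file but absent from the reference."""
    return sorted(other_ids - reference_ids)

# Small in-memory example; real files would be opened with open(path).
fasta = io.StringIO(">chr1 test\nACGT\n>chr2\nGGCC\n")
gff = io.StringIO(
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "Chr2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2\n"
)
missing = unmatched_ids(fasta_ids(fasta), gff_ids(gff))
print(missing)  # 'Chr2' differs from 'chr2' only in case, so it is reported
```

An empty result means every identifier in the annotation has an exact counterpart in the reference.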

Why index files? Indexing creates a look-up table that lets GenomeView load data on-the-fly. This speeds up browsing and loading, and significantly reduces the amount of memory you need. For some file formats we recommend that you create indices, for others we do not. See the table below for more details and links to instructions.
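
To illustrate the look-up table idea, here is a toy sketch in Python. It is not the index format GenomeView or any standard tool uses; the functions `build_index` and `fetch` are hypothetical. The index maps each sequence identifier to the byte offset of its record, so a single sequence can be read with one seek instead of scanning or holding the whole file in memory.

```python
import io

def build_index(handle):
    """Map each sequence identifier to the byte offset of its header line."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode()] = offset
    return index

def fetch(handle, index, seq_id):
    """Jump straight to one record and read only that sequence."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break  # next record starts here
        parts.append(line.strip())
    return b"".join(parts).decode()

# In-memory stand-in for a FASTA file opened in binary mode.
data = io.BytesIO(b">chr1\nACGT\nAC\n>chr2\nGGCC\n")
idx = build_index(data)
seq = fetch(data, idx, "chr2")
print(idx, seq)
```

The index is built once; every later request touches only the bytes of the requested record, which is where the savings in time and memory come from.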


Recommended file formats

This is a list of file formats that are recommended for different data types. See the full list of data types in the section below.


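<table border=1>
<tr><th>Data type</th><th>Recommended file format</th><th>Instructions</th></tr>
<tr><td>Reference sequence</td><td>fasta</td><td>Preparing reference sequence</td></tr>
<tr><td>Annotation</td><td>GFF3</td><td>Preparing annotation</td></tr>
<tr><td>Read alignments</td><td>BAM</td><td>Preparing read data</td></tr>
<tr><td>Variation</td><td>VCF</td><td>Preparing VCF data</td></tr>
<tr><td>Coverage summary - continuous values</td><td>TDF</td><td>Preparing value data</td></tr>
<tr><td>Whole genome alignments</td><td>MAF</td><td>Preparing whole genome alignments</td></tr>
<tr><td>Genome synteny</td><td>experimental</td><td>Preparing genome synteny</td></tr>
</table>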

Supported data formats

Reference sequence

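<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><th></th><th></th><th></th><th>unindexed***</th><th>indexed</th><th></th></tr>
<tr><td valign="top" rowspan=2>Reference sequence</td><td><b>fasta</b> <sup>¤</sup></td><td>Recommended<br/>Index FASTA</td><td>50 Mb</td><td>unlimited</td><td>GenomeView will ask whether to create an index for you if you don't have one and the file is very large.</td></tr>
<tr><td>embl, genbank</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>EMBL and genbank are mixed file formats that can contain both annotation and reference sequence at the same time.</td></tr>
</table>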

Annotation

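<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><td valign="top" rowspan=4>Annotation</td><td><b>gff</b> <sup>¤</sup></td><td>Not recommended<br/>Index GFF</td><td>50 Mb</td><td>unlimited</td><td></td></tr>
<tr><td>embl, genbank</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>EMBL and genbank are mixed file formats that can contain both annotation and reference sequence at the same time.</td></tr>
<tr><td>bed</td><td>Not recommended<br/>Index BED</td><td>50 Mb or less</td><td>unlimited</td><td>By default, data from a bed file is added to the CDS track. If you want it in a different track, add the line 'track name=Track_name' at the top of the file. No white-space is allowed in the track name.</td></tr>
<tr><td>ptt, tbl</td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>Other standard annotation formats GenomeView understands</td></tr>
<tr><td></td><td>various formats</td><td>Not possible</td><td>50 Mb or less</td><td>--</td><td>GenomeView can directly parse the output of the following programs: Blast, GeneMark, TransTermHP, FindPeaks, MaqSNP, tRNA-scan</td></tr>
</table>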
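
The bed track line described in the table can be added with a few lines of Python. This is a minimal sketch: the function name `add_track_line` is hypothetical, and replacing an existing leading track line is a design choice made here, not documented GenomeView behaviour.

```python
def add_track_line(bed_text, track_name):
    """Return bed_text with 'track name=<track_name>' as its first line."""
    # The track name must not contain white-space.
    if not track_name or any(ch.isspace() for ch in track_name):
        raise ValueError("track name must be non-empty and contain no white-space")
    lines = bed_text.splitlines()
    # Replace an existing leading track line instead of stacking a second one.
    if lines and lines[0].split()[:1] == ["track"]:
        lines = lines[1:]
    return "\n".join(["track name=" + track_name] + lines) + "\n"

tagged = add_track_line("chr1\t0\t10\tfeat\n", "My_track")
print(tagged, end="")
```

Rejecting names with white-space up front avoids producing a file whose track line would be misread.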

Whole genome alignments

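<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><td valign="top" rowspan=3>Multiple genome alignment</td><td><b>maf</b> <sup>¤</sup></td><td>Recommended</td><td>100 Mb</td><td>unlimited</td><td>GenomeView will prompt you to create a compressed maf file and index it for you if you try to load an unindexed maf file.<br/>MAF is the recommended file format for whole genome alignment of large/complex genomes</td></tr>
<tr><td>multi-fasta <sup>¤</sup></td><td>Not possible</td><td>100 Mb</td><td>--</td><td>Recommended for small/simple genomes with a near 1:1 relationship.</td></tr>
<tr><td>aln, ClustalW</td><td>Not possible</td><td>100 Mb</td><td>--</td><td></td></tr>
</table>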

Read alignments

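<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><td valign="top" rowspan=2>Sequence read alignment</td><td><b>bam</b> <sup>¤</sup><br/>Preparing read data</td><td>Required</td><td>--</td><td>unlimited</td><td>GenomeView will prompt you if there is no index and will create one for you. GenomeView cannot automatically sort BAM files.</td></tr>
<tr><td>MAQ, MapView, BroadSolexa</td><td>Not possible</td><td>100 Mb</td><td>--</td><td></td></tr>
</table>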

Read coverage summary - continuous value data

<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><td valign="top" rowspan=4>Read coverage summary</td><td><b>tdf</b> <sup>¤</sup></td><td>Native</td><td>unlimited</td><td>unlimited</td><td>TDF files can be created with the bam2tdf tool that is available for download.</td></tr>
<tr><td>bigwig</td><td>Native</td><td>unlimited</td><td>unlimited</td><td>This format can be used for any wig file, not just read coverage</td></tr>
<tr><td>pileup</td><td>Required</td><td>--</td><td>unlimited</td><td>The pileup format becomes slow when you have extreme read depth (>5000 x coverage)</td></tr>
<tr><td>wig</td><td>Not possible</td><td>50 Mb</td><td>--</td><td>We strongly recommend converting your wig files to TDF. GenomeView can automatically convert wig files to TDF. Caveats: 'track' information should all be on a single line; 'browser' lines will be ignored as they are specific to the UCSC Genome Browser. WIG files need to be sorted by chromosome and by genomic coordinate within the chromosome. BedGraph as well as Wiggle_0 format is supported. For the wiggle_0 type, both variableStep and fixedStep should work.</td></tr>
</table>
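
The sorting requirement for wig data can be met with a small script. The sketch below handles only the BedGraph flavour (four columns: chromosome, start, end, value), not variableStep or fixedStep blocks. The function name `sort_bedgraph` is hypothetical, and the plain lexical chromosome order is an assumption, since the text above does not say which chromosome order is expected.

```python
def sort_bedgraph(text):
    """Sort BedGraph records by chromosome, then by start coordinate."""
    header, records = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # drop blank lines
        if fields[0] in ("track", "browser") or fields[0].startswith("#"):
            header.append(line)  # keep header lines at the top, in order
        else:
            records.append((fields[0], int(fields[1]), line))
    # Assumption: lexical chromosome order; adapt the key if another order is needed.
    records.sort(key=lambda r: (r[0], r[1]))
    return "\n".join(header + [r[2] for r in records]) + "\n"

example = (
    "track type=bedGraph name=cov\n"
    "chr2\t0\t5\t1.0\n"
    "chr1\t10\t20\t3.5\n"
    "chr1\t0\t10\t2.0\n"
)
sorted_text = sort_bedgraph(example)
print(sorted_text, end="")
```

The script reads the whole file into memory, which is reasonable for files within the 50 Mb guideline given for unindexed wig data.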

Genome variation and diversity

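<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><td>Genome variation</td><td><b>vcf</b> <sup>¤</sup></td><td>Not recommended</td><td>--</td><td>unlimited</td><td>It is recommended to run reducevcf on VCF files before loading them; this speeds up loading significantly.</td></tr>
</table>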


Allele diversity

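<table border=1>
<tr><th>Data type</th><th>File format</th><th>Index*</th><th colspan=2>Max size**</th><th>Comments</th></tr>
<tr><td>Allele diversity summary</td><td><b>pileup</b> <sup>¤</sup></td><td>Required</td><td>--</td><td>unlimited</td><td>The pileup format becomes slow when you have extreme read depth (>5000 x coverage)</td></tr>
</table>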
* Indicates whether this file format can/should be indexed. 
** Recommended maximum file size. The first value is without an index, the second with an index. These values are only guidelines. When loading multiple data sets, you should add the sizes.
*** Unindexed data files can be gzip compressed.

¤ Recommended file format for this data type.



Output formats

(Modified) annotations can be saved as either GFF or EMBL.

All loaded data can be exported in its original format. The export will not include modifications.

Converting formats

We offer a few tools to convert files between formats: http://genomeview.org/loki/

Previous documentation pages

Supported data formats

Fasta files

Feature files

Read data

Coverage plots