Difference between revisions of "Preparing and loading data"
Thomas Admin
Revision as of 21:00, 3 September 2013
Data preparation
Preparing data recipe
- Match identifiers: GenomeView uses the identifiers to link different sources, so make sure that the identifiers match (case-sensitive).
- Create indices for the data files that need them (see the table below)
- Convert file formats to get the desired visuals (see the table below)
- Load data
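The first step of the recipe can be checked before loading anything. The sketch below is one possible way to do it, not part of GenomeView: it assumes the identifier is the first whitespace-delimited token of each FASTA header and the first tab-separated column of each GFF feature line, and the function names and sample data are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence identifiers from FASTA header lines.

    Assumption: the identifier is the first token after '>'.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gff_ids(lines):
    """Collect identifiers from column 1 of GFF feature lines."""
    ids = set()
    for line in lines:
        # A '##FASTA' directive starts an embedded sequence section.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, comments and other directives.
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_ids(reference_lines, annotation_lines):
    """Identifiers used in the annotation but absent from the reference.

    The comparison is case-sensitive, as the recipe requires.
    """
    return sorted(gff_ids(annotation_lines) - fasta_ids(reference_lines))


# Hypothetical sample data: 'Chr2' differs from 'chr2' only in case.
reference = [">chr1 example sequence", "ACGT", ">chr2", "GGCC"]
annotation = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1\t4\t.\t+\t.\tID=g1",
    "Chr2\tdemo\tgene\t1\t4\t.\t+\t.\tID=g2",
]
print(unmatched_ids(reference, annotation))  # ['Chr2']
```

An empty list means every annotation identifier has a counterpart in the reference; anything else needs to be renamed in one of the two files.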
Index your files!
Make sure you index your large files: <a href="/content/preparing-fasta-files">reference genome</a>, <a href="/content/preparing-short-read-alignments">NGS data sets (SAM / BAM)</a>, <a href="/content/preparing-feature-files">annotation</a> and <a href="/content/preparing-pileup">read coverage plots</a>. While indexing is not required, it speeds up browsing and loading data and significantly reduces the amount of memory you need.
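A quick way to spot files that still need an index is to look for the index file next to each data file. The sketch below assumes the samtools naming conventions (a `.fai` file beside a FASTA file, a `.bai` file beside a BAM file); this page does not state which names GenomeView expects, so treat the suffix table and the file names as assumptions.

```python
import os

# Assumed naming conventions (samtools style); not stated on this page.
INDEX_SUFFIXES = {
    ".fasta": ".fai",
    ".fa": ".fai",
    ".bam": ".bai",
}


def missing_indexes(paths, exists=os.path.exists):
    """Return the data files in `paths` whose expected index file is absent.

    `exists` can be swapped for any callable, which keeps the check
    testable without touching the file system.
    """
    missing = []
    for path in paths:
        for data_suffix, index_suffix in INDEX_SUFFIXES.items():
            if path.endswith(data_suffix) and not exists(path + index_suffix):
                missing.append(path)
                break
    return missing


# Hypothetical directory listing: the BAM file has no index yet.
present = {"genome.fasta", "genome.fasta.fai", "reads.bam"}
print(missing_indexes(["genome.fasta", "reads.bam", "notes.txt"],
                      exists=present.__contains__))  # ['reads.bam']
```

Files with other extensions are ignored rather than reported, since the sketch only knows the suffixes listed in `INDEX_SUFFIXES`.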
Input formats
| Data type | File format | Index* | Max size** (unindexed***) | Max size** (indexed) | Comments |
|---|---|---|---|---|---|
| Reference sequence | fasta ¤ | Recommended | 50 Mb | unlimited | GenomeView will automatically create an index for you if you don't have one. |
| Reference sequence | embl, genbank | Not possible | 50 Mb | -- | EMBL and GenBank are mixed file formats that can contain both annotation and reference sequence at the same time. |
| Annotation | <a href="/content/preparing-feature-files">gff</a> ¤ | Not recommended | 50 Mb | unlimited | |
| Annotation | embl, genbank | Not possible | 50 Mb | -- | EMBL and GenBank are mixed file formats that can contain both annotation and reference sequence at the same time. |
| Annotation | bed | Not possible | 50 Mb or less | -- | By default, data from a BED file is added to the CDS track. To put it in a different track, add the line 'track name=Track_name' at the top of the file. No white-space is allowed in the track name. |
| Annotation | ptt, tbl | Not possible | 50 Mb or less | -- | Other standard annotation formats GenomeView understands. |
| Annotation | various formats | Not possible | 50 Mb or less | -- | GenomeView can directly parse the output of the following programs: Blast, GeneMark, TransTermHP, FindPeaks, MaqSNP, tRNA-scan. |
| Multiple genome alignment | maf ¤ | Recommended | 100 Mb | unlimited | If you try to load an unindexed MAF file, GenomeView will prompt you to create a compressed MAF file and index it for you. MAF is the recommended file format for whole-genome alignment of large/complex genomes. |
| Multiple genome alignment | multi-fasta ¤ | Not possible | 100 Mb | -- | Recommended for small/simple genomes with a near 1:1 relationship. |
| Multiple genome alignment | aln, ClustalW | Not possible | 100 Mb | -- | |
| Sequence read alignment | <a href="/content/preparing-short-read-alignments">bam</a> ¤ | Required | -- | unlimited | GenomeView will prompt you if there is no index and will create one for you. |
| Sequence read alignment | MAQ, MapView, BroadSolexa | Not possible | 100 Mb | -- | |
| Read coverage summary | <a href="/content/preparing-pileup#tdf">tdf</a> ¤ | Native | unlimited | unlimited | TDF files <a href="/content/preparing-pileup#tdf">can be created with the tdformat tool</a> that is available for <a href="https://sourceforge.net/projects/genomeview/files/TDformat/">download</a>. |
| Read coverage summary | bigwig | Native | unlimited | unlimited | This format can be used for any wig file, not just read coverage. |
| Read coverage summary | <a href="/content/preparing-pileup#pileup">pileup</a> | Required | -- | unlimited | The pileup format becomes slow when you have extreme read depth (>5000x coverage). |
| Read coverage summary | wig | Not possible | 50 Mb | -- | We strongly recommend converting your wig files to bigwig or TDF. GenomeView can automatically convert wig files to TDF. Caveats: 'track' information should all be on a single line; 'browser' lines will be ignored, as they are specific to the UCSC Genome Browser. WIG files need to be sorted by chromosome and by genomic coordinate within the chromosome. BedGraph as well as Wiggle_0 format is supported. For the wiggle_0 type, both variableStep and fixedStep should work. |
| Allele diversity summary | <a href="/content/preparing-pileup#pileup">pileup</a> ¤ | Required | -- | unlimited | The pileup format becomes slow when you have extreme read depth (>5000x coverage). |
* Indicates whether this file format can/should be indexed.
** Recommended maximum file size. The first value is without an index, the second with an index. These values are only guidelines. When loading multiple data sets, add their sizes together.
*** Unindexed data files can be gzip compressed.
¤ Recommended file format for this data type.
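The BED caveat in the table (data goes to the CDS track unless the file starts with a 'track name=Track_name' line, and the name may not contain white-space) can be handled with a small helper. This is a minimal sketch under stated assumptions: the function name is hypothetical, white-space in the requested name is replaced with underscores, and an existing leading track line is replaced rather than kept.

```python
import re


def with_track_line(bed_text, track_name):
    """Return BED text that starts with a 'track name=...' line.

    White-space in `track_name` is replaced with underscores, because
    the track name may not contain white-space.
    """
    safe_name = re.sub(r"\s+", "_", track_name.strip())
    if not safe_name:
        raise ValueError("track name must not be empty")
    lines = bed_text.splitlines()
    # Assumption: an existing leading track line is replaced, not kept.
    if lines and lines[0].startswith("track "):
        lines = lines[1:]
    return "\n".join([f"track name={safe_name}"] + lines) + "\n"


# Hypothetical one-feature BED file.
bed = "chr1\t100\t200\tfeature1\n"
print(with_track_line(bed, "My peaks"), end="")
```

The result can be written back to a file and loaded in place of the original BED file.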
Output formats
(Modified) annotations can be saved as either GFF or EMBL.
All loaded data can be exported in its original format. The export will not include modifications.
Converting formats
<a href="http://genomeview.org/loki/">We offer a few tools to convert files between formats.</a>