P. abies v1.0 promoter sequences
Promoter sequences can be downloaded from the ConGenIE FTP.

Let’s take a closer look at the promoter sequence file (Pabies-2000-upstream-regions.fa.gz). It contains 57,970 sequences out of 70,736 total genes. These 57,970 upstream sequences comprise High Confidence (24,397), Medium Confidence (25,147), and Low Confidence (8,426) genes. 34% of the sequences (19,982) have a regulatory region shorter than 2000bp, while the remainder (38,808/57,970) are exactly 2000bp.

The gene annotation contains a total of 70,736 CDS sequences, of which 58,213 begin with a start codon (ATG). These 58,213 CDS sequences comprise High Confidence (24,476), Medium Confidence (25,273), and Low Confidence (8,464) genes.
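As a rough illustration of how a count such as "CDS sequences that begin with ATG" could be obtained, the sketch below defines a small helper, `count_atg_start` (a name chosen here, not part of any ConGenIE tool). It assumes a plain FASTA stream on standard input or in the named files, and it joins wrapped sequence lines before testing the first three bases.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records whose sequence begins with ATG.
# Sequence lines wrapped over several rows are joined first, and
# lower-case bases are accepted.
count_atg_start() {
    awk '
        /^>/ {
            # A new header closes the previous record, so test it now.
            if (seen && toupper(substr(seq, 1, 3)) == "ATG") n++
            seq = ""; seen = 1; next
        }
        { seq = seq $0 }                      # append a sequence line
        END {
            # Test the last record in the input.
            if (seen && toupper(substr(seq, 1, 3)) == "ATG") n++
            print n + 0
        }
    ' "$@"
}

# Made-up records for illustration: g1 and g3 start with ATG, g2 does not.
printf '>g1\nATGAAA\nTTT\n>g2\nCCCATG\n>g3\natgccc\n' | count_atg_start
```

For a compressed CDS file, the same function could be fed with `gzip -dc` followed by the file name and a pipe.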

Now we need to extract the upstream sequences of genes whose CDS begins with a start codon, in other words the intersection of the two sets described in the paragraphs above. This gives 57,970 sequences, as shown in the Venn diagram.
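The intersection described above can be sketched with standard tools. The example assumes two plain-text lists with one gene ID per line; the file names `upstream_ids.txt` and `cds_atg_ids.txt` and the function `shared_ids` are hypothetical, and the IDs are invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs that occur in both files.
# comm needs sorted input, so each list is sorted into a temporary file.
shared_ids() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    sort -u "$1" > "$tmp_a"
    sort -u "$2" > "$tmp_b"
    comm -12 "$tmp_a" "$tmp_b"    # -12 keeps only lines common to both
    rm -f "$tmp_a" "$tmp_b"
}

# Made-up IDs for illustration only.
dir=$(mktemp -d)
printf 'gene1\ngene2\ngene3\n' > "$dir/upstream_ids.txt"
printf 'gene2\ngene3\ngene4\n' > "$dir/cds_atg_ids.txt"
shared_ids "$dir/upstream_ids.txt" "$dir/cds_atg_ids.txt" | wc -l
rm -rf "$dir"
```

Here the two lists share gene2 and gene3, so the final count is two. With real data, the ID lists would first have to be derived from the FASTA headers, whose exact layout is not described here.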

Therefore, the promoter sequence file (Pabies-2000-upstream-regions.fa.gz) contains the set of upstream sequences for genes whose CDS has both start and end codons. The Venn diagram summarises these results.

Frequently Asked Questions

Why can’t we extract a regulatory region for all 70,736 genes?
12,766 (70,736-57,970) sequences are missing from the 2000bp/5000bp upstream sequence file. This could be due to one of the following reasons.
  1. The positive-stranded gene starts at the 1st position, or the CDS doesn't contain start and end codons (8,686).
  1. The minus-stranded gene stop is equal to or lower than the upstream sequence length (4,080).
  1. The total scaffold length is lower than the upstream sequence length.
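Do all gene models have a CDS that contains a start codon?
All gene models included in the upstream sequence file do: as shown in the Venn diagram, every upstream sequence belongs to a gene whose CDS contains start and end codons. Across the full annotation, 58,213 of the 70,736 CDS sequences begin with a start codon (ATG).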

Are all 2000bp regulatory sequences exactly 2000bp in length?
No, 2000bp is the maximum length of an upstream sequence. It can be shorter than this if either the scaffold is not long enough or the regulatory region runs into another gene.
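To make the effect of scaffold boundaries concrete, here is a minimal sketch of how an upstream window could be computed and clipped. It is not necessarily the rule used to build the ConGenIE files. The tab-separated input layout (gene ID, scaffold length, gene start, gene end, strand), the 1-based inclusive coordinates, and the function name `upstream_window` are all assumptions made for this example, and truncation at a neighbouring gene is not modelled.

```shell
#!/bin/sh
# Hypothetical helper. Input columns (tab-separated, assumed layout):
#   gene_id  scaffold_length  gene_start  gene_end  strand
# Output columns: gene_id  upstream_start  upstream_end  upstream_length
# The optional first argument is the maximum upstream length (default 2000).
upstream_window() {
    awk -F '\t' -v OFS='\t' -v max="${1:-2000}" '
        $5 == "+" { s = $3 - max; e = $3 - 1 }    # bases just before the gene start
        $5 == "-" { s = $4 + 1;   e = $4 + max }  # bases just after the gene end
        $5 != "+" && $5 != "-" { next }           # skip rows without a valid strand
        {
            if (s < 1)  s = 1                     # clip at the start of the scaffold
            if (e > $2) e = $2                    # clip at the end of the scaffold
            if (e < s) print $1, "NA", "NA", 0    # no upstream bases available
            else       print $1, s, e, e - s + 1
        }
    '
}

# Made-up genes: a full window, a clipped one, none at all, and a clipped minus-strand one.
printf 'geneA\t10000\t5001\t6000\t+\ngeneB\t10000\t501\t900\t+\ngeneC\t10000\t1\t300\t+\ngeneD\t10000\t200\t9500\t-\n' |
    upstream_window 2000
```

In this toy input, geneA gets the full 2000bp, geneB and geneD are cut short by the scaffold ends, and geneC has no upstream bases because it starts at position 1.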

These results can be reproduced using the following awk commands.
# Count the number of sequences in the FASTA file
gzip -dc Pabies-2000-upstream-regions.fa.gz | grep -c "^>"
# Count the sequences shorter than 2000bp (assumes each sequence is on a single line)
gzip -dc Pabies-2000-upstream-regions.fa.gz | awk '!/^>/ {next} {getline sequence} (length(sequence) < 2000) {print $0 ":" length(sequence) "\n" sequence}' | grep -c "^>"
# Count the sequences with a start codon (ATG) in the three bases immediately before the final base
awk '!/^>/ {next} {getline sequence} (substr(sequence, (length(sequence)-3), 3) == "ATG") {print $0}' sequence_file.fa | grep -c "^>"
# Extract the sequences for the given IDs (input_ids.txt)
cut -c 1- input_ids.txt | xargs -n 1 samtools faidx sequence_file.fa
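
If the FASTA file wraps sequences over several lines, a length count needs to add up all lines of a record first. The following is a hedged sketch of that idea; the function name `length_summary` is made up for this example, and it prints two numbers: the records shorter than the given maximum, then the records exactly that long.

```shell
#!/bin/sh
# Hypothetical helper: print "<shorter> <full>" for a FASTA stream, where
# <shorter> counts records below the maximum length and <full> counts
# records of exactly the maximum length. Wrapped sequences are supported.
length_summary() {
    max="$1"; shift
    awk -v max="$max" '
        function tally() {
            if (!seen) return                     # nothing read yet
            if (len < max) n_short++
            else if (len == max) n_full++
        }
        /^>/ { tally(); len = 0; seen = 1; next } # header: close previous record
        { len += length($0) }                     # add this line to the record length
        END { tally(); print n_short + 0, n_full + 0 }
    ' "$@"
}

# Made-up records: u1 is 8 bases over two lines, u2 is 3 bases.
printf '>u1\nACGT\nACGT\n>u2\nACG\n' | length_summary 8

# Possible use on the promoter file (not run here):
#   gzip -dc Pabies-2000-upstream-regions.fa.gz | length_summary 2000
```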

Please contact us if you have any further questions.