I do see a one-base snippet:

% echo -e "ND_123.1t74801t76802" | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801
ND_123.1    76801   76802

If we treat ND_123.1:74801-76802 as a zero-based-indexed interval, then it is 2001 bases long:

% calc '76802-74801'
    2001

It won't divide into 1kb windows neatly, so something is up with how this interval is generated.

One cheap way to fix this is to use -x with --chop <N>. This extra option filters out intervals that are not N bases long:

% echo -e "ND_123.1t74801t76802" | bedops --chop 1000 -x -
ND_123.1    74801   75801
ND_123.1    75801   76801

But that is hiding the problem. I wouldn't use this.

Another is to perhaps pre-process these intervals so that they are sized in 1k units:

% echo -e "ND_123.1t74801t76802" | awk -v FS="t" -v OFS="t" '{ print $1, $2, ($3 - 1) }' | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801

But that just hides the problem in a different way.

Another way would be to look at what the awk part of the one-liner is doing:

% gff2bed < genes.gff | awk -v FS="t" -v OFS="t" '($6 == "+"){ print $1, ($2 - 1), $2, $4, $5, $6; }'

The gff2bed script changes GFF's one-based index to zero-based in the BED-formatted output. So the above should be a correct way of getting a set of single-base, forward-stranded TSS intervals.

In any case, let's say our TSS of interest is at 0-indexed position ND_123.1:76801:76802. This TSS starts at position 76801 on contig `ND_123.1, is one base in length, and is oriented on the forward strand.

We want the 2kb-sized window upstream of this position, oriented on the forward strand.

There are two different sets of windows we can construct.

One window does not include the TSS interval:

% echo -e "ND_123.1t76801t76802" | bedops -u --range -2000:-1 -                       
ND_123.1    74801   76801

This is the solution you post in your question:

... | bedops --range -$WINDOW:-1 --everything - | ...

The other window includes the TSS interval:

% echo -e "ND_123.1t76801t76802" | bedops -u --range -1999:0 -                       
ND_123.1    74802   76802

Either way, both of these windows are 2kb long. Chopping either of these by 1kb increments gives exactly two 1kb windows:

% echo -e "ND_123.1t76801t76802" | bedops -u --range -1999:0 - | bedops --chop 1000 -
ND_123.1    74802   75802
ND_123.1    75802   76802

% echo -e "ND_123.1t76801t76802" | bedops -u --range -2000:-1 - | bedops --chop 1000 -
ND_123.1    74801   75801
ND_123.1    75801   76801

So which of these two solutions you pick depends upon whether you want the TSS interval in the output, or not. That will depend on what you're doing with these 1kb windows, probably.

Note that the --range operation will need to be flipped in the case of reverse-strand oriented TSSs. If we treat the example as reverse strand-oriented, then we might do either of the following:

% echo -e "ND_123.1t76801t76802" | bedops -u --range 1:2000 -    
ND_123.1    76802   78802

% echo -e "ND_123.1t76801t76802" | bedops -u --range 0:1999 - 
ND_123.1    76801   78801

Which parameters are used depends on whether you want to exclude or include the TSS, respectively.

Tl;dr: If what I posted in a previous answer about making specifically-sized windows is different from the above, then my previous answer is incorrect and should not be used. However, the answer that greyman refers to in his comment will include the TSS and genomic space both upstream and downstream of the TSS. It might be correct in that use case, such as for calculating aggregate signal over the TSS, where it can make sense to define 400 bases upstream, one base for the TSS, and 50 bases downstream. In that case, one might line up signal vectors and take the mean (or some other statistic), and it doesn't really matter in that case that there is an "extra" base. What matters here — for your use case — is that you need a precise size of windows for chopping up, so your solution will work.



Source link