I'd like to apply a pyranges function on each row of a pyranges object. I then return a row that contains that region, and another region (a so-called "hit") which meets the criteria, whatever that is.

For example, I start with a pyranges object called A, read in from a BED file called A.bed:

A = pyranges.read_bed("A.bed")

I merge it to another object called M.

M = A.merge()

I'd like to run/apply a function on each row of M, which intersects back with A.

For instance, I want to get the maximum-scoring interval in A, among those intervals in A which fall in a merged row in M. (Intervals in A are disjoint.)

Here is one approach I tried, which is very slow:

#!/usr/bin/env python

import sys
import pyranges as pr
import pandas as pd
import numpy as np

in_fn = sys.argv[1]

bed_data = pr.read_bed(in_fn)
bed_merged = bed_data.merge()

def max_scoring_element_over_merged_elements(df):
    new_df = pd.DataFrame()
    for i in range(len(df)):
        pd_row = pd.DataFrame(df.iloc[i]).T.reset_index(drop=True)
        pr_row = pr.from_dict(pd_row)
        candidates = bed_data.intersect(pr_row)
        max_score = np.max(candidates.Score)
        hit = candidates[(candidates.Score == max_score)].head(1) # grab one row, in case of ties
        hit.df.columns = ["Hit{}".format(x) for x in hit.df.columns]
        pd_row = pd.concat([pd_row, hit.df], axis=1)
        new_df = new_df.append(pd_row)
    return new_df

res = bed_merged.apply(max_scoring_element_over_merged_elements)
print(res)

This works on small files, but on anything like a typical input, this takes a very long time to run.

Using bedmap/bedops, for instance, this literally takes a fraction of a second to run on the command line:

$ bedmap --echo --max-element <(bedops --merge A.bed) A.bed > answer.bed

But the Python script above is using 100% CPU and takes at least longer than ten minutes (at which point I canceled it).

Is there a Pythonic/pyrange-ic/idiomatic way to efficiently do per-row operations with pyranges? Thanks!



Source link