Posted 2 hours ago by telroyjatter

I'm working with PLINK files and trying to build a matrix where rows are samples, columns are mutations, and values are minor-allele counts (i.e., 0, 1, or 2). My dataset has ~2,000 samples and ~6M SNPs. I ran --recodeAD on my .bed file to generate a .raw file and tried to import it with the Python package pandas, but I couldn't even read the first row: it has 6M columns and I ran out of memory. I then used --recode A-transpose to generate a .traw file. Importing the first 10 rows of that was easy (each row now has ~2,000 columns instead of 6M), but importing the full file is proving tedious. I looped over chunks using the chunksize parameter, but this is taking an incredibly long time to run.

I normally use the Parquet file format when I work with large datasets because it is stored column-wise, loads very fast, takes up less disk space, and uses less RAM. Is there a way to convert the .raw or .traw files into Parquet format without having to load the whole file into RAM?
