Greg informed me that the Perl scripts are there just for reference and he had to make many changes to it for the paper. The scripts he used for the paper are gone because it was stored on scratch space. So I decided to dump the Perl scripts and start from scratch. According to the paper, "CG and CHG DMRs were required to have a minimum of 3 symmetrical CG or CHG sites, at least 2X coverage, and a minimum methylation difference between two genotypes of 60%. CHH sequence context DMRs were required to have a minimum of 6 asymmetrical CHH sites, a minimum of 2X coverage, and one genotype with <5% methylation and the other genotype with >25% methylation." Luckily for me, the tab file output from the zed-align has all of these information. The tab file contains the number of CG, number of Methylated CG, CG ratio, etc. With these information, I can just parse the file line by line to get their methylation. Then, I can compare it with the methylation of another tab file to determine if the region is differentally methylated. After a while, I completed the script. To my surprise, my script uses only 3.2 MB of ram and finish parsing through 2 1.5 GB tab files in about 7 minutes.
|