This is a simple feature mapper. It is designed using information currently available in the Parquet file.
Each position has the following information:
69033640 11 false
position=14521 referenceIndex=0 isMutated=false
sample 0 counts=[10, 0, 0, 63, 0, 4, 0, 1, 62, 0]
sample 1 counts=[2, 0, 0, 45, 0, 3, 0, 0, 68, 0]
position=14521 referenceIndex=0 isMutated=true
sample 0 counts=[10, 0, 0, 63, 0, 4, 0, 1, 62, 0]
sample 1 counts=[2, 11, 0, 34, 0, 3, 12, 0, 56, 0]
Using these data, we can map to features as follows:
- Make the isMutated boolean the only label. This is not ideal: the net may tell us whether the base is mutated,
but we will not know what the mutation is.
- Concatenate the count integers and use these as features. The only way for the net to learn from these data is to count
the number of counts elements that have "enough" reads to call a genotype. If more than two genotypes are called, then
the site is likely mutated, because most sites will be at most heterozygous. With these features, I expect the
net to have more trouble predicting mutations at homozygous sites than at heterozygous sites. We'll see.
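The mapping described above can be sketched roughly as follows. This is an illustration only, not the mapper's actual code: the function names map_features, map_label and supported_elements are hypothetical, and so is the MIN_READS threshold, since no value for "enough" reads is given here. The counts are taken from the mutated example record shown earlier.

```python
from typing import List

# Hypothetical threshold for "enough" reads; no value is specified in the text.
MIN_READS = 10


def map_features(sample_counts: List[List[int]]) -> List[float]:
    """Concatenate the per-sample count vectors into one flat feature vector."""
    return [float(c) for counts in sample_counts for c in counts]


def map_label(is_mutated: bool) -> float:
    """Use the isMutated boolean as the only label (1.0 = mutated)."""
    return 1.0 if is_mutated else 0.0


def supported_elements(counts: List[int], min_reads: int = MIN_READS) -> int:
    """Count how many elements of a counts vector reach the read threshold.

    This is the quantity the net would have to learn implicitly from the
    concatenated counts.
    """
    return sum(1 for c in counts if c >= min_reads)


# Counts from the example record with isMutated=true (position=14521).
mutated_site = [
    [10, 0, 0, 63, 0, 4, 0, 1, 62, 0],    # sample 0
    [2, 11, 0, 34, 0, 3, 12, 0, 56, 0],   # sample 1
]

features = map_features(mutated_site)  # 20 values: sample 0, then sample 1
label = map_label(True)
per_sample_support = [supported_elements(c) for c in mutated_site]
```

With this assumed threshold, sample 1 has more supported elements in the mutated record than in the non-mutated one, which is the kind of signal the net would need to pick up from the raw counts.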
Created by fac2003 on 5/21/16.