|
|
 |
 |
 |
 |
Fortran Programming Language
|
 |
 |
 |
 |
 |
 |
 |
 |
Help a newbie develop a program
Hi everybody, I was asked to learn FORTRAN as part of my research job. My boss would like me to write a program that models the breakdown of a long linear molecule. Here's the idea: The molecule is a lot like single stranded DNA but has more components. Let's assume I've assigned the components arbitrary numbers. The molecule would look like this: 1 2 3 1 2 3 1 2 4 2 4 1 2 1 2... Then, I need to "break" the chain between certain pairs of numbers, say 1 and 2, resulting in fragments: 1 231 231 24241 21 I then need to group identical fragments together and, ideally, sort them by length, to get an output that looks like this: # Frags Chain 1 1 1 21 2 231 1 2421 I have already written program to read in the chain, break it, and read out the fragments. I haven't figured out a nice way to compare the fragments to one another. I think my program is pretty hack-y, though, and does not make efficient use of FORTRAN. Does anyone have any ideas on what sorts of data structures I might use to make the comparison of fragments doable? The program needs to be able to handle different sized (input) chains and various numbers of fragments and fragment lengths (output). I've accomplished this--pooryly--using allocatable arrays. I'd be happy to share my program or give more details. Thanks! --Joel
joel.berna @gmail.com wrote: > The molecule is a lot like single stranded DNA but has more > components. Let's assume I've assigned the components arbitrary > numbers. The molecule would look like this: > 1 2 3 1 2 3 1 2 4 2 4 1 2 1 2... > Then, I need to "break" the chain between certain pairs of numbers, > say 1 and 2, resulting in fragments: (snip) > I then need to group identical fragments together and, ideally, sort > them by length, to get an output that looks like this:
Look for literature on suffix trees or suffix arrays. Much of the work on these problems is done in C, though it shouldn't be so hard to do in Fortran. There are journals for computational molecular biology where these tend to be published, and also in computer science where they are the favorite examples for many algorithms. -- glen
<joel.berna @gmail.com> wrote in message news:1181154994.371601.292490@r19g2000prf.googlegroups.com... This snippet would work for reasonably small strings, but may not generalise well. Of course, this is just to demonstrate the count intrinsic. character(len=5) :: str(4) = (/ '1 ', '231 ', '24241', '21' /), inp(5) open(10, file ='temp.txt') read(10, '(a)') inp do i = 1, 4 print *, count(str(i) == inp) end do end Regards, Mike Metcalf
joel.berna @gmail.com wrote: | Hi everybody, | | I was asked to learn FORTRAN as part of my research job. My boss | would like me to write a program that models the breakdown of a long | linear molecule. Here's the idea: | | The molecule is a lot like single stranded DNA but has more | components. Let's assume I've assigned the components arbitrary | numbers. The molecule would look like this: | | 1 2 3 1 2 3 1 2 4 2 4 1 2 1 2... ======================= NOTE: halfway through compiling this post, I noticed the additional requirement of handling duplicate chains, but I'm now short of time to consolidate it. Still, apart from incoherence, I hope you'll find some valid points therein, but please triple-check it. ======================= I would tackle the problem in the following way (I'll stick to "plain-vanilla" style; it can be also tackled with an arrays of TYPEs with pointer components, but let's keep it simple): * read the contents into an array A of (say) integers (characters will also be fine) * declare two "friend" arrays: ** IND, containing the initial index within A of each subsequence ** LEN, containing the length of each sequence * find the IRANK (i.e. sorting order, without actual resorting) of LEN array. I was about to suggest Michel Olagnon's MRGRNK within ORDERPACK: http://www.fortran-2000.com/rank/index.html#3.0 but I spotted the additional requirement of counting the number of same sequences. Thus, an additional logic is called for. * according to IRANK, print out the sorted contents of A Note that technique uses zero (0) rearrangements of the original array A. Some (pseudo) code: INTEGER, ALLOCATABLE:: A(:), ind(:), len(:), irank(:) INTEGER, ALLOCATABLE:: iCount(:) !# of occurrences of !same sequences INTEGER:: nSeq !number of obtained sequences INTEGER:: nMolecules !number of molecules within A ALLOCATE(A(nMolecules)) !Worst case size (and then some): ALLOCATE(ind(SIZE(A)), len(SIZE(A)), irank(SIZE(A))) !Read A here !Find the ind and len arrays molecule1 = 1 molecule2 = 2 nSeq = 0 !I'll probably miss the exact intended logic of !breaking here, but you get the idea: DO i=1,nMolecules IF (A(i).eq.molecule1) THEN nSeq=nSeq+1 ind(nSeq) =i ELSE IF (A(i).eq.molecule2) THEN len(nSeq) = i-ind(nSeq) END IF END DO !Rank the len and ind arrays iRank=(/(i, i=1,nSeq)/) !THIS IS POOR MAN'S SELECTION SORT ALGO; !BETTER FIND SOMETHING MORE EFFICIENT DO i=1,nSeq lenMax = i DO j=i+1,nSeq IF (len(j).GT.len(lenMax)) THEN lenMax = j ELSE IF (len(j).EQ.len(lenMax)) !For same lengths, compare also the contents of chain: DO k=0, len(iRank(j))-1 SELECT CASE(A(ind(iRank(j)+k) - A(ind(iRank(lenMax)+k)) CASE(1:) lenMax = j EXIT CASE(:-1) EXIT END SELECT END DO END IF END DO !Swap the ranks iTemp = iRank(i) iRank(i) = iRank(lenMax) iRank(lenMax) = iRank(i) END DO !Print the sequences DO i=1,nSeq WRITE(...) iCount(i), A(ind(iRank(i)):ind(iRank(i))+len(iRank(i))-1) END DO -- Jugoslav ___________ www.xeffort.com Please reply to the newsgroup. You can find my real e-mail on my home page above.
> The program needs to be able to handle different sized (input) chains > and various numbers of fragments and fragment lengths (output). I've > accomplished this--pooryly--using allocatable arrays. I'd be happy to > share my program or give more details.
I assume you have now an array with all the fragments, which i call all_frags. I would construct two new arrays; one with all the fragments occurring only once (unique_frags), and one with the number of times they occurred (occurrence). Then I would sort them. I bet you can do that in one go, but I find it usually easier to understand to chop up the big algorithm into smaller (sequential) pieces. program get_frags integer :: all_frags(5), unique_frags(5), occurrence(5) logical :: accounted(5) integer :: i,n all_frags = [1,231,231,24241,21] accounted = .false. ! keeps track of which frag in all_frags has been counted yet n = 0 ! total number of unique fragments do i=1,size(all_frags) if (accounted(i)) cycle n = n+1 unique_frags(n) = all_frags(i) occurrence(n) = count( (all_frags==all_frags(i)) ) where ( ((all_frags==all_frags(i)).or.(accounted)) ) accounted=.true. enddo end program unique_frags(1:n) contains all the unique fragments and occurrence(1:n) its number of counts. Note that n<size(all_frags), but that those two arrays are of length size(all_frags). You might want to copy them into arrays of size n and deallocate the former 2. They are not sorted yet, but have a look at numerical recipes for sort algorithms. Arno
Arno, Set up a binary tree (see Chapman, for example, or do a search on the web) using the fragment as the key and the number of occurrences as the data field. Should you ever want to count occurrences of neighbours at each end, you could make fields that had binary trees of nearest neighbours and so on. Paul
"Jugoslav Dujic" <jdu @yahoo.com> wrote in message news:5cq5j7F30tuq9U1@mid.individual.net... > joel.berna @gmail.com wrote: > !I'll probably miss the exact intended logic of > !breaking here, but you get the idea: Here you're looking for the beginning and end of each sequence. It would also be possible to look for the 'breakpoints' using the INDEX intrinsic. Something like: start = 1 do ind = index(string(start:), substring) snippet(i) = string(start:ind) start = ind + 1 The last snippet becomes a special case. Regards, Mike Metcalf
Jugoslave, Thanks. I modeled a good chunk of the program on this outline and was succesful. The program is able to compare fragments; identify, count, and eliminate duplicates; and sort the final fragment list by length. I am now trying to make the "cutting" part of the program more efficient. In reality, there are about 20 possible elements in the chain, and any given enzyme cuts between 10 different pairs (say 1-1, 1-2, 2-2, 2-4, 2-7, etc.). Example: 1 1 2 2 3 3 1 2 1 2 Enzyme 1 cuts to produce... 1 1 2 3: 1 2331 -----> 3: 2 2 1: 2331 1 2 The chain is stored in an array and I use a slew of IF THEN statements to step through the array looking for pairs to cut. Do you (or does anyone) know of a better way to to look through the chain for pairs? Thanks again, Joel
In a previous article, joel.berna@gmail.com wrote:
>Jugoslave, >Thanks. I modeled a good chunk of the program on this outline and was >succesful. The program is able to compare fragments; identify, count, >and eliminate duplicates; and sort the final fragment list by length. >I am now trying to make the "cutting" part of the program more >efficient. In reality, there are about 20 possible elements in the >chain, and any given enzyme cuts between 10 different pairs (say 1-1, >1-2, 2-2, 2-4, 2-7, etc.). >Example: >1 1 2 2 3 3 1 2 1 2 >Enzyme 1 cuts to produce... >1 >1 >2 3: 1 >2331 -----> 3: 2 >2 1: 2331 >1 >2 >The chain is stored in an array and I use a slew of IF THEN statements >to step through the array looking for pairs to cut. Do you (or does >anyone) know of a better way to to look through the chain for pairs? >Thanks again, >Joel
Re-ordering "alphabetically" first makes it easy (if I understand what you're doing). Chris
joel.berna @gmail.com wrote: | Jugoslave, | | Thanks. I modeled a good chunk of the program on this outline and was | succesful. The program is able to compare fragments; identify, count, | and eliminate duplicates; and sort the final fragment list by length. | | I am now trying to make the "cutting" part of the program more | efficient. In reality, there are about 20 possible elements in the | chain, and any given enzyme cuts between 10 different pairs (say 1-1, | 1-2, 2-2, 2-4, 2-7, etc.). | | Example: | | 1 1 2 2 3 3 1 2 1 2 | | Enzyme 1 cuts to produce... | | 1 | 1 | 2 3: 1 | 2331 -----> 3: 2 | 2 1: 2331 | 1 | 2 | | The chain is stored in an array and I use a slew of IF THEN statements | to step through the array looking for pairs to cut. Do you (or does | anyone) know of a better way to to look through the chain for pairs? I must confess that I still don't fully understand the logic of chain breaking, especially the terminal conditions. Are the said pairs "symmetrical"? (I.e. does 1-2 mean that 2-1 also breaks the chain?) For example, if the breaking pairs are 1-1 and 1-2, does 132312 break into * 132 | 3 | 12 * 13 | 23 | 12 * something else? Is there a possibility that some array members don't belong to any of subchains? (perhaps it would be easier to post your relevent working code fragment than to explain it in words). In any case, I would go with something along these lines (disclaimer from my previous post still holds :-) ): !Of course, it's a good idea to read these from an input file: integer, parameter:: nPairs=5 !I assumed that iBegin and iEnd are sorted. Zero-termination !is for later convenience: integer:: iBegin(nPairs+1) = (/1,1,2,2,2,0/) integer:: iEnd(nPairs+1) = (/1,2,2,4,7,0/) jPairBegin = 0 !marker of first occurrence of !found pair within iBegin Array jPairEnd = 0 !marker of last occurrence of DO i=1,nMolecules IF (jPairBegin.ne.0) THEN IF (ANY(A(i).eq.iEnd(jPairBegin:jPairEnd)) THEN !Chain ends len(nSeq) = i-ind(nSeq) jPairBegin = 0 !should we reset now??? END IF END IF kPairBegin = 0 DO j=1,nPairs !Scan the iBegin array to find possible terminations IF (A(i).eq.iBegin(j)) THEN !Chain begins. For A(i)=2, !kPairBegin will get value 3 !(first occurence of "2" within iBegin is at j=3) nSeq = nSeq + 1 ind(nSeq) = i kPairBegin = j ELSE IF (kPairBegin.NE.0) THEN !Find last occurrence of "2" within iBegin. !Here's where zero-termination comes into play jPairEnd = j-1 EXIT END IF END DO IF (kPairBegin.NE.0) jPairBegin = kPairBegin END DO -- Jugoslav ___________ www.xeffort.com Please reply to the newsgroup. You can find my real e-mail on my home page above.
joel.berna @gmail.com wrote: > Jugoslave, > Thanks. I modeled a good chunk of the program on this outline and was > succesful. The program is able to compare fragments; identify, count, > and eliminate duplicates; and sort the final fragment list by length. > I am now trying to make the "cutting" part of the program more > efficient. In reality, there are about 20 possible elements in the > chain, and any given enzyme cuts between 10 different pairs (say 1-1, > 1-2, 2-2, 2-4, 2-7, etc.). > Example: > 1 1 2 2 3 3 1 2 1 2 > Enzyme 1 cuts to produce... > 1 > 1 > 2 3: 1 > 2331 -----> 3: 2 > 2 1: 2331 > 1 > 2 > The chain is stored in an array and I use a slew of IF THEN statements > to step through the array looking for pairs to cut. Do you (or does > anyone) know of a better way to to look through the chain for pairs?
Use a logical matrix to tell whether the current pair should be cut or not. You would then have a single IF THEN test at each step. Something like: cuts = .false. cuts(1,2) = .true. cuts(2,2) = .true. ... do i = 1, Length-1 if (cuts(i, i+1)) then ! a fragment ends at i and another starts at i+1 ... endif enddo
|
 |
 |
 |
 |
|