Home     |     .Net Programming    |     cSharp Home    |     Sql Server Home    |     Javascript / Client Side Development     |     Ajax Programming

Ruby on Rails Development     |     Perl Programming     |     C Programming Language     |     C++ Programming     |     IT Jobs

Python Programming Language     |     Laptop Suggestions?    |     TCL Scripting     |     Fortran Programming     |     Scheme Programming Language


 
 
Cervo Technologies
The Right Source to Outsource

MS Dynamics CRM 3.0

Fortran Programming Language

Help a newbie develop a program


Hi everybody,

  I was asked to learn FORTRAN as part of my research job.  My boss
would like me to write a program that models the breakdown of a long
linear molecule.  Here's the idea:

The molecule is a lot like single stranded DNA but has more
components.  Let's assume I've assigned the components arbitrary
numbers.  The molecule would look like this:

1 2 3 1 2 3 1 2 4 2 4 1 2 1 2...

Then, I need to "break" the chain between certain pairs of numbers,
say 1 and 2, resulting in fragments:

1
231
231
24241
21

I then need to group identical fragments together and, ideally, sort
them by length, to get an output that looks like this:

# Frags      Chain
1                 1
1                 21
2                 231
1                2421

I have already written program to read in the chain, break it, and
read out the fragments.  I haven't figured out a nice way to compare
the fragments to one another.  I think my program is pretty hack-y,
though, and does not make efficient use of FORTRAN.  Does anyone have
any ideas on what sorts of data structures I might use to make the
comparison of fragments doable?

The program needs to be able to handle different sized (input) chains
and various numbers of fragments and fragment lengths (output).  I've
accomplished this--pooryly--using allocatable arrays.  I'd be happy to
share my program or give more details.

Thanks!  --Joel

joel.berna@gmail.com wrote:
> The molecule is a lot like single stranded DNA but has more
> components.  Let's assume I've assigned the components arbitrary
> numbers.  The molecule would look like this:
> 1 2 3 1 2 3 1 2 4 2 4 1 2 1 2...
> Then, I need to "break" the chain between certain pairs of numbers,
> say 1 and 2, resulting in fragments:

(snip)

> I then need to group identical fragments together and, ideally, sort
> them by length, to get an output that looks like this:

Look for literature on suffix trees or suffix arrays.

Much of the work on these problems is done in C, though it
shouldn't be so hard to do in Fortran.  There are journals
for computational molecular biology where these tend to
be published, and also in computer science where they are
the favorite examples for many algorithms.

-- glen

<joel.berna@gmail.com> wrote in message

news:1181154994.371601.292490@r19g2000prf.googlegroups.com...

This snippet would work for reasonably small strings, but may not generalise
well. Of course, this is just to demonstrate the count intrinsic.

character(len=5) :: str(4) = (/ '1    ', '231  ', '24241', '21' /), inp(5)

open(10, file ='temp.txt')
read(10, '(a)') inp
do i = 1, 4
 print *, count(str(i) == inp)
end do
end

Regards,

Mike Metcalf

joel.berna@gmail.com wrote:

| Hi everybody,
|
|  I was asked to learn FORTRAN as part of my research job.  My boss
| would like me to write a program that models the breakdown of a long
| linear molecule.  Here's the idea:
|
| The molecule is a lot like single stranded DNA but has more
| components.  Let's assume I've assigned the components arbitrary
| numbers.  The molecule would look like this:
|
| 1 2 3 1 2 3 1 2 4 2 4 1 2 1 2...

=======================
NOTE: halfway through compiling this post, I noticed the additional
requirement of handling duplicate chains, but I'm now short of time to
consolidate it. Still, apart from incoherence, I hope you'll find
some valid points therein, but please triple-check it.
=======================

I would tackle the problem in the following way (I'll stick
to "plain-vanilla" style; it can be also tackled with an
arrays of TYPEs with pointer components, but let's keep it simple):

* read the contents into an array A of (say) integers
  (characters will also be fine)
* declare two "friend" arrays:
  ** IND, containing the initial index within A of each
     subsequence
  ** LEN, containing the length of each sequence
* find the IRANK (i.e. sorting order, without actual resorting)
  of LEN array. I was about to suggest Michel Olagnon's MRGRNK
  within ORDERPACK:
  http://www.fortran-2000.com/rank/index.html#3.0
  but I spotted the additional requirement of counting the
  number of same sequences. Thus, an additional logic
  is called for.
* according to IRANK, print out the sorted contents of A

Note that technique uses zero (0) rearrangements of the
original array A.

Some (pseudo) code:

INTEGER, ALLOCATABLE:: A(:), ind(:), len(:), irank(:)
INTEGER, ALLOCATABLE:: iCount(:) !# of occurrences of
                                 !same sequences
INTEGER:: nSeq !number of obtained sequences
INTEGER:: nMolecules  !number of molecules within A

ALLOCATE(A(nMolecules))
!Worst case size (and then some):
ALLOCATE(ind(SIZE(A)), len(SIZE(A)), irank(SIZE(A)))

!Read A here

!Find the ind and len arrays
molecule1 = 1
molecule2 = 2
nSeq = 0
!I'll probably miss the exact intended logic of
!breaking here, but you get the idea:
DO i=1,nMolecules
    IF (A(i).eq.molecule1) THEN
       nSeq=nSeq+1
       ind(nSeq) =i
    ELSE IF (A(i).eq.molecule2) THEN
       len(nSeq) = i-ind(nSeq)
    END IF
END DO

!Rank the len and ind arrays
iRank=(/(i, i=1,nSeq)/)
!THIS IS POOR MAN'S SELECTION SORT ALGO;
!BETTER FIND SOMETHING MORE EFFICIENT
DO i=1,nSeq
   lenMax = i
   DO j=i+1,nSeq
      IF (len(j).GT.len(lenMax)) THEN
         lenMax = j
      ELSE IF (len(j).EQ.len(lenMax))
         !For same lengths, compare also the contents of chain:
         DO k=0, len(iRank(j))-1
             SELECT CASE(A(ind(iRank(j)+k) - A(ind(iRank(lenMax)+k))
             CASE(1:)
                 lenMax = j
                 EXIT
             CASE(:-1)
                 EXIT
             END SELECT
         END DO
      END IF
   END DO
   !Swap the ranks
   iTemp = iRank(i)
   iRank(i) = iRank(lenMax)
   iRank(lenMax) = iRank(i)
END DO

!Print the sequences
DO i=1,nSeq
   WRITE(...) iCount(i), A(ind(iRank(i)):ind(iRank(i))+len(iRank(i))-1)
END DO

--
 Jugoslav
___________
www.xeffort.com

Please reply to the newsgroup.
You can find my real e-mail on my home page above.

> The program needs to be able to handle different sized (input) chains
> and various numbers of fragments and fragment lengths (output).  I've
> accomplished this--pooryly--using allocatable arrays.  I'd be happy to
> share my program or give more details.

I assume you have now an array with all the fragments, which i call
all_frags. I would construct two new arrays; one with all the
fragments occurring only once (unique_frags), and one with the number
of times they occurred (occurrence). Then I would sort them. I bet you
can do that in one go, but I find it usually easier to understand to
chop up the big algorithm into smaller (sequential) pieces.

program get_frags

integer :: all_frags(5), unique_frags(5), occurrence(5)
logical :: accounted(5)
integer :: i,n

all_frags = [1,231,231,24241,21]
accounted = .false. ! keeps track of which frag in all_frags has been
counted yet
n = 0               ! total number of unique fragments

do i=1,size(all_frags)
  if (accounted(i)) cycle
  n = n+1
  unique_frags(n) = all_frags(i)
  occurrence(n) = count( (all_frags==all_frags(i)) )
  where ( ((all_frags==all_frags(i)).or.(accounted)) )
accounted=.true.
enddo

end program

unique_frags(1:n) contains all the unique fragments and
occurrence(1:n) its number of counts. Note that n<size(all_frags), but
that those two arrays are of length size(all_frags). You might want to
copy them into arrays of size n and deallocate the former 2.

They are not sorted yet, but have a look at numerical recipes for sort
algorithms.

Arno

Arno,

Set up a binary tree (see Chapman, for example, or do a search on the
web) using the fragment as the key and the number of occurrences as
the data field.  Should you ever want to count occurrences of
neighbours at each end, you could make fields that had binary trees of
nearest neighbours and so on.

Paul

"Jugoslav Dujic" <jdu@yahoo.com> wrote in message

news:5cq5j7F30tuq9U1@mid.individual.net...

> joel.berna@gmail.com wrote:
> !I'll probably miss the exact intended logic of
> !breaking here, but you get the idea:

Here you're looking for the beginning and end of each sequence. It would
also be possible to look for the 'breakpoints' using the INDEX intrinsic.
Something like:

start = 1
do
   ind  = index(string(start:), substring)
   snippet(i) = string(start:ind)
   start = ind + 1

The last snippet becomes a special case.

Regards,

Mike Metcalf

Jugoslave,

Thanks. I modeled a good chunk of the program on this outline and was
succesful.  The program is able to compare fragments; identify, count,
and eliminate duplicates; and sort the final fragment list by length.

I am now trying to make the "cutting" part of the program more
efficient.  In reality, there are about 20 possible elements in the
chain, and any given enzyme cuts between 10 different pairs (say 1-1,
1-2, 2-2, 2-4, 2-7, etc.).

Example:

1 1 2 2 3 3 1 2 1 2

Enzyme 1 cuts to produce...

1
1
2             3:  1
2331 ----->   3:  2
2             1:  2331
1
2

The chain is stored in an array and I use a slew of IF THEN statements
to step through the array looking for pairs to cut.  Do you (or does
anyone) know of a better way to to look through the chain for pairs?

Thanks again,

Joel

In a previous article, joel.berna@gmail.com wrote:

  Re-ordering "alphabetically" first makes it easy (if I understand
what you're doing).

Chris

joel.berna@gmail.com wrote:

| Jugoslave,
|
| Thanks. I modeled a good chunk of the program on this outline and was
| succesful.  The program is able to compare fragments; identify, count,
| and eliminate duplicates; and sort the final fragment list by length.
|
| I am now trying to make the "cutting" part of the program more
| efficient.  In reality, there are about 20 possible elements in the
| chain, and any given enzyme cuts between 10 different pairs (say 1-1,
| 1-2, 2-2, 2-4, 2-7, etc.).
|
| Example:
|
| 1 1 2 2 3 3 1 2 1 2
|
| Enzyme 1 cuts to produce...
|
| 1
| 1
| 2             3:  1
| 2331 ----->   3:  2
| 2             1:  2331
| 1
| 2
|
| The chain is stored in an array and I use a slew of IF THEN statements
| to step through the array looking for pairs to cut.  Do you (or does
| anyone) know of a better way to to look through the chain for pairs?

I must confess that I still don't fully understand the logic of chain
breaking, especially the terminal conditions. Are the said pairs
"symmetrical"? (I.e. does 1-2 mean that 2-1 also breaks the chain?)
For example, if the breaking pairs are 1-1 and 1-2, does 132312 break into
* 132 | 3 | 12
* 13 | 23 | 12
* something else?
Is there a possibility that some array members don't belong to any of
subchains?

(perhaps it would be easier to post your relevent working code fragment
than to explain it in words).

In any case, I would go with something along these lines (disclaimer
from my previous post still holds :-) ):

!Of course, it's a good idea to read these from an input file:
integer, parameter:: nPairs=5
!I assumed that iBegin and iEnd are sorted. Zero-termination
!is for later convenience:
integer:: iBegin(nPairs+1) = (/1,1,2,2,2,0/)
integer:: iEnd(nPairs+1) = (/1,2,2,4,7,0/)

jPairBegin = 0 !marker of first occurrence of
               !found pair within iBegin Array
jPairEnd = 0   !marker of last occurrence of

DO i=1,nMolecules
    IF (jPairBegin.ne.0) THEN
       IF (ANY(A(i).eq.iEnd(jPairBegin:jPairEnd)) THEN
          !Chain ends
          len(nSeq) = i-ind(nSeq)
          jPairBegin = 0 !should we reset now???
       END IF
    END IF

    kPairBegin = 0
    DO j=1,nPairs
        !Scan the iBegin array to find possible terminations
        IF (A(i).eq.iBegin(j)) THEN
            !Chain begins. For A(i)=2,
            !kPairBegin will get value 3
            !(first occurence of "2" within iBegin is at j=3)
            nSeq = nSeq + 1
            ind(nSeq) = i
            kPairBegin = j
        ELSE IF (kPairBegin.NE.0) THEN
            !Find last occurrence of "2" within iBegin.
            !Here's where zero-termination comes into play
            jPairEnd = j-1
            EXIT
        END IF
    END DO
    IF (kPairBegin.NE.0) jPairBegin = kPairBegin

END DO

--
 Jugoslav
___________
www.xeffort.com

Please reply to the newsgroup.
You can find my real e-mail on my home page above.

Use a logical matrix to tell whether the current pair should be cut
or not. You would then have a single IF THEN test at each step.

Something like:
cuts = .false.
cuts(1,2) = .true.
cuts(2,2) = .true.
...
do i = 1, Length-1
   if (cuts(i, i+1)) then
! a fragment ends at i and another starts at i+1
   ...
   endif
enddo

Add to del.icio.us | Digg this | Stumble it | Powered by Megasolutions Inc