Home     |     .Net Programming    |     cSharp Home    |     Sql Server Home    |     Javascript / Client Side Development     |     Ajax Programming

Ruby on Rails Development     |     Perl Programming     |     C Programming Language     |     C++ Programming     |     IT Jobs

Python Programming Language     |     Laptop Suggestions?    |     TCL Scripting     |     Fortran Programming     |     Scheme Programming Language


 
 
Cervo Technologies
The Right Source to Outsource

MS Dynamics CRM 3.0

Perl Programming Language

Hashed to death... help needed.


First of all, hello!
I've been reading the group for quite a while now, but this is the first
time I really need help. This is my second attempt at posting to this
particular server, my first post has gone missing, hope it won't double it.

To summarize things: I'm building a parser which has to consolidate data
based on variables contained in an array.
The source file contains a set of tab-separated-values, and those are
parsed out into an array which contains
pdbID | resNum | resID | secstructID, these are then consolidated into a
file which should contain:
pdbID | startRes | endRes | secstructID

source array with the data for consolidation:
1b6g     1      M    \N
1b6g     2      V    \N
1b6g     3      N    \N
1b6g     4      N    H
1b6g     5      N    H
1b6g     6      N    \N
3hba     7      W    H
2cdg     8      N     H
2cdg     9      V     \N
2cdg     10    M    \N
2cdg     11    A    B
2cdg     12    M    \N

expected result after consolidation, should be:
1b6g    1    3     \N
1b6g    4    5     H
1b6g    6    6     \N
3hba    7   7      H
2cdg    5    6     H
2cdg    7    7     \N
2cdg    8    8     H
2cdg    9   10    H
2cdg    11 11    B
2cdg    12 12    \N

As you can see each pdbID is assigned a secStructuID in a sequential manner
and any interruptions are to be considered as points from which the
assignment starts new.
Each pdbID can thus have multiple occurences of for example \N in different
places of the sequence and they are differentiated by the startRes and
endRes values.
All is wonderful and I have a working code which consolidates the data,
unfortunately it doesn't recognize the occurence of the new secstructID
automatically as the end of the previous one rather it finds the last
possible in the whole sequence for one pdbID and considers that as the end.

and so my result is incorrectly displayed as:
1b6g 4 5 H
1b6g 1 6 \N  ---- error here - this should be in fact two separate
"entities" because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (7 and 8 should break this into two)
2cdg 11 11 B

And here's my code:

---------------------------------------------
#!/usr/bin/perl -w
use strict;
use warnings;

#   --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# SecStructList.txt
# each protein is described by secondary structures with a
# pdbID, 2ry structureID (char or \N'), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# usage: secStructList.txt to populate the SecStructure entity
#   --------------------------------------------------------------

#Read arguments, print error message if insufficient
if ($#ARGV<0)
{
 die("\n\nUsage:  sstruct.pl [residue_table_file.txt]\n\n");

}

my $filename = $ARGV[0];

#if either file not found return error message
if (! -e "$filename")
{
 die("\n\nresidue file $filename does not exist!\n\n");

}

 # Read residue.txt file, extracting the data of interest - only
 # pdb id, resNum, resID, secondaryStructID

#First read file, storing each line in an array 'dssplines' splitting the
data
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
 my @dssplines= split(/\r/, <MYFILE>);
 my $arraySize=@dssplines;
close(MYFILE);
#read one line from the originally loaded array dssplines at a time and loop
#over it splitting the values using the tabs
 my @dsspdata;
 my $dsspdataSize=@dssplines;

 my $n=0;
 for (my $i=0; $i < $arraySize; $i++)
 {
  #each line from the array goes into a new dsspline variable
  my $dsspline = $dssplines[$i];
  for (my $j = 0; $j <=4; $j++)
  {
   #each time values inside are separated using the tabs
   my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
$activesite) = split(/\t/, $dsspline);
   # now each value of interest is stored into a new array @dsspdata
   $dsspdata[$n][0] = $pdbID;
   $dsspdata[$n][1] = $resNo;
   $dsspdata[$n][2] = $resID;
   $dsspdata[$n][3] = $secStruct;
  }
  $n++;
 }

#my @dsspdata array is now perfect to reformat into a hash analyzing the
value correlation

 #initialize the hash and counter
 my %dane;
 my $k=0;
 #loop around the dsspdata array
 for (my $i=0; $i < $dsspdataSize; $i++)
 {
  #split each cell in a row into variables for the hash
  for (my $k = 0; $k <=4; $k++)
  {
   my $pdb = $dsspdata[$i][0];
   my $residueNum = $dsspdata[$i][1];
   my $secStructure = $dsspdata[$i][3];
   push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
  }
  $k++;
 }

 #now for each pdbID using the hash keys
 foreach my $pdbID ( keys %dane )
 {
  #check the secondary structure id with pdbID as a key (only if the pdbID
is the same will the values be stored)
  foreach my $secID ( keys %{ $dane{$pdbID} } )
  {
   #finally create an array of residue numbers
   my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );
   #create a new file with the secondary structures list
   open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
   #append each line to the new file with tab separated data
   print SStruc ("$pdbID \t @resnums \t $secID\n");
  }
 }
close(SStruc);

If anyone has an idea how to deal with this I would be very grateful for
any suggestions.

Cheers,
Matt

In article <f325k9$vm@news.onet.pl>, Teq <zro@poczta.onet.pl>
wrote:

> First of all, hello!
> I've been reading the group for quite a while now, but this is the first
> time I really need help. This is my second attempt at posting to this
> particular server, my first post has gone missing, hope it won't double it.

Have you taken the time to read the guidelines for posting to this
group?

[other comments snipped]

> And here's my code:

You should minimize your program as much as possible before posting
(details below).

We don't need to see these comments. Although good practice for actual
development, the people reading this post don't really care what the
program does, just how it works.

You can eliminate all of the above lines by using the special <DATA>
filehandle (see posting guidelines).

>  my @dssplines= split(/\r/, <MYFILE>);

There is no reason to read in the whole file if you are going to
process the lines one at a time.

>  my $arraySize=@dssplines;
> close(MYFILE);
> #read one line from the originally loaded array dssplines at a time and loop
> #over it splitting the values using the tabs
>  my @dsspdata;
>  my $dsspdataSize=@dssplines;

>  my $n=0;
>  for (my $i=0; $i < $arraySize; $i++)
>  {
>   #each line from the array goes into a new dsspline variable
>   my $dsspline = $dssplines[$i];
>   for (my $j = 0; $j <=4; $j++)
>   {

What is the purpose of this loop? You are doing the same things 5 times.

>    #each time values inside are separated using the tabs
>    my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
> $activesite) = split(/\t/, $dsspline);

So your file actually contains at least 9 fields, instead of the 4 you
show above?

>    # now each value of interest is stored into a new array @dsspdata
>    $dsspdata[$n][0] = $pdbID;
>    $dsspdata[$n][1] = $resNo;
>    $dsspdata[$n][2] = $resID;
>    $dsspdata[$n][3] = $secStruct;

$n will always have the same value as $i. Why didn't you just use $i
here.

You can simplify your code by not using numerical indexes, but instead
use Perl's native array manipulation functions, eg. something like
(untested):

  while( <DATA> ) {

    my($pdbID,$resNo,$resID,undef,undef,undef,undef,$secStruct) = split;

or
    my($pdbID,$resNo,$resID,$secStruct) = (split)[0,1,2,7];

    push( @dsspdata, [$pdbID,$resNo,$resID,$secStruct];

  }

Another loop that is invariant with respect to its loop variable, so
needlessly repeated 5 times.

>    my $pdb = $dsspdata[$i][0];
>    my $residueNum = $dsspdata[$i][1];
>    my $secStructure = $dsspdata[$i][3];
>    push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
>   }
>   $k++;
>  }

That is as far as I wished to analyze your program. If you fix up the
peculiarities mentioned and re-post it (albeit a shorter version that
uses <DATA>), I or someone else can probably help you.

[rest of program snipped]

 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com

"Jim Gibson" <jgib@mail.arc.nasa.gov> wrote in message

news:230520071458272369%jgibson@mail.arc.nasa.gov...

> Have you taken the time to read the guidelines for posting to this
> group?
[...]
> That is as far as I wished to analyze your program. If you fix up the
> peculiarities mentioned and re-post it (albeit a shorter version that
> uses <DATA>), I or someone else can probably help you.

Admittedly I deserved that :(
Thanks for the tips, I've redesigned the program (loosing about 50 lines of
code, sweet ;)), but this time I'll post only the part that is of
consequence ;)

----------------------------------------------
#!/usr/bin/perl -w
use strict;
use warnings;

my %dane;
while (<DATA>) {
    my ( $x, $y, undef, $z ) = split;

    push @{ $dane{$x}->{$z} }, $y;

}

foreach my $pdbID ( keys %dane ) {
    foreach my $secID ( keys %{ $dane{$pdbID} } ) {
        my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );

        print "$pdbID @resnums $secID\n";
    }

}

__DATA__
1b6g     1      M    \N
1b6g     2      V    \N
1b6g     3      N    \N
1b6g     4      N    H
1b6g     5      N    H
1b6g     6      N    \N
3hba     7      W    H
2cdg     8      N     H
2cdg     9      V     \N
2cdg     10    M    \N
2cdg     11    A    B
2cdg     12    M    \N

-------------------------------------

the output is
3hba     7     7     H
1b6g     4     5     H
1b6g     1     6     \N
2cdg     8     8     H
2cdg     9     12   \N
2cdg     11   11   B

and I don't know how to get that sucker to identify when secStructID has
changed and finish one entity, and only store sequential groups of secondary
structures for the same pdbID.

the output should be:
1b6g     1      3      \N
1b6g     4      5       H
1b6g     6      6      \N
3hba     7      8       H
2cdg     9     10     \N
2cdg     11    A      B
2cdg     12    M     \N

And no, it's not homework ;) I'm trying to redo something that I have
written in Java some time ago, at the same time I'm learning perl, hence all
the weird loops in original code, instead of the perl-typical code. All I
know of hashes I have read in the last 7 hours, so please have mercy ;)
Before I started seeking help, it was 250 lines of code (mostly
java-style), doing less than the above. I will appreciate any hints as, I am
stuck.

Cheers,
Matt

It is not clear that a hash is the right way to go about this.  It seems
like you want to address the data in the same sequence that it is
presented, rather than storing everything in a mostly orderless data
structure.  If so, then you need to record the PDBid and seqID from the
last time through the loop and check to see if it changed on this time
through the loop.  That kind of code is conceptually simple but alas it is
tedious and ugly (in any language, as far as I know).  Out of curiosity,
how did you do it in Java?

This is what I came up with:

my ( $x, $y, undef, $z ) = split ' ', <DATA>;
my ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);

while (<DATA>) {
    my ( $x, $y, undef, $z ) = split;
    if ($x ne $last_x or $z ne $last_z) {
       print "$last_x $min $max $last_z\n";
       ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);
    };
    $max=$y;

}

print "$last_x $min $max $last_z\n";

It is ugly because the print has to happen in two places (in the loop
and after the loop) and the assignment has to happen in two places
(before the loop and in the loop), but as I say this is a very common
piece of ugliness.

> and only store sequential groups of
> secondary structures for the same pdbID.

> the output should be:
> 1b6g     1      3      \N
> 1b6g     4      5       H
> 1b6g     6      6      \N
> 3hba     7      8       H

Is this really what you want?  7 was in 3hba and 8 was in 2cdg, so you
didn't detect the transition from 3hba to 2cdg, contrary to what you say
above about "for the same pdbID".

> 2cdg     9     10     \N
> 2cdg     11    A      B
> 2cdg     12    M     \N
> And no, it's not homework ;) I'm trying to redo something that I have
> written in Java some time ago, at the same time I'm learning perl, hence
> all the weird loops in original code, instead of the perl-typical code.
> All I know of hashes I have read in the last 7 hours, so please have
> mercy ;) Before I started seeking help, it was 250 lines of code (mostly
> java-style), doing less than the above. I will appreciate any hints as, I
> am stuck.

Going back to your hash-based code: In stead of taking the first and last
element of the array @{$dane{$pdbID}->{$secID}}, like you currently do,
you could detect gaps in the values of this array and take the first and
last elements of each non-contiguous chunk.  You could probably even do
this in a single grep or map statement, but I think doing so would be
overly clever.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service                        $9.95/Month 30GB

In article <f32i2m$3j@news.onet.pl>, Teq <zro@poczta.onet.pl>
wrote:

> "Jim Gibson" <jgib@mail.arc.nasa.gov> wrote in message
> news:230520071458272369%jgibson@mail.arc.nasa.gov...
> > Have you taken the time to read the guidelines for posting to this
> > group?
> [...]
> > That is as far as I wished to analyze your program. If you fix up the
> > peculiarities mentioned and re-post it (albeit a shorter version that
> > uses <DATA>), I or someone else can probably help you.

> Admittedly I deserved that :(
> Thanks for the tips, I've redesigned the program (loosing about 50 lines of
> code, sweet ;)), but this time I'll post only the part that is of
> consequence ;)

Thank you for that.

> ----------------------------------------------
> #!/usr/bin/perl -w
> use strict;
> use warnings;

> my %dane;
> while (<DATA>) {
>     my ( $x, $y, undef, $z ) = split;

You are discarding your third field, yet it shows up in your output
below (last two lines). Either save it here or don't expect it to be
output.

>     push @{ $dane{$x}->{$z} }, $y;
> }

> foreach my $pdbID ( keys %dane ) {
>     foreach my $secID ( keys %{ $dane{$pdbID} } ) {
>         my @resnums = ( $dane{$pdbID}->{$secID}->[0],
> $dane{$pdbID}->{$secID}->[-1] );

You are printing the first and last values for each pdbID and secID
pair, and not accounting for the possibility that some values are
skipped. You need to add logic for that case, which is part of your
test data and reflected in your desired output.

>         print "$pdbID @resnums $secID\n";
>     }
> }

> __DATA__
> 1b6g     1      M    \N
> 1b6g     2      V    \N
> 1b6g     3      N    \N
> 1b6g     4      N    H
> 1b6g     5      N    H
> 1b6g     6      N    \N
> 3hba     7      W    H
> 2cdg     8      N     H

The above two lines have the same value in field 4, but different
values in field 1. Your desired output ignores the different value in
field 1 of the second of the two records. Is that correct?

You need to compare each record read with the previous record and see
if the relevant fields have changed. For this, a hash is not the best
data type. An array would be better.

> the output should be:
> 1b6g     1      3      \N
> 1b6g     4      5       H
> 1b6g     6      6      \N
> 3hba     7      8       H
> 2cdg     9     10     \N
> 2cdg     11    A      B
> 2cdg     12    M     \N

You want the output in the same order as the input. But plain hashes do
not preserve the order in which elements are inserted (unless you use a
special "tied" hash -- see perldoc -q "hash remember")

> And no, it's not homework ;) I'm trying to redo something that I have
> written in Java some time ago, at the same time I'm learning perl, hence all
> the weird loops in original code, instead of the perl-typical code. All I
> know of hashes I have read in the last 7 hours, so please have mercy ;)
> Before I started seeking help, it was 250 lines of code (mostly
> java-style), doing less than the above. I will appreciate any hints as, I am
> stuck.

Here is a program that gives your desired output, except for the
discarded element in the original record. I leave that as an exercise
for the reader.

#!/usr/local/bin/perl
use strict;
use warnings;

my $c = ['',0,'',0];
my @dane;
while (<DATA>) {
  chomp;
  my $d = [(split)[0,1,3]];
  push(@$d,$d->[1]);
  if( ($d->[2] eq $c->[2]) ) {
    if( $d->[1] eq $c->[3]+1 ) {
      $c->[3] = $d->[1];
    }else{
      push(@dane,[(@$c)[0,1,3,2]]) if $c->[0];
      $c = $d;
    }
  }else{
    push(@dane,[(@$c)[0,1,3,2]]) if $c->[0];
    $c = $d;
  }

}

push(@dane,[(@$c)[0,1,3,2]]) if $c->[0];
for (@dane) {
  print join("\t",@$_),"\n";

}

__DATA__
1b6g     1      M    \N
1b6g     2      V    \N
1b6g     3      N    \N
1b6g     4      N    H
1b6g     5      N    H
1b6g     6      N    \N
3hba     7      W    H
2cdg     8      N     H
2cdg     9      V     \N
2cdg     10    M    \N
2cdg     11    A    B
2cdg     12    M    \N

__OUTPUT__
1b6g    1       3       \N
1b6g    4       5       H
1b6g    6       6       \N
3hba    7       8       H
2cdg    9       10      \N
2cdg    11      11      B
2cdg    12      12      \N

 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com

<xhos@gmail.com> wrote in message

news:20070523202155.966$Pa@newsreader.com...

> It is not clear that a hash is the right way to go about this.  It seems
> like you want to address the data in the same sequence that it is
> presented, rather than storing everything in a mostly orderless data
> structure.  If so, then you need to record the PDBid and seqID from the
> last time through the loop and check to see if it changed on this time
> through the loop.  That kind of code is conceptually simple but alas it is
> tedious and ugly (in any language, as far as I know).  Out of curiosity,
> how did you do it in Java?

In Java the source was very much different because it was a raw file, these
here are already pre-processed by other scripts.
For each pdb file the data would be parsed and I would iterate through my
original parser-output arrays extracting data into variables, which then I
used to create AAObjects (containing resNum, resID, secStructID), which in
turn were then moved into aaObjectList - an array of objects.
Then I used these objects to create secondary structure objects and
secondary structure array of objects.
I would iterate around the initial aaObjectList checking for secStructID
(e.g. E,G,H) and each time I would discover a new one I'd set start
position, each time I'd encounter a different one I'd set end position (-1)
and pass the secStructID along with the length of the object start-to-end,
resID list from start position to the end position into a new object
SSObject.
Obviously as you said - the data was in proper order, unlike in case of
hashes, of which I was not aware :/
Java for me is more straightforward, but it's overkill for some tasks (my
original dssp parser was with all the bling ;) and it was something around 3
thousand lines), hence me working perl.
Still, it took me 3 days to understand that I can't really work with objects
in perl like I do in Java, and another 2 to realize that java-like loops of
30 lines+ can be compressed into 1 line of code (for this I thank Jim which
was kind enough to point it out ;))

> This is what I came up with:

[code cut]
Thank you very much for the code, I will work it out tomorrow, it's almost
3am here, I'm going to bed ;)

>> 1b6g     6      6      \N
>> 3hba     7      8       H
> Is this really what you want?  7 was in 3hba and 8 was in 2cdg, so you
> didn't detect the transition from 3hba to 2cdg, contrary to what you say
> above about "for the same pdbID".

You are absolutely right, my bad - these should have been separated, that's
what happens when I sit at my computer after midnight ;(.

> Going back to your hash-based code: In stead of taking the first and last
> element of the array @{$dane{$pdbID}->{$secID}}, like you currently do,
> you could detect gaps in the values of this array and take the first and
> last elements of each non-contiguous chunk.

The idea with using a hash here is mostly because I want to learn it, and
because it was adviced by another usenet user :) not because it is a good
practice (also currently I am such a newbie that I don't really know what is
good practice in the first place) - I have managed to conquer this problem
with a set of nested if, elsif's but it was UGLY and a couple HUNDRED lines
longer than a perl script should be, I will follow your tips. Thank you !

Cheers,
Matt

"Jim Gibson" <jgib@mail.arc.nasa.gov> wrote in message

news:230520071749320480%jgibson@mail.arc.nasa.gov...

>> my %dane;
>> while (<DATA>) {
>>     my ( $x, $y, undef, $z ) = split;

> You are discarding your third field, yet it shows up in your output
> below (last two lines). Either save it here or don't expect it to be
> output.

hmm, I'll double check but I'm almost sure it's not in the output, I am
considering using it, but it'll have to wait until I have finished the
database schema.

>> foreach my $pdbID ( keys %dane ) {
>>     foreach my $secID ( keys %{ $dane{$pdbID} } ) {
>>         my @resnums = ( $dane{$pdbID}->{$secID}->[0],
>> $dane{$pdbID}->{$secID}->[-1] );

>> 3hba     7      W    H
>> 2cdg     8      N     H

> The above two lines have the same value in field 4, but different
> values in field 1. Your desired output ignores the different value in
> field 1 of the second of the two records. Is that correct?

no no, that was a stupid mistake of mine, as Xho has pointed out in the
earlier message, the output should actually differentiate between these two
3hba     7      7    H
2cdg     8      8     H

> You need to compare each record read with the previous record and see
> if the relevant fields have changed. For this, a hash is not the best
> data type. An array would be better.
> You want the output in the same order as the input. But plain hashes do
> not preserve the order in which elements are inserted (unless you use a
> special "tied" hash -- see perldoc -q "hash remember")

I completely disregarded the fact that the data in the hash is unordered
assuming that the identification by pdbID and sequence of values in resNum
would be sufficient, I have to think about this.

> Here is a program that gives your desired output, except for the
> discarded element in the original record. I leave that as an exercise
> for the reader.

Thank you very much for all your trouble. Although the code by Xho above
seems far simpler I will definitely try to work out the idea with hashes,
best time to learn it is now :).

Regards,
Matt

Add to del.icio.us | Digg this | Stumble it | Powered by Megasolutions Inc