Re: Parsing Large Text Files
- Subject: Re: Parsing Large Text Files
- From: "Mark J. Reed" <email@hidden>
- Date: Thu, 1 May 2008 23:44:50 -0400
On Thu, May 1, 2008 at 11:38 PM, Bruce Robertson <email@hidden> wrote:
> The header line should not be changed or broken at 50. Only the following
> lines should be changed.
>
> Given:
>
>
> >2007006285 Uroporphyrinogen-III decarboxylase [LWCv2]
> VHYQQSHARIATLMAAANEPPTWQTIGHGLNFVHGDGKGYSALVSIVEKE
> IAEPTSLLIAPDLNGQLAVKDGVRKRASGIDVTWDLGLADSGIEAQAELW
> LGGGKTFVISPVKRGDNTKILGNVIKQMYNLSFETYANHA
>
> You should get:
>
>
> >2007006285 Uroporphyrinogen-III decarboxylase [LWCv2]
> AHNAYTEFSLNYMQKIVNGLIKTNDGRKVPSIVFTKGGGLWLEAQAEIGS
> SDALGLDWTVDIGSARKRVGDKVALQGNLDPAILLSTPEAIEKEVISVLA
> ASYGKGDGHVFNLGHGITQWTPPENAAAMLTAIRAHSQQYHV
OK. That's what my script as posted outputs, except that it adds the
text "reversed" to the first line.
> But yours gives:
>
> VVEGQAQDWTTLCRTI
> DARAAAFRQQGLAAGDCVALRGRNSVELVLAY
>
> LAALQLGARVLPLNPQLP
> DAQLQPLLPALDIDWGWSEAGDHWPGPVRPL
>
> TSDVAVATPVPTNPAVTWQ
> PGAPATLTLTSGSSGLPKGVLHCAANHLAS
>
> AAGLLAALPFTAGDGWLLSL
> PLFHVSGQGIVWRWLLRGARLLLVAEGDL
Not even close to what I get. How are you running it?
To avoid any copy/paste issues, I've attached the script file to this
email message, modified to remove the "reversed" tag, along with
sample input and output (just your original snippet). The attachment
probably won't make it to the list, just to you.
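
For anyone following along without the attachment, here is a minimal
sketch of the transformation being asked for: header lines left alone,
each sequence reversed as a whole and rewrapped at 50 columns. It is
written in Python purely for illustration and is not the attached
script. The names reverse_records and width are made up for this
example. It assumes every header line starts with '>' and that every
other line is sequence data. It also uses splitlines(), so it accepts
either LF or CRLF input, unlike the Perl script quoted further down,
which sets $/ to "\r\n>" and so expects CRLF.

```python
def reverse_records(text, width=50):
    """Reverse each record's sequence; leave '>' header lines untouched.

    Illustrative sketch only. Assumes header lines start with '>' and
    all other lines belong to the sequence of the preceding header.
    """
    out = []   # finished output lines
    seq = []   # sequence lines collected for the current record

    def flush():
        # Join the collected lines, reverse the whole sequence,
        # then rewrap it at `width` characters per line.
        joined = "".join(seq)[::-1]
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in text.splitlines():   # handles both LF and CRLF endings
        if line.startswith(">"):
            flush()                  # finish the previous record, if any
            out.append(line)         # header is copied through unchanged
        else:
            seq.append(line.strip())
    flush()                          # finish the last record

    return "".join(line + "\n" for line in out)


sample = (
    ">2007006285 Uroporphyrinogen-III decarboxylase [LWCv2]\r\n"
    "VHYQQSHARIATLMAAANEPPTWQTIGHGLNFVHGDGKGYSALVSIVEKE\r\n"
    "IAEPTSLLIAPDLNGQLAVKDGVRKRASGIDVTWDLGLADSGIEAQAELW\r\n"
    "LGGGKTFVISPVKRGDNTKILGNVIKQMYNLSFETYANHA\r\n"
)
print(reverse_records(sample), end="")
```

Run on the sample snippet, this prints the header unchanged, followed
by the reversed sequence in lines of at most 50 characters.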
>
> AQALAGCSHASLVPTQLQRLL
> AQNASLPALQHVLLGGAAIPVALTQRAE
>
> QAGIHCWCGYRLTEMASTVTAK
> ]2vCWL[ II se sagil dica-PM
>
> A/)gnimrof-PMA( sesatehtnys AoC-lycA 6826007002>
> A
>
> HNAYTEFSLNYMQKIVNGLIKTNDGRKVPSIVFTKGGGL
> WLEAQAEIGS
>
> DALGLDWTVDIGSARKRVGDKVALQGNLDPAILLSTPEAI
> EKEVISVLA
>
> SYGKGDGHVFNLGHGITQWTPPENAAAMLTAIRAHSQQYHV
> ]2vCWL[
>
>
>
> > Ok, so my small correction broke the output by leaving '>'s in.
> > Here's a no-really-this-time corrected version, with commentary added
> > so you can follow the logic.
> >
> > #!/usr/bin/perl
> >
> > # Read one whole protein at a time: instead of reading one line,
> > # keep reading until there's a CRLF followed by a '>'
> > $/ = "\r\n>";
> >
> > # Repeat while there's input remaining
> > while (<>)
> > {
> > # chop off initial > if any (only happens on first line)
> > s/^>//o;
> >
> > # chop off final > if any (all but last line)
> > s/>$//o;
> >
> > # strip off the first line (name of protein) so it doesn't get
> > # included in the reversal
> > s/^(.*?)\r\n(.*)$/$2/os;
> >
> > # but remember that name for later
> > my $name = $1;
> >
> > # get rid of all CRLF's
> > s/\r\n//og;
> >
> > # reverse what's left
> > $_ = reverse($_);
> >
> > # put CRLF's back in every 50 characters
> > s/.{50}/$&\r\n/og;
> >
> > # and output, with name
> > print ">$name reversed:\r\n$_\r\n";
> > }
> >
>
>
--
Mark J. Reed <email@hidden>
Attachment:
proteins.zip
Description: Zip archive