Perl exercise: Recoding Word Frequency Counter

We made the word frequency counter look at word combinations and how often they occur. In this text every combination happens to occur only once, but that's fine. We did this by adding another hash to the read_text subroutine that was already in the script:

$prevseen{$prevword . "-" . $word}++;
print $prevword;
$prevword = $word;


This reads every word, prints the previous word before moving on to the next one, and concatenates the previous word and the current word with a dash in between so we can tell the two apart; that combined string becomes the hash key. The ++ counts the number of occurrences of each combination in the text.
We also added

my $prevword = "[start off]";

to tell it to start afresh after every line break, so it doesn't accidentally associate the last word of the previous paragraph with the first word of the next one.
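To see what this produces, here is a minimal standalone sketch (not part of the exercise script; the two sample lines are made up) of the same pair-counting idea, showing how resetting $prevword at each line keeps the lines separate:

use strict;
use warnings;

my %prevseen;

# Two short made-up lines, to show the per-line reset of $prevword.
my @lines = ("the cat sat", "the dog ran");

foreach my $line (@lines) {
    my $prevword = "[start off]";    # reset at the start of every line
    foreach my $word (split /\s+/, $line) {
        $prevseen{ $prevword . "-" . $word }++;
        $prevword = $word;
    }
}

# Prints pairs such as "[start off]-the 2" and "the-cat 1".
foreach my $pair (sort keys %prevseen) {
    print "$pair $prevseen{$pair}\n";
}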

Then we changed the print_text subroutine into a sorting subroutine, which sorts the pairs not by their values (the number of occurrences) but in alphabetical order, by sorting on the keys (i.e. the concatenated word pairs themselves). We changed it into the following (a sort-by-value alternative is sketched after the code):

sub sorted_printprev_txt {
    my @sorted_prevseen = sort (keys %prevseen);
    foreach my $w (@sorted_prevseen) {
        print " $w $prevseen{$w}\n";
    }
}
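For contrast, a frequency-ordered version would sort on the hash values rather than the keys, much like the value-based sort used later for the single-word counts. This is only a sketch of that alternative (the subroutine name is made up), working on the same %prevseen hash:

sub sorted_printprev_txt_by_count {
    # Sort pairs by how often they occur, most frequent first;
    # ties fall back to alphabetical order of the key.
    my @by_count = sort { $prevseen{$b} <=> $prevseen{$a} || $a cmp $b } keys %prevseen;
    foreach my $w (@by_count) {
        print " $w $prevseen{$w}\n";
    }
}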

The code as shown below works, and prints the intended output to the screen.

use strict;

my %seen = ();
my %prevseen = ();
read_text();
sorted_printprev_txt();

sub read_text {
    while (<DATA>) {
        chomp;
        my $prevword = "[start off]";   # reset the pair tracking at every line
        print $_;
        my @words = split(/\s/, $_);
        foreach my $word (@words) {
            $word =~ s/[.",]*$//;       # strip off punctuation, etc.
            $prevseen{$prevword . "-" . $word}++;
            print $prevword;
            $prevword = $word;
        }
    }
}

sub sorted_printprev_txt {
    my @sorted_prevseen = sort (keys %prevseen);
    foreach my $w (@sorted_prevseen) {
        print " $w $prevseen{$w}\n";
    }
}

__DATA__

London has done well in the big freeze. My roads were gritted and salted, and in good time. The buses ran where they could. Trains and Tubes worked valiantly and there were few cases of "millions trapped by wrong snow on line". The public services deserve a thank you, as does the Kensington shop that offered me a glass of mulled wine on the pavement.

So what does the […] ETCETERA

This refers to the data/text that is incorporated in the script file itself. We then tried to enable the script to read in data from any file (selected by the user) and to save the output (the processed text) in another file (also specified by the user). I lost track of some details when we (Ana, Cliff and I) worked on this, but I will try to explain how the code is structured.

use strict;

my %seen = ();
my %prevseen = ();

print "Welcome to Word Frequency, your friendly text analysing service.\n";
print "Which file would you like to read from? It must be in this directory.\n";
my $infilename = <STDIN>;
chomp $infilename;
open (INFILE, "$infilename");
my $blah;
print "Which file would you like to write? It will be in this directory as well.  If you want it somewhere else you'll have to move it yourself.  I do have other things to get on with, you know.\n";
my $outfilename = <STDIN>;
chomp $outfilename;
print "Would you like a normal word frequency analysis?  Or would you like a word association frequency analysis? Enter A for the former, B for the latter.\n";
my $selection = <STDIN>;
chomp $selection;
read_text();

if ($selection eq "A")
{
    sorted_print_txt();
    exit;
}

if ($selection eq "B")
{
    sorted_printprev_txt();
    exit;
}
print "You've made an incorrect selection.  I really don't have time for this.  Come back tomorrow.  Or not at all.\n";

This first part allows for direct user input: it welcomes the user and asks him/her to select a text file to work with. This file should be in the same directory the user is in in the terminal (just to keep it easy). The user types in the desired file name (<STDIN>) and presses [enter]; this input is then chomped so the trailing newline doesn't become part of the file name. The file is opened on the INFILE filehandle, and the scalar $blah is declared to hold each line read from it later on. The program then asks under what name the output should be saved (also in the same directory), and this input is again chomped to avoid errors.
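As written, the open call does not check whether the chosen file actually exists. A slightly more defensive version of this prompt-and-open step could look like the sketch below (a standalone illustration, not the code we used), using the three-argument form of open and dying with the system error message if the file cannot be read:

use strict;
use warnings;

print "Which file would you like to read from? It must be in this directory.\n";
my $infilename = <STDIN>;
chomp $infilename;    # remove the trailing newline from the typed name

# Three-argument open with an explicit read mode; stop with the system error if it fails.
open(my $infile, "<", $infilename)
    or die "Could not open '$infilename' for reading: $!\n";

while (my $line = <$infile>) {
    chomp $line;
    # ... process $line here, as read_text() does in the full script ...
}
close $infile;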

The user can choose either a normal word frequency analysis [A] (based on Graham's example script) or the word association frequency analysis of the pairs that we wrote ourselves [B]; the first choice calls the subroutine sorted_print_txt and the latter our subroutine sorted_printprev_txt, with both relying on read_text for the counting. If neither of these is entered, the program tells the user that an incorrect selection has been made.
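For illustration only (this is not the version we used), the two if blocks and the error message could also be written as a single if/elsif/else chain, which keeps the invalid-input case inside the same structure:

if ($selection eq "A") {
    sorted_print_txt();
}
elsif ($selection eq "B") {
    sorted_printprev_txt();
}
else {
    print "You've made an incorrect selection.  I really don't have time for this.  Come back tomorrow.  Or not at all.\n";
}

The subroutines themselves, as they stand in our script, are shown below.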

sub read_text {
    while ($blah = <INFILE>) {
        chomp $blah;                    # chomp the line we just read, not $_
        my $prevword = "[start off]";   # reset the pair tracking at every line

        my @words = split(/\s/, $blah);
        foreach my $word (@words) {
            $word =~ s/[.",]*$//;       # strip off punctuation, etc.
            $seen{$word}++;                          # single-word counts
            $prevseen{$prevword . "-" . $word}++;    # word-pair counts

            $prevword = $word;
        }
    }
}

sub sorted_print_txt {
    # Sort words by frequency, highest count first.
    my @sorted_seen = sort { $seen{$b} <=> $seen{$a} } keys %seen;

    open LOG, ">>$outfilename";
    select LOG;    # make LOG the default filehandle, so plain print goes to the output file

    foreach my $w (@sorted_seen) {
        print " $w $seen{$w}\n";
    }
}


sub sorted_printprev_txt {
    # Sort the word pairs alphabetically by key.
    my @sorted_prevseen = sort (keys %prevseen);
    open LOG, ">>$outfilename";
    select LOG;    # again redirect plain print to the output file
    foreach my $p (@sorted_prevseen) {
        print " $p $prevseen{$p}\n";
    }
}
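Assuming the script is saved as, say, wordfreq.pl (the file names here are made up for illustration), a run in the terminal looks roughly like this:

$ perl wordfreq.pl
Welcome to Word Frequency, your friendly text analysing service.
Which file would you like to read from? It must be in this directory.
freeze.txt
Which file would you like to write? It will be in this directory as well. [...]
pairs.txt
Would you like a normal word frequency analysis? [...] Enter A for the former, B for the latter.
B

The results are appended to pairs.txt rather than printed to the screen, because select LOG makes LOG the default filehandle for print; and because the file is opened with >>, repeated runs keep adding to the same output file.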

