Keywording threads

some interesting results...

15 messages
03/05/2012 at 17:21

I've been trying to collate 'The thoughts of captain paranoia' again, prompted by this recent DofE malarkey.

What I wanted to do was find a tool to examine the saved 'print threads', and automatically find keywords, and produce lists of threads that contain these words.  But, despite the fact that we have played with information mining things like this a bit (not personally), I couldn't find a freeware tool to do this.
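
For the "lists of threads that contain these words" part, grep on its own may be enough. A minimal sketch, assuming the saved print threads sit in one directory as *.htm files; the helper name threads_with is invented here for illustration.

```shell
# threads_with WORD [DIR]: list saved thread files that mention WORD
# (hypothetical helper; assumes the threads are saved as DIR/*.htm)
threads_with() {
    word=$1
    dir=${2:-.}
    # -l: print only file names, -i: ignore case, -w: match whole words only
    grep -l -i -w -- "$word" "$dir"/*.htm 2>/dev/null
}

# Example call (commented out; needs saved threads in the current directory):
# threads_with pertex
```

Because this searches the raw HTML, a keyword that also happens to be a tag or attribute name would match every file, so it suits content words best.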

So I've been playing with the Unix 'productivity tools', such as tr, grep, awk & sort, and have now analysed all my contributions to the threads I've posted on.  This has produced a word frequency table, which I find quite interesting.

As is to be expected, the most frequent are to be found in the usual list of most common English words, but it's not long before interest-specific words start to appear.  e.g. the first such is at #33, being 'water'.  Then more appear:

<not consecutive>
jacket
fabric
layer
fleece
shell
waterproof
fuel
base
windproof
top
bag
gps
montane
air
lightweight
meths
pertex

<then little clusters start to appear>
gear
small
cheap
wear
warm
pan
walking

right
product
idea

pound
insulation
gas
paramo
ml

experience
fabrics
clothing

map
help
climbing

I find it interesting that this actually says quite a lot about my postings, and the things I post about...
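
The common English words at the top of the table can be stripped out so the interest-specific ones surface sooner. A sketch only, assuming a count file with "count word" lines and a hand-made stop-word list with one word per line; the helper name drop_common and both file names are made up.

```shell
# drop_common COUNTS STOPLIST: print the count lines whose word is not listed
# (hypothetical helper; COUNTS has "count word" lines, STOPLIST one word per line)
drop_common() {
    # While reading STOPLIST (first file argument), remember each stop word.
    # While reading COUNTS, print only lines whose second field is unlisted.
    awk 'FILENAME == ARGV[1] { stop[$1] = 1; next }
         !($2 in stop)' "$2" "$1"
}

# Example call (commented out; both files are assumptions):
# drop_common words.cnt stop.txt
```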

And yes, I know that I need to get out more...

03/05/2012 at 17:27

Please don't feel impelled to do this for my postings

03/05/2012 at 17:56

> Please don't feel impelled to do this for my postings

Done, but we have a very small overlap, so I suspect the figures aren't significant...

The first uncommon ones are

thought
sleep
people
weather
thanks
mountains
milford
maternity
leave
hassles
cold
chance
attractions
unrealistic
traffic
sleeping

Does that sound familiar...?

03/05/2012 at 18:00

Phew...

At least I mentioned mountains

03/05/2012 at 21:02

CP,

Not the bloody grep command

03/05/2012 at 21:23

Global Report Exception Print..... I knew that ONE DAY knowing that would come in handy!!

BTW CP do you have access to the database?

03/05/2012 at 23:50

Well, if you did that to my posts you'd need a supercomputer to wade through it all, and the list of words would probably be a large one!!!

PS shortest post for a while.

PPS don't tell me the average without, on my posts. If you can find that out, of course.

04/05/2012 at 06:48

CP

Have you taken into account that words are sometimes mis-spelt by posters? Probably the most frequent error is 'cannister'.

Go on, tell us if my guess is correct!

Hugh

04/05/2012 at 07:18

Tune in for the next grepping episode ...

IGMC

04/05/2012 at 13:49

> Have you taken into account that words are sometimes mis-spelt by posters?

Before I extracted just my posts from the threads in order to perform the analysis, there were about 77k 'unique' words in the resulting words report.  Lots of them were speeling mistooks.  Unfortunately, I'm using cygwin, rather than a true Unix, so I don't have access to 'spell'.
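
Without 'spell', a rough substitute is to compare the unique-word list against any plain word list and see what is left over. A sketch, assuming a dictionary file with one word per line is available (many Unix systems have /usr/share/dict/words, but that path is an assumption and may not exist under cygwin); the helper name not_in_dict is invented.

```shell
# not_in_dict WORDS DICT: print the words from WORDS that do not appear in DICT
# (hypothetical helper; both files hold one word per line)
not_in_dict() {
    tmp=$(mktemp -d) || return 1
    # comm needs both inputs sorted the same way, so sort them in the C locale
    LC_ALL=C sort -u "$1" > "$tmp/words"
    tr 'A-Z' 'a-z' < "$2" | LC_ALL=C sort -u > "$tmp/dict"
    # -2 -3: hide lines found only in DICT and lines common to both files
    LC_ALL=C comm -23 "$tmp/words" "$tmp/dict"
    rm -rf "$tmp"
}

# Example call (commented out; the dictionary path is an assumption):
# not_in_dict words.srt /usr/share/dict/words
```

Brand names and abbreviations (pertex, paramo, gps) would land in the output alongside genuine misspellings, so the result is a list of candidates, not of errors.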

My own words list is about 21k.

I don't have access to the thread database, just the threads I've saved from OM's printthreads.  I keep meaning to learn how to automate http accesses and save the files automatically, as it would save me the tedious task of manually saving.
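
Automating the saving could look something like the loop below. A sketch under several assumptions: curl is installed, the print-thread URLs have been collected by hand into a text file with one per line, and the pages need no login. The helper name save_threads and the threadN.htm naming are made up.

```shell
# save_threads URLFILE OUTDIR: fetch each URL listed in URLFILE into OUTDIR
# (hypothetical helper; URLFILE holds one print-thread URL per line)
save_threads() {
    mkdir -p "$2" || return 1
    n=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue      # skip blank lines
        n=$((n + 1))
        # -f: treat HTTP errors as failures, -sS: quiet but still show errors
        curl -fsS -o "$2/thread$n.htm" "$url" || echo "failed: $url" >&2
    done < "$1"
}

# Example call (commented out; urls.txt is an assumed file of thread URLs):
# save_threads urls.txt saved
```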

> Global Report Exception Print

grep is generally understood to mean 'get regular expression (print)', although I note a different origin in Wiki: the ed command sequence g/re/p, i.e. globally search for a regular expression and print the matching lines.

Unix productivity tools are very powerful, and allow you to do all sorts of things that DOS and Windows don't.  Provided you've spent years sitting on a bean-bag, wearing sandals, and growing long hair and a beard...

04/05/2012 at 13:50

Here's the script, in case anyone's interested...

# report original size
echo '*.htm'
cat *.htm | wc
# remove thread title (.htm)
# convert all non-alpha characters to newline (breaks into single-word lines, removes html tags)
# translate to lower case
# remove blank lines
# sort, ignoring case
grep -E -v '\.htm' "$1.src" | tr -c "A-Za-z'" '\n' | tr 'A-Z' 'a-z' | grep '[[:alpha:]]' | sort -f > "$1.txt"
wc "$1.txt"
# count duplicates
awk -f count.awk "$1.txt" | sort -n -r > "$1.cnt"
wc "$1.cnt"
# remove duplicates
sort -u "$1.txt" > "$1.srt"
wc "$1.srt"
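
To see what the word-splitting stages do to a line of saved HTML, they can be tried on a single string. A sketch of the same tr, grep and sort chain with the character sets quoted; the wrapper name to_words and the sample input are made up.

```shell
# to_words: read text on stdin, write one lower-case word per line, sorted
# (illustrative wrapper around the tr | tr | grep | sort stages)
to_words() {
    tr -c "A-Za-z'" '\n' |     # every character that is not a letter or ' becomes a newline
    tr 'A-Z' 'a-z' |           # fold to lower case
    grep '[[:alpha:]]' |       # keep only lines that contain a letter
    LC_ALL=C sort
}

printf '<p>Meths, not gas; meths!</p>\n' | to_words
```

On this input the angle brackets and punctuation disappear but the tag name p survives as a word, so tag names will show up in the frequency table unless they are filtered separately.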

04/05/2012 at 13:50

And the counting AWK script, count.awk

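# script to count instances of words
# input must be sorted, one word per line

# the first word starts the first run, with a count of 1
NR == 1 {word = $1; count = 1; next}

{if (word == $1)
    count++
else
    {
    # word changed: report the finished run and start a new one
    printf("%06d %s\n", count, word)
    word = $1
    count = 1
    }
}

END {
    # report the final run, unless the input was empty
    if (NR > 0)
        printf("%06d %s\n", count, word)
}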
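
For comparison, the standard uniq -c does the same run counting on sorted input. A sketch that reproduces the zero-padded "NNNNNN word" layout; the helper name count_words is invented, and this is an alternative to count.awk, not the method used above.

```shell
# count_words FILE: count repeated lines in an already sorted word list
# and print "NNNNNN word" lines, most frequent first
count_words() {
    uniq -c "$1" |                              # "  count word" per distinct word
    awk '{ printf("%06d %s\n", $1, $2) }' |     # zero-pad the count to six digits
    sort -n -r                                  # highest count first
}

# Example call (commented out; words.txt is an assumed sorted word list):
# count_words words.txt
```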
 

04/05/2012 at 14:48

Weekend is looking nice.....what's the code command for cease work, shift focus to Lochaber?

No, never mind......found it.

04/05/2012 at 15:18

> what's the code command for cease work

shutdown -h now 'shifting focus to Lochaber'

09/05/2012 at 21:21

Someone needs to get out more?  But fascinating, of course!