Regular Regular Expression Exercises

for Regular People

@dancow
http://regex.danwin.com/

What can we find with regular expressions?

Patterns

...well, in text...

(So...maybe stop wasting so much energy turning text into data.)

We can find love

love
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Words that begin with love

\b stands for word boundary
\blove
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Love and only love

\blove\b
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Every word that begins with the letter l

\bl\w+
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Every word at the end of a sentence

i.e. any sequence of word characters followed immediately by a period
\w+\.
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Every word at least 10 characters long

\w{10,}
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Every line without a single T or t

i.e. any line where every character is not the letter T
^[^Tt]+$
Chorus  Now old desire doth in his death-bed lie,
  And young affection gapes to be his heir;
  That fair for which love groan'd for and would die,
  With tender Juliet match'd, is now not fair.
  Now Romeo is beloved and loves again,
  Alike betwitched by the charm of looks,
  But to his foe supposed he must complain,
  And she steal love's sweet bait from fearful hooks:
  Being held a foe, he may not have access
  To breathe such vows as lovers use to swear;
  And she as much in love, her means much less
  To meet her new-beloved any where:
  But passion lends them power, time means, to meet
  Tempering extremities with extreme sweet.

Data cleaning

Find: ((Donald )?Trump)

DONALD TRUMP MAY have bullied his way into the center of the American political arena, but during his Super Tuesday victory speech last week, the real action took place in the margins. That's where you could find Chris Christie apparently lost in reverie, his bewildering expressions instantly stealing the spotlight from Trump's choreographed Main Event. Twitter exploded with interpretations. Was he being held hostage? Realizing he had made a Gob Bluth-like "huge mistake"? Experiencing all the stages of grief simultaneously? Suddenly, Trump was only the nation's second-most-compelling political psychodrama. Within minutes, Viners had re-cut the video, editing Trump out of the frame entirely and zooming in on Christie's face. The supporting player had become the star.

Replace: A Man With The Best Hands

A Man With The Best Hands MAY have bullied his way into the center of the American political arena, but during his Super Tuesday victory speech last week, the real action took place in the margins. That's where you could find Chris Christie apparently lost in reverie, his bewildering expressions instantly stealing the spotlight from A Man With The Best Hands's choreographed Main Event. Twitter exploded with interpretations. Was he being held hostage? Realizing he had made a Gob Bluth-like "huge mistake"? Experiencing all the stages of grief simultaneously? Suddenly, A Man With The Best Hands was only the nation's second-most-compelling political psychodrama. Within minutes, Viners had re-cut the video, editing A Man With The Best Hands out of the frame entirely and zooming in on Christie's face. The supporting player had become the star.

Regex from the command-line with grep (or ack or ag)

globally search a regular expression and print

$ ack 'pattern' somefile
  • Open and read a `somefile`, line by line
  • If a line matches the given `pattern`
  • Print that line

Looking for love at the command-line with ack

  • Type in the name of the program ack
  • Type in the pattern 'love'
  • Type in the filename romeoandjuliet
ack 'love' romeoandjuliet

Looking for long words in all of Romeo and Juliet

  • Any sequence of word-characters, 10-or-more in a row
ack '\w{10,}' romeoandjuliet

Less is sometimes better

You don't need to see all of the text at once

Just show the match (i.e. the long word)

Sometimes we only want to see what we've matched, not the entire line that the match is in.
  • Use the -o flag to specify that we don't want to print the entire line.
ack -o '\w{10,}' romeoandjuliet

Sort the list of long words

Rearrange each printed line in alphabetical order.
  • Use good old Unix pipes, |, to send a line of text from one program to the other
ack -o '\w{10,}' romeoandjuliet  | sort

How many unique words are there, and how many of each?

Only show each unique line once, along with how many times it appeared in the output.
  • Use another pipe, |, to connect another program
  • Why do we have to sort before we count uniques? Because that's how helplessly dumb your computer (and every computer) is.
ack -o '\w{10,}' romeoandjuliet | sort  | uniq -c
   1 triumphant
   3 unaccustom
   1 unattainted
   2 underneath
   1 unfurnished
   1 unpleasing
   1 unreasonable
   1 unsatisfied
   1 unsubstantial
   1 unthankfulness
   1 unworthiest
   1 whispering
   1 wretchedness
   1 yesternight

Make a word cloud (without the cloud part)

Resort the output in numerical order (i.e. frequency of unique word)
  • Use another pipe, |, to connect another call to sort
  • Use sort's -n flag so that it sorts numerically and not alphabetically.
ack -o '\w{10,}' romeoandjuliet |
      sort | uniq -c  | sort -n
   3 confession
   3 counterfeit
   3 inconstant
   3 nightingale
   3 oppression
   3 philosophy
   3 unaccustom
   4 bridegroom
   4 gentlewoman
   5 honourable
   5 lamentable
   7 Apothecary
   7 banishment
   8 churchyard

Search all the files in all the folders

shakespeare-tree.png

Search all the files in all the folders

ack -ho '\w{10,}' shakespeare/ |
    sort | uniq -c | sort -n

Frequency count of "I [something]"

ack -ho 'I \w+' shakespeare/ |
  sort | uniq -c | sort -n
 325 I was
 326 I say
 344 I had
 348 I must
 394 I shall
 395 I pray
 403 I think
 631 I know
 656 I would
 833 I do
1554 I will
1611 I have
1865 I am

Frequency count of "I am [something]"

ack -ho 'I am \w+' shakespeare/ |
  sort | uniq -c | sort -n
  18 I am for
  22 I am too
  23 I am as
  24 I am bound
  25 I am so
  25 I am to
  25 I am your
  35 I am in
  41 I am no
  46 I am sorry
  49 I am the
  59 I am glad
  70 I am sure
 136 I am a
 137 I am not

Frequency count of just the third word in "I am [something]"

ack -h 'I am (\w+)' --output '$1' shakespeare/ |
    sort | uniq -c | sort -n
  22 too
  23 as
  24 bound
  25 so
  25 to
  25 your
  35 in
  41 no
  46 sorry
  49 the
  59 glad
  70 sure
 136 a
 137 not

Frequency count of the fourth word in:
"I am [a or an or the] [something]"

ack -h 'I am (?:a|an|the) (\w+)' \
          --output '$1' shakespeare/ |
      sort | uniq -c | sort -n
   3 son
   3 subject
   4 honest
   4 maid
   4 villain
   5 soldier
   5 true
   5 very
   6 ass
   6 woman
   7 king
   8 poor
  10 man
  19 gentleman

Use the Twitter API to move data...

...from twitter.com...

trump-tweets.jpg

...to plain text

trump-tweets-as-csv.png

...but not to a spreadsheet (yet)

trump-tweets-as-spreadsheet.png

Trump's hashtags?

  • The # character
  • Followed by one-or-more word characters
#\w+
#MakeAmericaGreatAgain #Trump2016"
707722045224833024,2016-03-10 00:17:39 +0000,realDonaldTrump,I will be giving a lengthy interview to @andersoncooper tonight at 8pmE from Florida. Watch on @CNN! #MakeAmericaGreatAgain #Trump2016
#LightweightSenatorMarcoRubio https://t.co/w7G5vJ511O"
#MakeAmericaGreatAgain #Trump2016 https://t.co/hNkkRx66Hc"
707470040954761216,2016-03-09 07:36:16 +0000,realDonaldTrump,Thank you Hawaii! #Trump2016
707388018382995456,2016-03-09 02:10:20 +0000,realDonaldTrump,RT @RSBNetwork: LIVE Stream now: Donald Trump press conference #TrumpTrain #Trump2016 https://t.co/s7WgNXR27A
707386022569623552,2016-03-09 02:02:25 +0000,realDonaldTrump,Thank you Michigan! #Trump2016
707377544224509952,2016-03-09 01:28:43 +0000,realDonaldTrump,Thank you Mississippi! #Trump2016
707374051581362176,2016-03-09 01:14:50 +0000,realDonaldTrump,Thank you! #Trump2016 https://t.co/WvTkL8mMHY
707373529818386434,2016-03-09 01:12:46 +0000,realDonaldTrump,"Thank you America! #Trump2016

realDonaldTrump's most-frequent hashtags?

ack -o '#\w+' realdonaldtrump.csv |
   sort | uniq -c | sort -n
  • Look for the pattern #\w+
  • Print only the match
  • Sort the lines alphabetically
  • Print the count of each unique string
  • Sort the lines numerically

Who does realDonaldTrump mention?

@\w+
  • The @ character
  • Followed by one-or-more word characters
707722839684685824,2016-03-10 00:20:48 +0000,realDonaldTrump,"I will be interviewed in a town hall format on @FoxNews with @seanhannity at 10pmE. Enjoy!
707722045224833024,2016-03-10 00:17:39 +0000,realDonaldTrump,I will be giving a lengthy interview to @andersoncooper tonight at 8pmE from Florida. Watch on @CNN! #MakeAmericaGreatAgain #Trump2016
707388018382995456,2016-03-09 02:10:20 +0000,realDonaldTrump,RT @RSBNetwork: LIVE Stream now: Donald Trump press conference #TrumpTrain #Trump2016 https://t.co/s7WgNXR27A
707353658883706880,2016-03-08 23:53:48 +0000,realDonaldTrump,"I hear @NBCNews / @WSJ came out with another one of their phony polls. While I am leading, they are totally discredited after last S.C. poll"
707351991807315968,2016-03-08 23:47:11 +0000,realDonaldTrump,RT @EricTrump: Aloha Hawaii: We would be honored to have your vote! Find your caucus https://t.co/iNgQS80mxX #TrumpWaikiki #Mahalo https://t.co/ggTwDLbSG7
707226089496690688,2016-03-08 15:26:54 +0000,realDonaldTrump,"""@EricTrump: Wishing our father tremendous luck in Michigan, Mississippi, Idaho & Hawaii today! Amazing father! https://t.co/cYpx4eC9W @Q"""
707225181513719808,2016-03-08 15:23:17 +0000,realDonaldTrump,Thank you @EricTrump! https://t.co/U1iTLmcbAc
707004752756936704,2016-03-08 00:47:23 +0000,realDonaldTrump,I will be interviewed by @SeanHannity tonight at 10pm EST on @FoxNews! Enjoy!

Who does realDonaldTrump most frequently mention?

ack -o '@\w+' realdonaldtrump.csv |
   sort | uniq -c | sort -n
  20 @nytimes
  20 @politico
  21 @TheBrodyFile
  22 @nbcsnl
  23 @AnnCoulter
  23 @greta
  24 @DRUDGE_REPORT
  25 @WSJ
  26 @KarlRove
  26 @MichaelCohen212
  28 @ABC
  35 @EricTrump
  42 @Morning_Joe
  46 @oreillyfactor
  56 @seanhannity
  60 @foxandfriends
  68 @DanScavino
  68 @megynkelly
  80 @JebBush
 165 @CNN
 203 @FoxNews
1191 @realDonaldTrump

What words does @realDonaldTrump associate @JebBush with?

ack -io '\w+ \w+ @JebBush \w+' realdonaldtrump.csv
who supported @JebBush are now
How can @JebBush beat Hillary
time lightweight @JebBush tried to
energy candidate @JebBush has wasted
Sad sack @JebBush has just
to the @JebBush campaign 3
idea for @JebBush whose campaign
sorry for @JebBush and how
millions about @JebBush when he
is that @JebBush mother is
failed candidate @JebBush said that
just past @JebBush in the
the ridiculous @JebBush ad about
idea for @JebBush whose campaign
leaders like @JebBush allowed 9
realDonaldTrump doubles @JebBush in New

What words does @realDonaldTrump associate @nyt accounts with?

ack -o '\w+ \w+ @nyt\w+ \w+' realdonaldtrump.csv
the failing @nytimes are
The failing @nytimes does
The failing @nytimes should
at the @nytimes have
at the @nytimes bought
The failing @nytimes should
in the @nytimes states
of the @nytimes waving
the false @nytimes story
did the @nytimes fall
and the @nytimesworld is

csvcut

Selectively print columns from a delimited text file.
csvcut -c 'column_name_1,column_name_9' somefile

Just print the "Posted at" column

csvcut -c 'Posted at' realdonaldtrump.csv
Posted at
2016-03-10 16:15:45 +0000
2016-03-10 00:20:48 +0000
2016-03-10 00:17:39 +0000
2016-03-09 23:00:30 +0000
2016-03-09 22:39:30 +0000
2016-03-09 21:46:12 +0000
2016-03-09 07:36:16 +0000
2016-03-09 03:58:28 +0000
2016-03-09 02:10:20 +0000
2016-03-09 02:02:25 +0000
2016-03-09 01:28:43 +0000
2016-03-09 01:14:50 +0000
2016-03-09 01:12:46 +0000
2016-03-08 23:58:02 +0000
2016-03-08 23:53:48 +0000

Matching year and month in the timestamp

csvcut -c 'Posted at' realdonaldtrump.csv |
  ack '\d{4}-\d{2}'
  • Print only the Posted at column
  • Match 4 digits, followed by a hypen, followed by 2 digits
2016-03-10 16:15:45 +0000
2016-03-10 00:20:48 +0000
2016-03-10 00:17:39 +0000
2016-03-09 23:00:30 +0000
2016-03-09 22:39:30 +0000
2016-03-09 21:46:12 +0000
2016-03-09 07:36:16 +0000
2016-03-09 03:58:28 +0000
2016-03-09 02:10:20 +0000
2016-03-09 02:02:25 +0000
2016-03-09 01:28:43 +0000
2016-03-09 01:14:50 +0000
2016-03-09 01:12:46 +0000
2016-03-08 23:58:02 +0000

Print the frequency of tweets per month

csvcut -c 'Posted at' realdonaldtrump.csv |
  ack -o '(?<= )\d{2}(?=:)' |
  sort | uniq -c
 212 2015-07
 441 2015-08
 441 2015-09
 815 2015-10
 599 2015-11
 634 2015-12
 492 2016-01
 495 2016-02
 152 2016-03

Print the frequency of tweets per month as a CSV

csvcut -c 'Posted at' realdonaldtrump.csv |
  ack -o '\d{4}-\d{2}' |
  sort | uniq -c |
  ack '(\d+) (.+)' --output '$2,$1'
2015-07,212
2015-08,441
2015-09,441
2015-10,815
2015-11,599
2015-12,634
2016-01,492
2016-02,495
2016-03,152

Import tweet-frequency-as-CSV into Excel

trump-tweets-years.jpg

Print the frequency of tweets per day

csvcut -c 'Posted at' realdonaldtrump.csv |
  ack -o '\d{4}-\d{2}-\d{2}' |
  sort | uniq -c |
  ack '(\d+) (.+)' --output '$2,$1'
trump-tweets-excel-days.jpg
2015-07-25,22
2015-07-26,39
2015-07-27,48
2015-07-28,33
2015-07-29,28
2015-07-30,27
2015-07-31,15
2015-08-01,7
2015-08-02,3
2015-08-03,47
2015-08-04,29
2015-08-05,21
2015-08-06,18
2015-08-07,60
2015-08-08,9
2015-08-09,14

Print the frequency of tweets per hour

csvcut -c 'Posted at' realdonaldtrump.csv |
  ack -o '(?<= )\d{2}(?=:)' |
  sort | uniq -c |
  ack '(\d+) (.+)' --output '$2,$1'
trump-tweets-excel-hours.jpg
00,250
01,248
02,264
03,255
04,138
05,89
06,50
07,23
08,42
09,57
10,85
11,174
12,230
13,242
14,216
15,233
16,194
17,190
18,198
19,237
20,231
21,211
22,187
23,237

Working with Messy Text

wsj-clinton-emails.jpg

via pdf/HRCEmail_NovWeb/C05793914.pdf

clintonpdfemailsecret.jpg

via HRCEmail_NovWeb/C05793914.txt

       UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05793914 Date: 11/30/2015

Classified by DAS, A/GIS, DoS on 05/22/2015 — Class
: SECRET/NOFORN — Reason: 1.4(D) — Declassify on:                                               RELEASE IN PART
11/18/2032                                                                                      B1 ,B7(D),B7(E),1 .4(D),B7(A)

From:                               Sullivan, Jacob J <Sullivan.11@state.gov>
Sent:                               Sunday, November 18, 2012 8:44 PM
To:
Subject:                             Fw: FYI - Report of arrests -- possible Benghazi connection



 Fyi


 From: Jones, Beth E
 Sent: Sunday, November 18, 2012 07:14 PM
 To: Burns, William 3; Sherman, Wendy R; Sullivan, Jacob 3
 Cc: Dibble, Elizabeth L; Roebuck, William V
 Subject: Fw: FYI - Report of arrests -- possible Benghazi connection

 This preliminary, but very interesting, pls see below. FBI in Tripoli is fully involved.


 From: Roebuck, William V
 Sent: Sunday, November 18, 2012 07:01 PM
 To: Jones, Beth E; Maxwell, Raymond D; Abdalla, Alyce N; Sidereas, Evyenia; Miller, James N
 Subject: FYI - Report of arrests -- possible Benghazi connection



Capturing the From: field

In every email file, look for the first instance of From:, followed by a bunch of white space, and then capture the first non-whitespace character to the end of the line
ack -h -m 1 -A3 '^From: +(.+)' txt/
From:                             Sullivan, Jacob J <SullivanJJ@state.gov>
Sent:                             Friday, December 4, 2009 3:31 AM
To:
Subject:                          Iran
--
From:                              Mills, Cheryl D <MillsCD@state.gov>
Sent:                              Friday, December 4, 2009 9:35 AM
To:                                H
Subject:                           FW: Frm SecState Trip to Middle East
--
From:                               Abedin, Huma <AbedinH@state.gov>
Sent:                               Friday, December 4, 2009 1:03 PM

Frequency count of the From: field

In every email file, look for the first instance of From:, followed by a bunch of white space, and then capture the first non-whitespace character to the end of the line
ack -h -m 1 '^From: +(.+)' txt/ --output '$1' |
     sort | uniq -c | sort -n
 189 Abedin, Huma <AbedinH©state.gov>
 210 Coleman, Claire L <ColemanCL@state.gov>
 219 Huma Abedin <Huma@clintonemail.com >
 227 Mills, Cheryl D <MillsCD©state.gov>
 234 Mills, Cheryl D <MillsCD@state.gov >
 240 Sullivan, Jacob J <Sullivann@state.gov>
 307 Valmoro, Lona J <ValmoroU@state.gov>
 372 sbwhoeop
 431 Hanley, Monica R <HanleyMR@state.gov>
 532 Jiloty, Lauren C <JilotyLC@state.gov>
2309 Sullivan, Jacob J <SullivanJJ@state.gov>
2768 Abedin, Huma <AbedinH@state.gov>
3257 H <hrod17@clintonemail.com>
3664 Mills, Cheryl D <MillsCD@state.gov>
3927 H <hrod17@clintonemail.com >

We can't find SECRETs on a public app

Many search engines aren't granular enough to look for just "SECRET", but not "secret" nor "Secretary"

wsjsecrets.jpg

Find our own CONFIDENTIAL SECRETs

List the filenames in which the text matches SECRET or CONFIDENTIAL, case-sensitive
ack -l '\b(SECRET|CONFIDENTIAL)\b' txt/
txt/HRCEmail_SeptemberWeb/C05784269.txt
txt/HRCEmail_SeptemberWeb/C05784272.txt
txt/HRCEmail_SeptemberWeb/C05784307.txt
txt/HRCEmail_SeptemberWeb/C05784596.txt
txt/HRCEmail_SeptemberWeb/C05784612.txt
txt/HRCEmail_SeptemberWeb/C05784643.txt
txt/HRCEmail_SeptemberWeb/C05784686.txt
txt/HRCEmail_SeptemberWeb/C05784944.txt
txt/HRCEmail_SeptemberWeb/C05785408.txt
txt/HRCEmail_SeptemberWeb/C05785471.txt
txt/HRCEmail_SeptemberWeb/C05785569.txt
txt/HRCEmail_SeptemberWeb/C05785570.txt
txt/HRCEmail_SeptemberWeb/C05785571.txt
txt/HRCEmail_SeptemberWeb/C05785574.txt
txt/HRCEmail_SeptemberWeb/C05785575.txt

Frequency count of the From: field

...but only for SECRET and CONFIDENTIAL emails
ack -l '\b(SECRET|CONFIDENTIAL)\b' txt/ |
    xargs ack -h -m 1 '^From: +(.+)' --output '$1' |
    sort | uniq -c | sort -n
   9 Verveer, Melanne S <VerveerMS@state.gov >
  10 Abedin, Huma <AbedinH©state.gov>
  10 Jiloty, Lauren C <JilotyLC@state.gov>
  10 Sherman, Wendy R <ShermanWR@state.gov>
  10 Sidney Blumenthal <
  12 Sullivan, Jacob J <Sullivan.11@state.gov>
  14 Hanley, Monica R <HanleyMR@state.gov>
  14 Sullivan, Jacob J <SullivanJJ©state.gov>
  15 Sidney Blumenthal
  16 Sullivan, Jacob J <Sullivanil@state.gov>
  18 Sullivan, Jacob J <Sullivan11@state.gov>
  23 Sullivan, Jacob 1 <SullivanJJ@state.gov>
  35 Sullivan, Jacob J <SullivanJJ@state.gov >
  39 Verveer, Melanne S <VerveerMS@state.gov>
  49 Sullivan, Jacob J <Sullivann@state.gov>
  80 Abedin, Huma <AbedinH@state.gov>
 122 sbwhoeop
 138 Mills, Cheryl D <MillsCD@state.gov>
 257 H <hrod17@clintonemail.com>
 288 H <hrod17@clintonemail.com >
 374 Sullivan, Jacob J <SullivanJJ@state.gov>

Sometimes we just want to find just u

The letter u, by itself, when it is followed another word.

ack -i '\bu \w+' txt/
txt/HRCEmail_SeptemberWeb/C05782117.txt:
69:Subject: Call update - are u ok w this?
txt/HRCEmail_SeptemberWeb/C05782396.txt:
16:Want to be sure u saw
txt/HRCEmail_SeptemberWeb/C05782445.txt:
16:In case u didn't get
txt/HRCEmail_SeptemberWeb/C05782447.txt:
29:In case u didn't get
txt/HRCEmail_SeptemberWeb/C05782533.txt:
16:Hi there. World. Leader. I feel like its 1000 yrs since we've talked. Thinking of u as 9/11 approaches n remembering that
txt/HRCEmail_SeptemberWeb/C05782533.txt:
17:time together the sweet night of 9/10 when u threw. Kay a baby shower. --that horrorific next morn n all the wrenchng
txt/HRCEmail_SeptemberWeb/C05782736.txt:
16:
U have break from 9:45-1
txt/HRCEmail_SeptemberWeb/C05782749.txt:
43:I kind of heard that u were setting up or thinking about a foreign policy advisory council -- can't remember exactly. If you
txt/HRCEmail_SeptemberWeb/C05782808.txt:
40:Hi there. World. Leader. I feel like its 1000 yrs since we've talked. Thinking of u as 9/11 approaches n remembering that
txt/HRCEmail_SeptemberWeb/C05782808.txt:
41:time together the sweet night of 9/10 when u threw. Kay a baby shower. ---that horrorific next morn n all the wrenchng
txt/HRCEmail_SeptemberWeb/C05782878.txt:
15: III get to u rob
txt/HRCEmail_SeptemberWeb/C05782908.txt:
89:interest rates. In his testimony before a parliamentary committee, Governor Ndung'u said other government
txt/HRCEmail_SeptemberWeb/C05783019.txt:
19:Any that u want us to set up, let me know. You have all call sheets. Aliyev would be good to do if you can.
txt/HRCEmail_SeptemberWeb/C05783169.txt:
41:Hrc asked melanne to follow up on the unilever thing. Were u listening? She wants us to offer to help, to give them video etc.
txt/HRCEmail_SeptemberWeb/C05783170.txt:
36:PIR brilliant, yes definitely buy right away. H should own If u can't get .com get .co
txt/HRCEmail_SeptemberWeb/C05783170.txt:
78:I kind of heard that u Were setting up or thinking about a foreign policy advisory council -- can't remember exactly. If you
txt/HRCEmail_SeptemberWeb/C05784072.txt:
16:Ok u know you have had a good run when Alter writes a great piece.
txt/HRCEmail_SeptemberWeb/C05784084.txt:
16:Ok u know you have had a good run when Alter writes a great piece.
txt/HRCEmail_SeptemberWeb/C05784135.txt:
32:Ok u know you have had a good run when Alter writes a great piece.
txt/HRCEmail_SeptemberWeb/C05785461.txt:
20:Oscar, can u print this call sheet for hrc. For gary ackerman. Thanks.

Dealing with Big (for Excel) data as text

The Federal Election Commissions Individual Donors list

fec-database-list.png

Up to 1.3 million records for 2015-2016 alone

Trying to import the individual donors data into Excel

excel-import-pipes.png

But Excel can't handle more than a million rows.

excel-import-pipes-FAIL.png

Filtering individual contributors by (self-proclaimed) job title

fec-indiv-schema.png

Looking for just the lawyers who contribute

ack 'LAWYER|ATTORNEY' indiv-contributions.txt
  

Looking for campaign contributors who are lawyers

Filter for lawyers, change the pipes to commas, then dump into a text file that Excel can read.
ack 'LAWYER|ATTORNEY' indiv-contributions.txt |
       csvformat -d '|' > lawyers.csv
  • Search for LAWYER or ATTORNEY
  • Convert the pipe-delimiters to commas
  • Dump the text into lawyers.csv

Result: 89,034 rows (small enough for Excel).

fec-lawyers-excel.png

Fast and fun data cleaning

Do you know how to spell "Masha...cket Pequot"?

  • Search for MASH
  • followed by one-or-more uppercase letters
  • followed by CKET
ack 'MASH[A-Z]+CKET' indiv-contributions.txt
C00499947|A|Q2|P|201507229000355836|11|IND|TRIBAL NATION, MASHANTUCKET PEQUOT|MASHANTUCKET|CT|063383008|MASHANTUCKET PEQUOT TRIBAL NATION|INDIAN TRIBE|06292015|2700||C10408456|1017371|||4072320151247723686
C00350520|N|Q2|P|201507159000168044|11|ORG|MASHANTUCKET PEQUOT TRIBAL NATION|MASHANTUCKET|CT|06338|||06292015|1000||INCA57787|1015196|||4071520151247122864
C00459925|N|MY|P|201507309000462796|11|ORG|MASHANTUCKET PEQUOT TRIBAL NATION|MASHANTUCKET|CT|063383008|||06302015|400||C9365215|1019170|||4073120151248086910
C00505552|N|Q2|P|201507150200189449|11||MASHANTUCKET PEQUOT TRIBAL NATION|MASHANTUCKET|CT|063383008|||05132015|2700||SA072715944929|1018152|||2072820151247977899
C00167320|N|M8|P|201508209000867724|11|ORG|MASHANTUCKET PEQUOT TRIBAL NATION|MASHANTUCKET|CT|06338|||07102015|1185||C21496835|1022541|||4082020151249343429   

Does the Mashantucket Pequot know how to spell "Mashantucket"?

  • Search for FU
  • followed by any word character
  • followed by a K character
ack -o 'MASH[A-Z]+CKET' indiv-contributions.txt |
  sort | uniq -c
   2 MASHANFUCKET
  56 MASHANTUCKET

Just how many F*CKs have been given to campaigns in 2015-2016?

  • Search for FU
  • followed by any word character
  • followed by a K character
ack -o 'FU\wK' indiv-contributions.txt | sort | uniq -c
   5 FUCK
  52 FULK
 207 FUNK
   4 FURK
  10 FUSK

And who exactly are these F*CKs?

Let's see some context around the word (but not the entire line)
  • Any 25 characters
  • Then, the sequence FUCK
  • Then, the next 25 characters
ack -o '.{25}FUCK.{25}' indiv-contributions.txt
2010200020023|21Y||MASHANFUCKET PEQUOT TRIBAL NATION|M
2010200020023|21Y||MASHANFUCKET PEQUOT TRIBAL NATION|M
420|SELF EMPLOYED MOTHER FUCKER|ELECTRICIAN|12102015|5
481769315|HEARST|BAD-ASS FUCKING PROFESSIONAL CHILD|12
180|SELF EMPLOYED MOTHER FUCKER|ELECTRICIAN|12102015|5

Not every important data story requires analysis

Think about why we put up with spreadsheets and databases in the first place? Putting data into Excel is not the end goal.

Nothing wrong with PDFs when you know exactly what you're looking for

(...and/or need to print it out)

astrazeneca.pdf.png

Do you really need data?

Result of running pdftotext -layout on the PDF file:

astrazeneca.txt.png

...Or do you need just the biggest outlier, and to find it the fastest?

astrazeneca-bigmoney.txt.png

Find the biggest dollar amount

No calculator/spreadsheet/database needed: whatever string of text that starts with a $, has at least 2 or more digits before a comma, and then has 3 more digits. And let's throw in the state abbreviation for good measure:
ack -o '[A-Z]{2} +\$\d{2,3},\d{3}' payments.txt