If you have been following this series of posts looking at emails on financial service products that mostly went into my spam folder and wondered how I have been doing it, here’s the answer.
The download: This one is very easy. Gmail automatically moves what it considers spam into a spam folder (which is actually a label). Go to Google Takeout and follow these steps, selecting the option to download emails labeled ‘spam’. Voilà, within minutes you get to download all those emails as one monolithic MBOX file.
The extraction: I use a tool called MBOX converter, which gives a nice option to save the headers in CSV format (among many others). The only problem is that the trial version only allows you to convert 25 emails. There are tons of Python and Perl scripts online that do the same job. The other issue is that you lose some details like SPF, DKIM and DMARC, which can be viewed and extracted if you look at the original message.
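For anyone who would rather script this step, here is a minimal sketch using Python's standard `mailbox` and `csv` modules. It is not the tool I used, just one way to get a similar result. The function name `mbox_headers_to_csv` and the choice of columns are my own assumptions, and so is the idea that the SPF, DKIM and DMARC verdicts sit in the `Authentication-Results` header, which is where receiving servers usually record them.

```python
import csv
import mailbox

# Headers to export. Authentication-Results is assumed to be where the
# receiving server recorded its SPF, DKIM and DMARC verdicts.
FIELDS = ["From", "To", "Subject", "Date", "Authentication-Results"]


def mbox_headers_to_csv(mbox_path, csv_path, fields=FIELDS):
    """Write one CSV row per message with the chosen headers; return the message count."""
    box = mailbox.mbox(mbox_path)
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        for message in box:
            row = []
            for name in fields:
                # A header can appear more than once, so collect every value.
                values = message.get_all(name, [])
                # Collapse folded (multi-line) headers into single-spaced text.
                cleaned = [" ".join(str(value).split()) for value in values]
                row.append(" | ".join(cleaned))
            writer.writerow(row)
            count += 1
    box.close()
    return count
```

Calling `mbox_headers_to_csv("spam.mbox", "headers.csv")` (both file names are placeholders) should leave you with a CSV that opens in any spreadsheet. Headers that are missing from a message come out as empty cells.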
The visualization: I have been using Wordle for a few years now. It is a neat little tool that generates word clouds from whatever text you provide. All I have to do is dump the headers in and, once I get the cloud, tweak the font, layout, and color to what I want.
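A word cloud is essentially a picture of word counts, so it can help to preview the counts before pasting anything into Wordle. The snippet below is a rough sketch of that idea and not how Wordle works internally. The `word_frequencies` function, the tiny stop-word list, and the sample subject lines are all made up for illustration.

```python
import re
from collections import Counter

# A deliberately small stop-word list (an assumption; extend it as needed).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "for", "in", "on",
             "your", "you", "is", "with"}


def word_frequencies(lines, top=20):
    """Return the `top` most common words across `lines` as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # Lowercase, then keep runs of letters and apostrophes as words.
        for word in re.findall(r"[a-z']+", line.lower()):
            if len(word) > 1 and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top)


# Hypothetical subject lines, standing in for a column of the header dump.
subjects = ["Get a loan today", "Loan offer for you", "Credit card offer"]
for word, count in word_frequencies(subjects, top=5):
    print(word, count)
```

The words with the highest counts are the ones that should come out largest in the cloud.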
The conversion of mail body: The analysis so far has been limited to the headers, as they can be easily extracted as text. I also want to analyse the body of these emails. The problem is that most of these emails are images, with no text to extract. This is where I want to put machine learning to use. So far I have tried Computer Vision from Microsoft, and it works to some extent in extracting text from these images. The other one I have been fiddling around with is the Vision API from Google. The big issue is that I now have 240 (+ another 200) images of full emails, so I will have to write a program to process them rather than do it by hand. More on this some other time.