I have been working on a priority email inbox written in Python, with the ultimate aim of using a machine learning algorithm to label (or classify) a selection of emails as either important or unimportant. I will begin with some background information and then move on to my question.

So far I have developed code that extracts data from each email and processes it to identify the most important emails. This is done using the following email features:

Sender's Address Frequency
Thread Activity
Date Received (time between replies)
Common Words in body/subject

The code I currently have assigns a rank (or weighting) between 0.1 and 1 to each email based on its importance, and then applies a label of either 'important' or 'unimportant' (here simply 1 or 0). An email is given priority status if its rank is >0.5. This data is stored in a CSV file (as below).
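To make the labelling rule concrete, here is a minimal sketch of the thresholding step. The function name `priority_label` and the constant `PRIORITY_THRESHOLD` are made up for illustration; only the rule itself (priority if rank > 0.5) comes from the description above, and the ranks are the four sample values from the CSV excerpt.

```python
# Hypothetical names; only the "rank > 0.5" rule comes from the post.
PRIORITY_THRESHOLD = 0.5


def priority_label(rank: float, threshold: float = PRIORITY_THRESHOLD) -> int:
    """Return 1 (important) if rank is strictly above the threshold, else 0."""
    return 1 if rank > threshold else 0


# The four sample ranks from the CSV excerpt.
ranks = [0.67, 0.21, 0.9, 0.48]
labels = [priority_label(r) for r in ranks]
print(labels)  # [1, 0, 1, 0]
```

Note that a rank of exactly 0.5 is treated as unimportant here, because the comparison is strict.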

From           Subject     Body        Date        Rank  Priority
test@test.com  HelloWorld  Body Words  10/10/2012  0.67  1
rest@test.com  ByeWorld    Body Words  10/10/2012  0.21  0
best@test.com  SayWorld    Body Words  10/10/2012  0.9   1
just@test.com  HeyWorld    Body Words  10/10/2012  0.48  0
etc ...
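As a rough sketch of how such a file could be read back into Python, the snippet below parses the sample rows above with the standard `csv` module. The comma delimiter, the presence of a header row, and the helper name `load_emails` are assumptions for illustration; the actual file layout may differ.

```python
import csv
import io

# Assumed layout: comma-separated with a header row.
# The rows are the sample rows from the excerpt above.
SAMPLE_CSV = """From,Subject,Body,Date,Rank,Priority
test@test.com,HelloWorld,Body Words,10/10/2012,0.67,1
rest@test.com,ByeWorld,Body Words,10/10/2012,0.21,0
best@test.com,SayWorld,Body Words,10/10/2012,0.9,1
just@test.com,HeyWorld,Body Words,10/10/2012,0.48,0
"""


def load_emails(handle):
    """Parse rows into dicts, converting Rank to float and Priority to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["Rank"] = float(row["Rank"])
        row["Priority"] = int(row["Priority"])
        rows.append(row)
    return rows


# With a real file this would be: load_emails(open(path, newline=""))
emails = load_emails(io.StringIO(SAMPLE_CSV))

# Combine subject and body into one text per email; keep the labels separately.
texts = [f'{e["Subject"]} {e["Body"]}' for e in emails]
labels = [e["Priority"] for e in emails]
print(texts[0], labels)
```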

I have two sets of email data (one for training, one for testing). The above applies to my training email data. I am now attempting to train a learning algorithm so that I can predict the importance of the testing data.

To do this I have been looking at both scikit-learn and NLTK. However, I am having trouble applying what I have learnt from the tutorials to my project. I have no particular requirements regarding which learning algorithm is used. Is this as simple as applying the following? And if so, how?

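from sklearn.svm import LinearSVC

# Training features and labels (email.data / email.target are placeholders)
X, y = email.data, email.target

clf = LinearSVC()
clf = clf.fit(X, y)

# Placeholder for the features extracted from the testing emails
X_new = [Testing Email Data]

clf.predict(X_new)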
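For reference, here is a minimal sketch of what a runnable version of this idea might look like with scikit-learn. It assumes the subject and body are joined into one text string per email and turned into numeric features with `TfidfVectorizer`, since `LinearSVC` cannot be fitted on raw strings. The texts and labels are made-up toy data standing in for the real training and testing CSVs, so this is one possible approach under those assumptions, not a definitive one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data standing in for the training CSV
# (subject + body text, with the 1/0 priority label).
train_texts = [
    "project deadline moved to friday please confirm",
    "weekly newsletter and special offers",
    "urgent server outage needs your reply",
    "discount sale ends this weekend",
]
train_labels = [1, 0, 1, 0]

# Made-up stand-ins for the testing emails.
test_texts = [
    "please confirm the new project deadline",
    "special weekend discount offers",
]

# The pipeline vectorises the text first, then fits the linear SVM on the result.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

# One predicted label (1 or 0) per testing email.
predictions = clf.predict(test_texts)
print(predictions)
```

The Rank column is deliberately left out of the features in this sketch: Priority is derived directly from Rank (> 0.5), so including it would make the prediction trivial, and the testing emails may not have a rank yet.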

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)