i have a lot of raw data, but i need to categorize it before it is useful.
i think i want to go with just three dimensions of categorization:
- tech subindustry
- gender
- positivity
tech categorization
tech categorization is going to be really imperfect, but i made a bag of words based on my 4000 twitter follows:
jslist = ['react', 'webpack', ' js', 'javascript', 'frontend', 'front-end', 'underscore', 'entscheidungsproblem', 'meteor']
osslist = [' oss', 'open source','maintainer']
designlist = ['css', 'designer', 'designing']
devlist = [' dev','web dev', 'webdev', 'code', 'coding', 'eng', 'software', 'full-stack', 'fullstack', 'backend', 'devops', 'graphql', 'programming', 'computer', 'scien']
makerlist = ['entrepreneur', 'hacker', 'maker', 'founder', 'internet', 'web']
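def categorize(x):
    # lowercase the bio so keyword matching is case-insensitive
    bio = str(x).lower()  # python 3; the python 2 version used unicode(x)
    # checks run in order, so the first matching list wins
    if any(s in bio for s in jslist):
        return 'js'
    elif any(s in bio for s in osslist):
        return 'oss'
    elif any(s in bio for s in designlist):
        return 'design'
    elif any(s in bio for s in devlist):
        return 'dev'
    elif any(s in bio for s in makerlist):
        return 'maker'
    else:
        return ''

# assuming cleanedfinal is a pandas DataFrame, Series.map applies categorize to every bio
cleanedfinal['cat'] = cleanedfinal['bios'].map(categorize)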
print(len(cleanedfinal[cleanedfinal['cat'] == 'maker'])) # 573
print(len(cleanedfinal[cleanedfinal['cat'] == 'design'])) # 136
print(len(cleanedfinal[cleanedfinal['cat'] == 'oss'])) # 53
print(len(cleanedfinal[cleanedfinal['cat'] == 'js'])) # 355
print(len(cleanedfinal[cleanedfinal['cat'] == 'dev'])) # 758
this gives me 1875 categorized twitter accounts.
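a minimal sketch of the same first-match idea, written so the category order is data instead of an if/elif chain. the keyword lists are shortened and the sample bios are made up for illustration; RULES, sample_bios and counts are hypothetical names, not part of my actual script.

```python
from collections import Counter

# Ordered (label, keywords) pairs: earlier entries win when a bio matches several lists.
# Keyword lists are shortened here for illustration.
RULES = [
    ('js', ['react', 'webpack', ' js', 'javascript']),
    ('oss', [' oss', 'open source', 'maintainer']),
    ('design', ['css', 'designer', 'designing']),
    ('dev', [' dev', 'software', 'backend']),
    ('maker', ['founder', 'maker', 'web']),
]


def categorize(bio, rules=RULES):
    """Return the label of the first rule with a keyword found in the bio, else ''."""
    text = str(bio).lower()
    for label, keywords in rules:
        # plain substring matching, same as the original lists
        if any(k in text for k in keywords):
            return label
    return ''


# made-up bios, only to show the precedence behaviour
sample_bios = [
    'React and open source maintainer',
    'Product designer',
    'Founder of a web startup',
    'I like dogs',
]

# tally how many bios land in each category ('' means uncategorized)
counts = Counter(categorize(b) for b in sample_bios)
print(counts)
```

the first bio mentions both react and open source, and it lands in 'js' because that rule comes first. reordering RULES is the only change needed to shift that precedence.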
gender
- https://github.com/tue-mdse/genderComputer # this did not work, filed an issue
- https://github.com/muatik/genderizer # this works!!!!
- https://www.kaggle.com/crowdflower/twitter-user-gender-classification # this is nice raw data for ML in future
positivity
using textblob: https://pypi.python.org/pypi/textblob
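if you hit the missing tokenizers/punkt resource error, you need to download the nltk punkt data first with nltk.download('punkt'): https://stackoverflow.com/questions/26570944/resource-utokenizers-punkt-english-pickle-not-found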
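a rough sketch of how a textblob polarity score could become a positivity category. TextBlob(text).sentiment.polarity is textblob's own property and returns a float between -1.0 and 1.0; the positivity_label helper and its 0.1 threshold are my assumptions for illustration, not something textblob provides.

```python
def polarity(text):
    """Polarity in [-1.0, 1.0] from TextBlob's default sentiment analyzer."""
    # third-party dependency (pip install textblob), imported lazily
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def positivity_label(score, threshold=0.1):
    """Bucket a polarity score into three labels.

    The threshold is an arbitrary assumption; tune it against real data.
    """
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'
```

usage would look like positivity_label(polarity(bio)) for each bio, which gives a third categorical column alongside the tech category.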