Normalizing LinkedIn Industry Classification

Since research is such an important part of the investment process, I'm always looking for ways to improve it. One of the problems is that there is so much data, both historical and being released at any given time, that just trying to get a grip on the firehose can seem pointless, to say the least. While I'm by no means close to getting a grip on the firehose, I've determined that getting to that point depends heavily on how one classifies the data, and classification has therefore become an interest of mine over the past few years.
That being said, I came across LinkedIn's Industry Classifications in some HTML code and figured I'd show how to use Python to get and store the data.

This is the data:

'['"></', '">Accounting</', '">Alternative Dispute Resolution</', '">Alternative Medicine</', '">Animation</', '">Automotive</', '">Banking</', '">Biotechnology</', '">Broadcast Media</', '">Building Materials</', '">Capital Markets</', '">Chemicals</', '">Civil Engineering</', '">Commercial Real Estate</', '">Computer Games</', '">Computer Hardware</', '">Computer Networking</', '">Computer Software</', '">Construction</', '">Consumer Electronics</', '">Consumer Goods</', '">Consumer Services</', '">Cosmetics</', '">Dairy</', '">Design</', '">Education Management</', '">Entertainment</', '">Environmental Services</', '">Events Services</', '">Executive Office</', '">Facilities Services</', '">Farming</', '">Financial Services</', '">Fine Art</', '">Fishery</', '">Food Production</', '">Fundraising</', '">Furniture</', '">Government Administration</', '">Government Relations</', '">Graphic Design</', '">Higher Education</', '">Hospitality</', '">Human Resources</', '">Industrial Automation</', '">Information Services</', '">Insurance</', '">International Affairs</', '">Internet</', '">Investment Management</', '">Judiciary</', '">Law Enforcement</', '">Law Practice</', '">Legal Services</', '">Legislative Office</', '">Libraries</', '">Machinery</', '">Management Consulting</', '">Maritime</', '">Market Research</', '">Mechanical or Industrial Engineering</', '">Media Production</', '">Medical Device</', '">Medical Practice</', '">Mental Health Care</', '">Military</', '">Music</', '">Nanotechnology</', '">Newspapers</', '">Nonprofit Organization Management</', '">Online Publishing</', '">Performing Arts</', '">Pharmaceuticals</', '">Philanthropy</', '">Photography</', '">Plastics</', '">Political Organization</', '">Printing</', '">Professional Training</', '">Program Development</', '">Public Policy</', '">Public Relations</', '">Public Safety</', '">Publishing</', '">Railroad Manufacture</', '">Ranching</', '">Real Estate</', '">Religious Institutions</', '">Research</', '">Restaurants</', '">Retail</', '">Semiconductors</', '">Shipbuilding</', '">Sporting Goods</', '">Sports</', '">Supermarkets</', '">Telecommunications</', '">Textiles</', '">Think Tanks</', '">Tobacco</', '">Utilities</', '">Venture Capital</', '">Veterinary</', '">Warehousing</', '">Wholesale</', '">Wireless</']'

import pprint
import re

# readline() reads only the first line, so the raw list is expected on one line
with open(r'C:\projects\Linkedin\raw_linkedin_industry_classification.txt') as f:
    line = f.readline()

pattern = re.compile(r'[^\w,+]')  # match anything that is not a word character, comma, or plus sign
nline = pattern.sub(" ", line)  # substitute matches with spaces
items = nline.split(',')  # split into a list on commas
stripped_list = [j.strip() for j in items]  # strip leading and trailing whitespace

# open file for writing cleaned data
with open(r'C:\projects\Linkedin\linkedin_industry_classification_.csv', 'w') as csvfile:
    # pformat() breaks a long list across lines, one item per line, as a Python list literal
    csvfile.write(pprint.pformat(stripped_list) + '\n')
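
As a variation on the script above, here is a minimal sketch that captures the text between `>` and `</` directly and writes a one-column CSV with an "Industry" header, one name per row. The helper names `extract_industries` and `to_csv_text` are hypothetical, and the short `raw` string is only a sample in the same shape as the raw data shown earlier, not the full file.

```python
import csv
import io
import re

# Hypothetical sample in the same shape as the raw text; not the full list.
raw = """['"></', '">Accounting</', '">Alternative Dispute Resolution</', '">Animation</']"""


def extract_industries(raw_line):
    """Return the names found between '>' and '</' in raw_line."""
    names = re.findall(r'>([^<]*)</', raw_line)
    # Strip whitespace and drop the empty placeholder entry.
    return [name.strip() for name in names if name.strip()]


def to_csv_text(names, header="Industry"):
    """Render the names as a one-column CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    writer.writerows([name] for name in names)
    return buffer.getvalue()


industries = extract_industries(raw)
csv_text = to_csv_text(industries)
print(csv_text)
```

Using the `csv` module rather than `pprint` keeps brackets and quotes out of the file. To write to disk instead of a string, the same writer can be pointed at a file opened with `newline=''`.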


data-text.csv:
Industry
Accounting
Alternative Dispute Resolution
Alternative Medicine
Animation
Automotive
Banking
Biotechnology
Broadcast Media
Building Materials
Capital Markets
Chemicals
Civil Engineering
Commercial Real Estate
Computer Games
Computer Hardware
Computer Networking
Computer Software
Construction
Consumer Electronics
Consumer Goods
Consumer Services
Cosmetics
Dairy
Design
Education Management
Entertainment
Environmental Services
Events Services
Executive Office
Facilities Services
Farming
Financial Services
Fine Art
Fishery
Food Production
Fundraising
Furniture
Government Administration
Government Relations
Graphic Design
Higher Education
Hospitality
Human Resources
Industrial Automation
Information Services
Insurance
International Affairs
Internet
Investment Management
Judiciary
Law Enforcement
Law Practice
Legal Services
Legislative Office
Libraries
Machinery
Management Consulting
Maritime
Market Research
Mechanical or Industrial Engineering
Media Production
Medical Device
Medical Practice
Mental Health Care
Military
Music
Nanotechnology
Newspapers
Nonprofit Organization Management
Online Publishing
Performing Arts
Pharmaceuticals
Philanthropy
Photography
Plastics
Political Organization
Printing
Professional Training
Program Development
Public Policy
Public Relations
Public Safety
Publishing
Railroad Manufacture
Ranching
Real Estate
Religious Institutions
Research
Restaurants
Retail
Semiconductors
Shipbuilding
Sporting Goods
Sports
Supermarkets
Telecommunications
Textiles
Think Tanks
Tobacco
Utilities
Venture Capital
Veterinary
Warehousing
Wholesale
Wireless
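
Once the list is stored, one possible use is checking free-text labels against it. The sketch below is only an illustration under stated assumptions: the `load_industries` and `normalize` helpers are hypothetical, and the inline `csv_text` is a short excerpt standing in for the file on disk.

```python
import csv
import io

# Hypothetical excerpt of the cleaned file; in practice it would be read from disk.
csv_text = "Industry\nAccounting\nBanking\nCapital Markets\nVenture Capital\n"


def load_industries(handle):
    """Map a case-folded industry name to its canonical spelling."""
    reader = csv.DictReader(handle)
    return {row["Industry"].casefold(): row["Industry"] for row in reader}


def normalize(label, lookup):
    """Return the canonical industry for label, or None if it is not in the list."""
    key = " ".join(label.split()).casefold()  # collapse whitespace, ignore case
    return lookup.get(key)


lookup = load_industries(io.StringIO(csv_text))
print(normalize("  venture   CAPITAL ", lookup))
print(normalize("Crypto", lookup))
```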
