Blueprint for an SEC Filings Data Pipeline (ETL)

A popular source of information is the set of filings held in the EDGAR database. EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system, performs automated collection, validation, indexing, acceptance, and forwarding of submissions by companies and others who are required by law to file forms with the U.S. Securities and Exchange Commission (SEC) (U.S. Securities and Exchange Commission 2007).

Companies comply with U.S. securities laws principally through the completion and submission (i.e., filing) of standardized forms issued by the SEC. There are more than 50 different types of SEC forms; some of the more important ones include:

  • Forms 10-K, 20-F, and 40-F: These are forms that companies are required to file annually. Form 10-K is for U.S. registrants, Form 40-F is for certain Canadian registrants, and Form 20-F is for all other non-U.S. registrants. These forms require a comprehensive overview, including information concerning a company’s business, financial disclosures, legal proceedings, and information related to management.
  • Forms 10-Q and 6-K: These are forms that companies are required to submit for interim periods (quarterly for U.S. companies on Form 10-Q, semiannually for many non-U.S. companies on Form 6-K).
  • Form 8-K: Companies must file this form with the SEC to announce major events such as acquisitions or disposals of corporate assets, changes in securities and trading markets, matters related to accountants and financial statements, corporate governance and management changes, and Regulation FD disclosures.

Steps involved in building a data pipeline:

  • 1) Query list of latest company filings available on SEC website
    • determine which filings are new by pulling identifying fields and comparing them against the database
    • technology: python, feedparser, redis?
  • 2) Acquiring and storing full document
    • process: add to download queue, determine download location, name, (priority of document)
    • technology: python, celery, rabbitmq, redis, postgresql/sql server
  • 3) Processing document and storing processed data
    • format downloaded data for analysis/storage (XBRL/XML parsing), then compute growth rates, metrics, etc.
    • technology: python, redis
  • 4) Analyzing new data
    • determine what was reported vs. expectations, and whether the new data changes future expectations
    • technology: visualization (tableau), matlab, neo4j/graph?, excel
  • 5) Take action on new data
    • trade
    • technology: Interactive Brokers, Bloomberg
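
Step 1 comes down to comparing identifying fields from the latest filings list against what is already stored. A minimal sketch of that comparison follows, assuming each filing can be identified by its accession number. The FilingEntry schema, the find_new_filings helper, and the sample data are all hypothetical, and a plain Python set stands in for Redis or a database table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilingEntry:
    """Fields pulled from one entry of the filings list (hypothetical schema)."""
    accession_number: str  # assumed to uniquely identify a filing
    form_type: str         # e.g. "10-K", "8-K"
    company: str
    url: str


def find_new_filings(entries, seen_ids):
    """Return the entries whose accession number is not yet in seen_ids.

    seen_ids is updated in place so that a repeated poll does not
    return the same filing twice. A plain set stands in for a Redis
    set or a database lookup here.
    """
    new = []
    for entry in entries:
        if entry.accession_number not in seen_ids:
            seen_ids.add(entry.accession_number)
            new.append(entry)
    return new


# Example poll with made-up data; one filing is already known.
seen = {"0000000000-24-000001"}
feed = [
    FilingEntry("0000000000-24-000001", "10-Q", "Example Corp",
                "https://example.invalid/a"),
    FilingEntry("0000000000-24-000002", "8-K", "Example Corp",
                "https://example.invalid/b"),
]
fresh = find_new_filings(feed, seen)
print([e.accession_number for e in fresh])
```

In a real pipeline the entries would come from parsing the feed (for example with feedparser), and only the entries returned here would move on to the download step.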

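The download queue in step 2 needs a priority and a storage location for each document. The sketch below shows one way this might look using only the standard library. The priority values, the DownloadQueue class, and the local_path naming scheme are assumptions for illustration; in production a task queue such as celery with rabbitmq or redis would likely take the place of the in-memory heap.

```python
import heapq
import itertools

# Hypothetical priorities: a lower number is downloaded first.
FORM_PRIORITY = {"8-K": 0, "10-Q": 1, "10-K": 1}
DEFAULT_PRIORITY = 5


class DownloadQueue:
    """In-memory priority queue of documents waiting to be downloaded."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal priorities stay first-in, first-out.
        self._counter = itertools.count()

    def add(self, form_type, url):
        priority = FORM_PRIORITY.get(form_type, DEFAULT_PRIORITY)
        heapq.heappush(
            self._heap, (priority, next(self._counter), form_type, url)
        )

    def pop(self):
        """Remove and return the (form_type, url) with the highest priority."""
        _priority, _order, form_type, url = heapq.heappop(self._heap)
        return form_type, url

    def __len__(self):
        return len(self._heap)


def local_path(company_id, form_type, accession_number):
    """Build a storage path for a downloaded filing (assumed layout)."""
    return f"filings/{company_id}/{form_type}/{accession_number}.txt"


queue = DownloadQueue()
queue.add("10-K", "https://example.invalid/annual")
queue.add("6-K", "https://example.invalid/interim")
queue.add("8-K", "https://example.invalid/event")
first = queue.pop()  # the 8-K, since it has the lowest priority number
```

Giving 8-K filings the top priority is only one possible choice; it reflects that they announce major events, which may matter most for step 5.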

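For step 3, once values have been parsed out of the filings, growth rates are a simple period-over-period calculation. The following is a rough sketch, assuming the parsed values arrive as a chronologically ordered list of numbers; the growth_rates helper and the revenue figures are made up for illustration.

```python
def growth_rates(values):
    """Return period-over-period growth for a chronological series.

    Each rate is (current - previous) / |previous|. A rate is None when
    either value is missing or the previous value is zero, since growth
    is undefined in those cases.
    """
    rates = []
    for prev, curr in zip(values, values[1:]):
        if prev is None or curr is None or prev == 0:
            rates.append(None)
        else:
            rates.append((curr - prev) / abs(prev))
    return rates


# Made-up quarterly revenue figures, oldest first.
revenue = [100.0, 110.0, 99.0]
rates = growth_rates(revenue)
```

Dividing by the absolute value of the previous figure keeps the sign of the rate meaningful when the series contains negative numbers, such as net losses.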