Correlation between stock performance and financial ratios for S&P500 companies

2022-02-26
// A brief summary of my bachelors thesis i wrote ~ 10 months ago
// Full results and code can be found here -> github.com/tompston/SP500_correlation

There are some places that write these "This financial ratio is a great predictor of stock growth" type of articles. This always seemed like clickbait to me. So i wanted to test if there is some actual validity to these claims.


So how can we test that claim? Scrape financial statement data, combine it and run it through correlation calculations.

Data gathereing

I decided to scrape the companies that were included in the S&P500 index. Why? Because there is a high enough probability that, if you're an investor, you would be interested in a company that is listed there. Thus, figuring out if there's any correlation between stock increase and a financial ratio for this group would be beneficial for some investors.


For data gathering, i used the Financial Modeling Prep API. During research for an API to use, it seemed like one of the best options (not sponsored). In total 8 endpoints were scraped for each stock ticker included in the index. These include:


  1. income-statement
  2. balance-sheet-statement
  3. cash-flow-statement
  4. ratios
  5. financial-growth
  6. income-statement-growth
  7. key-metrics
  8. enterprise-values

What's cool about Financial Modeling Prep was that they provided endpoints that hold most of the financial ratios already calculated. This means that we don't need to worry about doing that ourselves.


The free version of the API provides data for the last 4 years. This means that if we scrape 500 companies x 4 years, we would have 2000 years as a reference for calculations.


The API accepts the ticker as an url param, so all we need to do to gather the data is:

  1. Get the list of tickers inluded in the index ( scraped from Wikipedia )
  2. Convert that scraped list into an array
  3. Loop over the array, passing the ticker as an url param and save all of the fetced data as a single csv file for each endpoint

Once that is done, we have 8 seperate csv files that hold the scraped data. How do we combine it? We can create a unique ID for the row that then could be joined together. What can we use to create that unique ID? All of the scraped csv files always have 2 columns -> ticker name + date. Combine these two columns together and voila, we have a unique ID.

Stock increase percentage

Luckily for me, the enterprise-value endpoint also had info about the stock price on the date when the financial reports were published. This means that we can use it to calculate the annual stock increase percentage easily.

Results

After combining all of the scraped data together we have a csv file of 2484 rows (years) and 327 columns (financial variables). Note that there are columns with repeated information, so in reality we have a lot less columns that will be used in correlation calculations. For example, the Pearsons correlation results have 275 rows of financial variables and a lot of the rows hold duplicate financial variables.

Correlation results


  • Statistics for the calculated annual stock incrase percentage

    enterprise_value.annual_stock_increase
    count 1983
    mean 0.137970869
    std 0.287600395
    min -0.724216618
    25% -0.051685853
    50% 0.116157779
    75% 0.291180917
    max 2.098626046

  • Pearsons correlation

    enterprise_value.annual_stock_increase
    enterprise_value.annual_stock_increase 1
    key_metrics.stockBasedCompensationToRevenue 0.189647428
    financial_growth.revenueGrowth 0.157458114
    enterprise_value.stockPrice 0.155575331
    income_statement_growht.growthRevenue 0.146234174

  • Spearmans correlation
    enterprise_value.annual_stock_increase
    enterprise_value.annual_stock_increase 1
    key_metrics.enterpriseValueOverEBITDA 0.360546333
    ratio.enterpriseValueMultiple 0.360402587
    key_metrics.pocfratio 0.351598389
    ratio.priceCashFlowRatio 0.351425033

  • Ordinary Least Sqaures regression

    dependent_variable coef std err t P>|t| [0.025 0.975] R_SQ F_STAT N
    ratio.priceToOperatingCashFlowsRatio 10.7486 0.722 14.892 0 9.333 12.164 0.123 221.8 1589
    ratio.priceCashFlowRatio 10.7486 0.722 14.892 0 9.333 12.164 0.123 221.8 1589
    key_metrics.pocfratio 10.7517 0.722 14.892 0 9.336 12.168 0.123 221.8 1588
    ratio.dividendYield -0.0191 0.001 -12.774 0 -0.022 -0.016 0.108 163.2 1351
    key_metrics.dividendYield -0.0191 0.001 -12.766 0 -0.022 -0.016 0.108 163 1350
  • Plots for financial variables with the highes OLS R Squared values

correlation graphs