New York State Powerball Data
The New York Powerball data comes from data.gov and has been accumulated since 2010. Three columns make up the data: the date, the winning numbers, and a multiplier.
One Powerball number is made up of six parts, or six numbers. As the data appears in the file, each of the six numbers that make up one whole number can be a single or a double digit; the file displays the number 6 as 6, not as 06.
The following will be done to demonstrate the machine learning process:
Explore the data set
Analyze Frequencies
Create a regression machine learning model
Create an XGBoost machine learning model
For both models we will input high-frequency and low-frequency numbers to estimate the probability of the selected number being a winning number
For the XGBoost model we will have the model output the top 5 highest-probability numbers for the New York State Powerball.
The intent is to play with the data set and explore what machine learning models can do. This is for fun; do not take the outputs seriously in terms of expecting to win anything in the New York State Lottery. If anyone decides to try winning with any of the outcomes in this article, it is at their own financial risk, and this article does not claim that you will win if you play any of these numbers.
library(dplyr)
library(lubridate)
library(tidyverse)
library(ggplot2)
library(transformr)
library(caTools)
library(progress)
library(e1071)
library(xgboost)
library(caret)
Explore data
The original data came in three columns. It was downloaded as a CSV file from the data.gov website and imported into an Excel file. In Excel the multiplier column was dropped and the single winning-number column was expanded into six columns, one for each segment that makes up a whole lottery number. The file was saved as a CSV and loaded into RStudio.
In RStudio the decision was made to drop the date column and work only with the six columns that make up a whole winning number. The date column was dropped for simplicity and to keep the machine learning models simple.
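As a sketch of that prep step (the file name powerball.csv is an assumption; the six column names match the ones used throughout this article):
#load the expanded CSV; the file name here is an assumption
lottery <- read.csv("powerball.csv", stringsAsFactors = FALSE)
#keep only the six number columns, dropping the date
lottery <- lottery[, c("first_set", "second_set", "third_set",
                       "fourth_set", "fifth_set", "sixth_set")]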
The summary function shows the minimum, maximum, 1st and 3rd quartiles, and the median and mean for each column. The numbers are arbitrary and randomly drawn; they are not intended to add up to anything or to average anything. The summary function does let us see the range of each column, which gives us an idea of which numbers are selected more often.
For example, the first and sixth sets have the smallest means. The range is greater for the first set, since its min is 1 and its max is 52, whereas the sixth set runs from 1 to 39, yet the mean is higher for the sixth set than for the first. It's possible that the lower mean for the first set is due to smaller numbers being drawn more frequently there than in the sixth set.
#inspect the structure and summary statistics
str(lottery)
summary(lottery)
WHAT IS THE FREQUENCY OF EACH NUMBER DRAWN IN EACH SET?
The table function gives the frequency of the numbers selected by New York Powerball.
The code provides the frequency for the first column. The range of the winning numbers in the first set is from 1 to 52. Number 1 has been part of a winning number 110 times; number 52 has been part of a winning number once since 2010.
The table function and the code will be duplicated for the other columns.
#extract frequency with table function
frequency1 <- table(lottery$first_set)
print(frequency1)
Frequency of 2nd Set
Frequency of 3rd Set
Frequency of 4th Set
Frequency of 5th Set
Frequency of 6th Set
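Each of those sections repeats the pattern above. For reference, and because these frequency variables are reused later when sorting, the duplicated code looks like this:
#extract frequency for the remaining sets with the table function
frequency2 <- table(lottery$second_set)
frequency3 <- table(lottery$third_set)
frequency4 <- table(lottery$fourth_set)
frequency5 <- table(lottery$fifth_set)
frequency6 <- table(lottery$sixth_set)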
Examine Frequency Values
To get the range of each column, you can go through the columns, extract the values with the range function, and store them in variables.
To print each result, you create a variable that stores the string that will be part of the print message.
The last block of code concatenates the range and message variables to print the range for each column in an easy-to-read format.
#calculate the range for each column
range_value_1 <- range(lottery$first_set)
range_value_2 <- range(lottery$second_set)
range_value_3 <- range(lottery$third_set)
range_value_4 <- range(lottery$fourth_set)
range_value_5 <- range(lottery$fifth_set)
range_value_6 <- range(lottery$sixth_set)
#define labels
label1 <- "Range of Column 1"
label2 <- "Range of Column 2"
label3 <- "Range of Column 3"
label4 <- "Range of Column 4"
label5 <- "Range of Column 5"
label6 <- "Range of Column 6"
#print label and range
cat(label1, ":", paste(range_value_1, collapse = " to "), "\n")
cat(label2, ":", paste(range_value_2, collapse = " to "), "\n")
cat(label3, ":", paste(range_value_3, collapse = " to "), "\n")
cat(label4, ":", paste(range_value_4, collapse = " to "), "\n")
cat(label5, ":", paste(range_value_5, collapse = " to "), "\n")
cat(label6, ":", paste(range_value_6, collapse = " to "), "\n")
Recreate the results above with a loop that indexes the columns
Another way to do what we did above is to use a loop that identifies each column through indexing and prints the same message as before.
In this example the loop iterates through every column by indexing. R identifies a column by name but also by a number: the first column is 1, the second column is 2, the third column is 3, and so on. Indexing means the loop locates each column by its number and applies the range function.
A loop has many advantages, among them efficiency and scalability, especially when you are dealing with large data.
# Create a loop to calculate and print the range for each column
for (i in 1:ncol(lottery)) {
range_values <- range(lottery[, i], na.rm = TRUE) #apply range function and ignore na's.
column_name <- colnames(lottery)[i] #index column names and assign to variable
label <- paste("Range of", column_name) #store print message for column name into variable
cat(label, ":", paste(range_values, collapse = " to "), "\n")#concatenate and print
}
Iterate the loop by indexing through non-sequential columns
In this next example the loop filters through specific columns. There are times when you only want to run calculations on specific columns.
The example stores the desired indexes in a variable. The for loop then runs through those specific columns and prints the range.
# Specify the column indices of interest
column_indices <- c(1, 2, 5, 6) #identify columns to apply loop to
# Create a loop to calculate and print the range for selected columns
for (i in column_indices) {
range_values <- range(lottery[, i], na.rm = TRUE)
column_name <- colnames(lottery)[i]
label <- paste("Range of", column_name)
cat(label, ":", paste(range_values, collapse = " to "), "\n")
}
Create a loop to get the range of columns by column name.
In the following example, instead of indexing we use column names. A variable stores the column names we want the loop to run through. The loop filters through the desired columns and uses the range function to do the calculation. A label is created for the print message, and the variables created by the loop are concatenated and displayed.
# Get the column names of interest
column_names <- c("first_set", "second_set", "third_set", "fourth_set", "fifth_set", "sixth_set")
# Create a loop to calculate and print the range for each column
for (column_name in column_names) {
range_values <- range(lottery[[column_name]], na.rm = TRUE)
label <- paste("Range of", column_name)
cat(label, ":", paste(range_values, collapse = " to "), "\n")
}
The Value of Time
It took me more time to figure out how to write the loops than to do the first example, where the range function is used on each of the six columns and the print message is stored in six variables. The reason it took longer is that I struggle with writing loops.
I struggle with loops because they are an unused muscle. The data I deal with is small, and my crutch is to use libraries and functions to do the work I need to do. I did this project in large part to force myself to figure out how to write loops.
We already used the table function to extract the frequency for each column. Next we will sort the frequency variables we created to show the most and least frequent numbers in each set, displayed in a more organized way.
Top 5 Most Frequent Numbers in First Set
#sort frequency variables to display top 5 most frequent
top5_frequencies1 <- head(sort(frequency1, decreasing = TRUE), 5)
top5_frequencies2 <- head(sort(frequency2, decreasing = TRUE), 5)
top5_frequencies3 <- head(sort(frequency3, decreasing = TRUE), 5)
top5_frequencies4 <- head(sort(frequency4, decreasing = TRUE), 5)
top5_frequencies5 <- head(sort(frequency5, decreasing = TRUE), 5)
top5_frequencies6 <- head(sort(frequency6, decreasing = TRUE), 5)
#store display labels into variables.
label1 <- "Most frequent numbers in first set"
label2 <- "Most frequent numbers in second set"
label3 <- "Most frequent numbers in third set"
label4 <- "Most frequent numbers in fourth set"
label5 <- "Most frequent numbers in fifth set"
label6 <- "Most frequent numbers in sixth set"
#concatenate the first set
cat(label1, "\n")
# Print the top 5 frequencies
print(top5_frequencies1)
1 has been the winning number 110 times since 2010.
2 has been the winning number 105 times since 2010.
3 has been the winning number 100 times since 2010.
5 has been the winning number 92 times since 2010.
4 has been the winning number 85 times since 2010.
Top 5 Most Frequent Numbers in 2nd Set
#concatenate second set
cat(label2, "\n")
#print most frequent in second set.
print(top5_frequencies2)
12 has been the winning number 57 times since 2010.
28 has been the winning number 55 times since 2010.
21 has been the winning number 54 times since 2010.
15 has been the winning number 53 times since 2010.
20 has been the winning number 52 times since 2010.
For expediency, the subtitle and results are posted for the remaining columns.
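Rather than repeating the cat and print pairs for each set, the same output could come from one short loop; a sketch reusing the top-5 and label variables created above:
#print the label and top 5 frequencies for every set in one loop
top5_list <- list(top5_frequencies1, top5_frequencies2, top5_frequencies3,
                  top5_frequencies4, top5_frequencies5, top5_frequencies6)
labels <- c(label1, label2, label3, label4, label5, label6)
for (i in seq_along(top5_list)) {
  cat(labels[i], "\n")
  print(top5_list[[i]])
}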
Top 5 Most Frequent Numbers in Third Set
39 has been the winning number 51 times since 2010.
37 has been the winning number 49 times since 2010.
36 has been the winning number 48 times since 2010.
35 has been the winning number 47 times since 2010.
32 has been the winning number 46 times since 2010.
Top 5 Most Frequent Numbers in 4th Set
45 has been the winning number 56 times since 2010.
46 has been the winning number 53 times since 2010.
52 has been the winning number 52 times since 2010.
53 has been the winning number 51 times since 2010.
44 has been the winning number 48 times since 2010.
Top 5 Most Frequent Numbers in 5th Set
59 has been the winning number 85 times since 2010.
58 has been the winning number 80 times since 2010.
69 has been the winning number 79 times since 2010.
57 has been the winning number 66 times since 2010.
56 has been the winning number 64 times since 2010.
Top 5 Most Frequent Numbers in 6th Set
24 has been the winning number 67 times since 2010.
18 has been the winning number 18 times since 2010.
4 has been the winning number 58 times since 2010.
11 has been the winning number 56 times since 2010.
14 has been the winning number 55 times since 2010.
When we compare the range and frequency of the first and sixth sets, we can see that lower-value numbers are more frequent in the first set than in the sixth set, which brings down the mean and median for the first column despite its bigger range.
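A quick way to see this side by side is to pull the min, max, and mean for the two columns in one call (a sketch using the existing lottery data frame):
#compare range and mean for the first and sixth sets
sapply(lottery[, c("first_set", "sixth_set")],
       function(x) c(min = min(x), max = max(x), mean = round(mean(x), 2)))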
Top 5 Least Frequent Winning Numbers in Set 1
#sort frequency variables ascending to display the 5 least frequent
top5_low_frequencies1 <- head(sort(frequency1, decreasing = FALSE), 5)
top5_low_frequencies2 <- head(sort(frequency2, decreasing = FALSE), 5)
top5_low_frequencies3 <- head(sort(frequency3, decreasing = FALSE), 5)
top5_low_frequencies4 <- head(sort(frequency4, decreasing = FALSE), 5)
top5_low_frequencies5 <- head(sort(frequency5, decreasing = FALSE), 5)
top5_low_frequencies6 <- head(sort(frequency6, decreasing = FALSE), 5)
#create string to print final output and store in variable
label1 <- "Least frequent numbers in first set"
label2 <- "Least frequent numbers in second set"
label3 <- "Least frequent numbers in third set"
label4 <- "Least frequent numbers in fourth set"
label5 <- "Least frequent numbers in fifth set"
label6 <- "Least frequent numbers in sixth set"
#concatenate for the first set
cat(label1, "\n")
# Print the top 5 least frequent numbers
print(top5_low_frequencies1)
Numbers 43, 48, 49, 50 and 51 have been part of the winning number in set 1 only once since 2010.
Top 5 Least Frequent Winning Numbers in Set 2
Numbers 50, 55, 48, 56 and 58 have been part of the winning number in Set 2 once since 2010.
Top 5 Least Frequent Winning Numbers in Set 3
3 has been the winning number in set 3 once.
60 has been the winning number in set 3 twice.
6, 8 and 62 have been the winning numbers in set 3 three times.
Top 5 Least Frequent Winning Numbers in Set 4
7, 9 and 13 have been the winning number in the fourth set once.
18 has been the winning number in the fourth set twice.
12 has been the winning number in the fourth set three times.
Top 5 Least Frequent Winning Numbers in Set 5
19, 22 and 25 have been the winning number in the fifth set once.
24 has been the winning number in the fifth set twice.
20 has been the winning number in the fifth set three times.
Top 5 Least Frequent Winning Numbers in Set 6
37 has been the winning number in the sixth set three times.
38 has been the winning number in the sixth set five times.
39 has been the winning number in the sixth set six times.
36 has been the winning number in the sixth set seven times.
31 has been the winning number in the sixth set eleven times.
Most Frequent Winning Numbers in Each Set
1 12 39 45 59 24
Least Frequent Winning Numbers in Each Set
43 50 3 7 19 37
Multiple numbers share the same frequency; we selected the first number on each list.
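Those two rows can also be pulled programmatically. A sketch that takes the number with the highest and lowest count in each column (when several numbers tie, which.max and which.min return the first entry in the table, so the tie-breaks may differ from the hand-picked rows above):
#extract the first most and least frequent number from each column
most_frequent <- sapply(lottery, function(x) names(which.max(table(x))))
least_frequent <- sapply(lottery, function(x) names(which.min(table(x))))
print(most_frequent)
print(least_frequent)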
Build Machine Learning Model to Predict Lottery
Two models will be built to see the probability of winning if we were to plug in the most and least frequent numbers.
This is for fun and by no means a guarantee that anyone will win anything if they use the numbers and outcomes from the machine learning models.
The two models are a regression model and an XGBoost model.
Add Sum and Average Columns to the Lottery Data Frame
#Calculate the sums and averages for the data set and append to the data frame.
lottery$Sum <- rowSums(lottery[, c('first_set', 'second_set', 'third_set', 'fourth_set', 'fifth_set', 'sixth_set')])
lottery$Average <- rowMeans(lottery[, c('first_set', 'second_set', 'third_set', 'fourth_set', 'fifth_set', 'sixth_set')])
Split the data
set.seed(123) # Set a seed for reproducibility
split <- sample.split(lottery$first_set, SplitRatio = 0.7)#split train and test data into 70/30 split
train_data <- subset(lottery, split == TRUE)
test_data <- subset(lottery, split == FALSE)
Create Model
model <- lm(first_set ~ Sum + Average, data = train_data)
print(model)
When we look at the coefficients, Average returns NA. An NA coefficient here means lm dropped the variable: Average is perfectly collinear with Sum (the average is just the sum divided by six), so it adds no information beyond what Sum already provides and is not a meaningful predictor on its own.
For the Sum variable the coefficient is 0.1414, which means that for each unit increase in Sum the estimated change in the dependent variable is 0.1414, in the positive direction. We know that lower-value numbers are selected with higher frequency in the first column than in the other columns, but the strength of that relationship is difficult to pin down because the numbers are randomly drawn.
For a more familiar analogy, if this were a model of ticket prices, a coefficient of 0.1414 would mean the predicted price rises by 0.1414 for each unit increase in the predictor. Here the predicted first-set value rises with the sum of the whole ticket; the change is most noticeable from the first set to the rest, and murkier across the middle sets.
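To make the interpretation concrete, here is a hand-computed prediction from the fitted coefficients (a sketch; the Sum value of 180 is an arbitrary example, and the intercept comes from coef(model) at run time):
#manual prediction from the fitted coefficients
b <- coef(model)
#predicted first_set for a ticket whose six numbers sum to 180 (Average was dropped as NA)
pred_manual <- unname(b["(Intercept)"] + b["Sum"] * 180)
print(pred_manual)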
#Evaluate model with summary function
summary(model)
The summary indicates that the p-value is significant.
The residual standard error is 6.978, which means the observed values differ from the predicted values by 6.978 units on average.
Multiple R-squared ranges from 0, where the predictors explain none of the variance, to 1, where they explain all of it. Our multiple R-squared is 0.445, which means the predictors explain 44.5% of the variance in the model.
Adjusted R-squared adjusts multiple R-squared for the complexity of the model and is supposed to provide a more reliable measure of model fit; it is usually slightly less than multiple R-squared. Our model has an adjusted R-squared of 0.44, which means it explains 44% of the variance.
The F-statistic tests the overall significance of the model by comparing the variance explained by the model to the variance left unexplained. A large F-statistic with a small associated p-value means the model has a significant impact on predicting the outcome. Our F-statistic is 822.8 and our p-value is less than 2.2e-16, so the model is statistically significant, meaning there is evidence of an effect, relationship, or association between the variables. What we do not know is the size of that effect, relationship, or association.
What we know about the data is that these are the numbers selected by the state of New York as winning numbers. The numbers were randomly drawn, and whether players matched the entire number, part of it, or none of it determined whether they won.
We also know that some numbers in each position have a higher frequency than others. Number 1 has the highest frequency of appearing in the first position of the whole lottery number, ahead of 2, 3, 4, and so on.
We do not know why some numbers have a higher frequency than others. The numbers are supposed to be randomly selected, but for whatever reason 1 has been drawn the most since 2010.
If some numbers keep being selected and others are not, there is a pattern of some kind. Why some are selected and others are not is what we can't answer.
In terms of whether this is a good model: we would like the variance to be smaller, and we would like to know the relationships and reasons behind the frequencies. The model can be improved, and more data about the selection process would improve it. For now it will do, given that this is for fun.
Predictions and Evaluation of Predictions
#Use test data to make predictions
predictions <- predict(model, newdata = test_data)
mse <- mean((predictions - test_data$first_set)^2) # Mean Squared Error
mae <- mean(abs(predictions - test_data$first_set)) # Mean Absolute Error
rsquared <- 1 - (sum((test_data$first_set - predictions)^2) / sum((test_data$first_set - mean(test_data$first_set))^2)) # R-squared
print(mse)
print(mae)
print(rsquared)
Mean Squared Error is 46.15441.
Mean Absolute Error is 5.43097.
R-Squared is 0.3954749.
MSE and MAE provide insight into the magnitude of prediction errors: the lower the value, the better the predictions; the higher the value, the larger the prediction error. We would like a lower mean squared error to reduce our prediction error.
R-squared indicates how well the predictors explain the variability in the model.
Predict Chances of Winning With Most Frequent Numbers In Each Set
Let's use the numbers that have been selected most and least frequently by the state of New York and see what the model says.
#try most common numbers in each set 1 12 39 45 59 24
new_numbers <- data.frame(Number1 = 1, Number2 = 12, Number3 = 39, Number4 = 45, Number5 = 59, Number6 = 24)
new_numbers$Sum <- rowSums(new_numbers[, c('Number1', 'Number2', 'Number3', 'Number4', 'Number5', 'Number6')])
new_numbers$Average <- rowMeans(new_numbers[, c('Number1', 'Number2', 'Number3', 'Number4', 'Number5', 'Number6')])
predicted_probabilities <- predict(model, newdata = new_numbers)
print(predicted_probabilities)
Probability of Winning With the Numbers
The model predicts that the probability of winning with the high-frequency numbers is 11%. (Strictly speaking, this linear model predicts a first_set value rather than a true probability, so treat the 11% as a playful reading.) That is not very good, but if the model were trustworthy, that probability would still beat playing without knowing what the high-frequency numbers are.
Predict Chances of Winning with The Least Winning Numbers in Each Set
#data frame with least frequent numbers with each set
new_numbers <- data.frame(Number1 = 43, Number2 = 50, Number3 = 3, Number4 = 7, Number5 = 19, Number6 = 37)
#calculate column sums
new_numbers$Sum <- rowSums(new_numbers[, c('Number1', 'Number2', 'Number3', 'Number4', 'Number5', 'Number6')])
# calculate column averages
new_numbers$Average <- rowMeans(new_numbers[, c('Number1', 'Number2', 'Number3', 'Number4', 'Number5', 'Number6')])
# make prediction and print the result
predicted_probabilities <- predict(model, newdata = new_numbers)
print(predicted_probabilities)
Probability of Winning With the Least Frequent Winning Numbers in Each Set
What does it mean?
With the most frequent numbers we have a predicted probability of 11%; with the least frequent numbers, 8%. Does that mean that if we use those numbers we actually have an 11% or 8% chance of winning the lottery?
Take it all with a grain of salt. There are a lot of variables that are not factored into the regression model.
XGBOOST MODEL
A regression model is a single model, but some algorithms combine multiple models.
XGBoost is an ensemble model. It typically takes weak models, such as shallow decision trees, and combines them into a stronger one. Each tree learns from the weaknesses of the trees before it: every new tree is fit to the errors the previous trees left behind, so the ensemble gets stronger round by round.
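To illustrate that chaining idea on a toy scale, here are two boosting rounds by hand (a sketch, not the xgboost internals; it assumes the rpart package is installed to supply the weak trees):
library(rpart) #assumed available; provides simple decision trees
toy <- lottery
#round 1: a shallow tree predicts the first set from two other columns
fit1 <- rpart(first_set ~ second_set + third_set, data = toy,
              control = rpart.control(maxdepth = 2))
toy$resid1 <- toy$first_set - predict(fit1, toy)
#round 2: the next tree fits what the first tree got wrong
fit2 <- rpart(resid1 ~ second_set + third_set, data = toy,
              control = rpart.control(maxdepth = 2))
#the boosted prediction stacks the two stages
boosted <- predict(fit1, toy) + predict(fit2, toy)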
Split Data
#drop the Sum and Average columns from the lottery data.
lottery <- subset(lottery, select = -c(Sum, Average))
set.seed(123)
split3 <- sample.split(lottery$first_set, SplitRatio = 0.7) #split data into 70/30 split.
#Make the splits.
train_data2 <- subset(lottery, split3 == TRUE)
test_data2 <- subset(lottery, split3 == FALSE)
Build Model
#create model and place data into a matrix
#note: first_set is among the features as well as being the label, so the model can see its own target
train_matrix <- xgb.DMatrix(data = as.matrix(train_data2[, c("first_set", "second_set", "third_set", "fourth_set", "fifth_set", "sixth_set")]),
label = train_data2$first_set)
model2 <- xgboost(data = train_matrix,
nrounds = 100,
verbose = 0)
Make Prediction
#transform test data into a matrix
test_matrix <- xgb.DMatrix(data = as.matrix(test_data2[, c("first_set", "second_set", "third_set", "fourth_set", "fifth_set", "sixth_set")]))
# make prediction with test matrix
predictions2 <- predict(model2, test_matrix)
Evaluate Model
# create outcome column in test data and place predictions in column
test_data2$outcome <- predictions2
# create rmse, mae and r-squared variables and print outcomes
rmse <- RMSE(predictions2, test_data2$first_set)
print(paste("RMSE", rmse))
mae <- MAE(predictions2, test_data2$first_set)
print(paste("MAE", mae))
r_squared <- R2(predictions2, test_data2$first_set)
print(paste("R-squared", r_squared))
We can compare RMSE, MAE, and R-squared between the models. RMSE and MAE have very low values, and the lower the values, the more accurate the model appears.
R-squared ranges from 0 to 1, and an R-squared of 0.99999 explains essentially all of the variability in the target label. The fit looks excellent, but there is a caveat: first_set is included among the features as well as being the label, so the model effectively gets to see its own answer, and the near-perfect scores reflect that setup as much as any real predictive power.
On these metrics, the XGBoost model scores better than the regression model.
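If we wanted a fairer test, the feature matrix could exclude the target column so the model has to predict first_set from the other five sets alone; a minimal sketch:
#a leak-free alternative: predict first_set from the other five columns only
train_matrix_nl <- xgb.DMatrix(
  data = as.matrix(train_data2[, c("second_set", "third_set", "fourth_set",
                                   "fifth_set", "sixth_set")]),
  label = train_data2$first_set)
model_nl <- xgboost(data = train_matrix_nl, nrounds = 100, verbose = 0)
Expect the evaluation numbers to look much worse under this setup, which is the honest picture for randomly drawn numbers.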
Input High Frequency Numbers
# Place number in new_numbers variable
new_numbers <- data.frame(first_set = 1, second_set = 12, third_set = 39, fourth_set = 45, fifth_set = 59, sixth_set = 24)
#Create model
new_matrix <- xgb.DMatrix(data = data.matrix(new_numbers))
#Prediction
predicted_probabilities2 <- predict(model2, new_matrix)
#Print probability
print(predicted_probabilities2)
Use Winning Threshold to See if Number Will Win
# store 0.5 threshold in variable
winning_threshold <- 0.5
# Compare the prediction against the winning threshold
if (predicted_probabilities2 > winning_threshold) {
print("The given set of numbers is predicted to be a winning number.")
} else {
print("The given set of numbers is not predicted to be a winning number.")
}
Try With Low Frequency Numbers
# store number in variable
new_numbers2 <- data.frame(first_set = 43, second_set = 50, third_set = 3, fourth_set = 7, fifth_set = 19, sixth_set = 37)
#input numbers and make prediction
new_matrix2 <- xgb.DMatrix(data = data.matrix(new_numbers2))
predicted_probabilities3 <- predict(model2, new_matrix2)
#print probability
print(predicted_probabilities3)
Can We Have The Model Spit Out Winning Numbers?
In the previous examples we replicated what we did with the regression model by selecting the high- and low-frequency numbers in each set to make up a lottery number, then used the model to predict the probability of that whole number.
What we want to do now is see how a machine learning model could give us the top 5 numbers most likely to win, so we can go to the store and play the lottery.
# Convert the 'lottery' data frame to a matrix
lottery_matrix <- as.matrix(lottery)
# Predict probabilities for the entire 'lottery' dataset
predictions <- predict(model2, newdata = xgb.DMatrix(data = lottery_matrix))
# Create a copy of the 'lottery' dataset with predicted probabilities
lottery_with_predictions <- lottery
lottery_with_predictions$Predicted_Probability <- predictions
# Sort the dataset based on predicted probabilities in descending order
lottery_with_predictions <- lottery_with_predictions[order(-predictions), ]
# Select the top numbers with highest probabilities
top_numbers <- lottery_with_predictions[, c("first_set", "second_set", "third_set", "fourth_set", "fifth_set", "sixth_set")][1:5, ]
# Get the predicted probabilities for the top numbers
top_probabilities <- lottery_with_predictions$Predicted_Probability[1:5]
# Combine the top numbers and probabilities into a data frame
top_numbers_with_probabilities <- cbind(top_numbers, Probability = top_probabilities)
# Print the top numbers with probabilities
print(top_numbers_with_probabilities)
Here You Go New York
52 58 59 64 66 9 (probability 52%)
51 54 57 60 69 11 (probability 51%)
50 51 59 61 63 4 (probability 50%)
49 53 57 59 62 26 (probability 49%)
48 49 57 62 69 19 (probability 48%)
Conclusion
Comparing the two models: the regression model generated probabilities of 11% and 8% for the high- and low-frequency numbers, while the highest probability among the lottery numbers the XGBoost model spit out was 52%.
More data could improve the models, depending on the data. The core challenge in using a machine learning algorithm to predict winning lottery numbers is that lottery numbers are drawn at random. Other problems are the limited amount of data and changes over time to the rules and the number-selection process.
We also did not talk about overfitting. Overfitting is when a machine learning model becomes specialized to the training data but struggles with the testing data and with new data. It learns a lot from the training data, but when new data is presented, the model cannot be as accurate on it.
In this example we trained and tested the models and then used the entire data set to spit out possible winning numbers. We are asking the model to predict winning numbers from a set of past winning numbers. New winning numbers that have never been drawn before can come up, and because that is a possibility, there is missing data that never reaches the machine learning model.