Using Apache Spark for Machine Learning – Benefits of DataFrames vs. RDDs




Here at Bolt, we use Apache Spark and MLlib to power our machine learning pipelines. Spark's adoption has grown considerably in recent years, and we have found that it can dramatically accelerate the pace of development by replacing entire components that would otherwise have to be built from scratch.

When getting started with Spark, there are a variety of approaches to consider. The primary interfaces are Resilient Distributed Datasets (RDDs) and Spark SQL (DataFrames / Datasets). RDDs are the original API that shipped with Spark 1.0, where data is passed around as opaque objects. RDD operators are limited (map, reduceByKey, …), though they are likely more familiar to Hadoop users and functional programming developers. DataFrames, by contrast, offer a more ergonomic interface, where data is represented in a tabular format (rows / columns) that comes with an attached schema and rich SQL-like operators.


How To Automatically Segment Your Data With Clustering

One of the most common analyses we perform is to look for patterns in data. What market segments can we divide our customers into? How do we find clusters of individuals in a network of users?

It’s possible to answer these questions with Machine Learning. Even when you don’t know which specific segments to look for, or have unstructured data, you can use a variety of techniques to algorithmically find emergent patterns in your data and properly segment or classify outcomes.

In this post, we’ll walk through one such algorithm called K-Means Clustering, how to measure its efficacy, and how to choose the sets of segments you generate.  
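
As a rough illustration of the idea, here is a minimal sketch of K-Means in plain Python. The sample points, the function names (`kmeans`, `sq_dist`, `inertia`), and the choice of inertia (the within-cluster sum of squared distances) as the efficacy measure are our assumptions for this example, not the exact method from the full post.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of numbers) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances; lower means tighter clusters."""
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels, inertia(points, centroids, labels))
```

One way to choose the number of segments is to run this for several values of k and compare the resulting inertia, keeping in mind that inertia always shrinks as k grows.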


4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)

There are a number of machine learning models to choose from. We can use Linear Regression to predict a value, Logistic Regression to classify distinct outcomes, and Neural Networks to model non-linear behaviors.

When we build these models, we always use a set of historical data to help our machine learning algorithms learn the relationship between a set of input features and a predicted output. But even if this model can accurately predict a value from historical data, how do we know it will work as well on new data?

Or more plainly, how do we evaluate whether a machine learning model is actually “good”? 

In this post we’ll walk through some common scenarios where a seemingly good machine learning model may still be wrong. We’ll show how you can evaluate these issues by assessing metrics of bias vs. variance and precision vs. recall, and present some solutions that can help when you encounter such scenarios.
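
To make precision vs. recall concrete, here is a small sketch in plain Python. The labels below are made-up data (1 = positive, 0 = negative) and the function name `precision_recall` is our own; the point is only to show how a model can look accurate while still missing most positives.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical outcomes: 3 real positives, but the model only flags one of them.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

Here the model is right on 6 of 8 examples and every positive it flags is correct, yet it finds only one of the three real positives, which is exactly the kind of gap accuracy alone hides.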


What is an Artificial Neural Network

These days we hear a lot about Artificial Neural Networks. Facebook uses them to classify different types of text in their posts. Zillow recently started using them to better predict house prices from images. Google even open sourced their technology to help any company build their own.  

But what are Neural Networks? And when should you use one?

Put simply, a Neural Network is another Machine Learning technique, one loosely modeled on how the human brain processes and solves problems. As opposed to regression models that predict an outcome based on a linear relationship between a set of inputs, Neural Networks can algorithmically construct a model based on more complex non-linear relationships.
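
A tiny sketch can show what "non-linear" buys you. XOR (output 1 when exactly one input is 1) cannot be separated by a single straight line, but a network with one hidden layer handles it. The weights below are hand-picked for illustration rather than learned, and the names `step` and `xor_network` are our own.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted input is positive."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    """Two hidden units plus one output unit, with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # "or" but not "and"


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

In a real Neural Network these weights are learned from data rather than set by hand, but the structure is the same: hidden units combine the inputs, and the output combines the hidden units.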

In our continuing ML 101 Series, we’ll walk through when and how you can use Neural Networks to make predictions on your own data, and some examples of when they’re useful (and when they aren’t).


How to Predict Yes/No Outcomes Using Logistic Regression

Often we want to predict discrete outcomes in our data. Can an email be designated as spam or not spam? Was a transaction fraudulent or valid? 

Predicting such outcomes lends itself to a type of Supervised Machine Learning known as Binary Classification, where you try to distinguish between two classes of outcomes.

One of the most common methods for Binary Classification is Logistic Regression. The goal of Logistic Regression is to estimate the probability of a discrete outcome occurring, based on a set of past inputs and outcomes. As part of our continuing ML 101 series, we’ll review the basic steps of Logistic Regression, and show how you can use such an approach to predict the probability of any binary outcome.
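
As a minimal sketch of the idea, the code below fits a one-feature Logistic Regression with plain gradient descent. The data is hypothetical (say, the number of suspicious links in an email, and whether it was spam), and the learning rate, epoch count, and function names are assumptions chosen for this example.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        preds = [sigmoid(w * x + b) for x in xs]
        # Gradients of the average log loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(preds, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical past inputs (feature value) and outcomes (1 = spam, 0 = not spam).
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
prob_low = sigmoid(w * 0 + b)   # predicted probability of spam for x = 0
prob_high = sigmoid(w * 5 + b)  # predicted probability of spam for x = 5
print(prob_low, prob_high)
```

The model outputs a probability rather than a hard label, so you still choose a threshold (0.5 is a common default) to turn it into a yes/no prediction.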


How to Predict Any Value Using Linear Regression

One of the most common questions we ask of our data is what the value of something will be. How many items will we sell next month? How much does it cost to produce them? How much revenue will we make over the year?

You can often answer such questions with Machine Learning. As covered in our previous post on Supervised Machine Learning, if you have enough historical data on past outcomes, you can make such predictions on future outcomes.

One of the most common Supervised Learning approaches to predicting a value is Linear Regression. In Linear Regression, the goal is to evaluate a linear relationship between some set of inputs and the output value you are trying to predict. As part of our continuing ML 101 series, we’ll review the basic steps of Linear Regression, and show how you can use such an approach to predict any value in your own dataset.
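
For a single input, that linear relationship can be estimated in closed form with ordinary least squares. The sketch below uses made-up numbers (month number vs. items sold) and our own function name `fit_line`; it is one simple way to fit a line, not necessarily the procedure used in the full post.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: month number and items sold that month.
months = [1, 2, 3, 4]
items_sold = [3, 5, 7, 9]

slope, intercept = fit_line(months, items_sold)
forecast = slope * 5 + intercept  # predicted items sold in month 5
print(slope, intercept, forecast)
```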


The Two Types of Machine Learning


Machine Learning can seem like a daunting field. But the core concepts are, with a little help, quite accessible.

To better understand the field of Machine Learning, we wanted to provide some quick overviews of the fundamental concepts as part of our ML 101 series. In this first post, we review common applications of the field, and the differences between the two subtypes of Supervised vs. Unsupervised Machine Learning.


Why You Don’t Need a Data Scientist

Data Science is growing.

It’s been called the “sexiest job of the 21st century”, and is attracting a flood of new entrants.

Recent reports indicate that there are 11,400 data scientists who have held 60,200 data-related roles. And the overall count has grown 200% over the last 4 years, across Internet, Education, Financial Services, and Marketing industries.

And yet amidst a field growing so fast, you can observe a bit of confused exuberance. It’s not uncommon for a company to hire a data scientist just after product launch, or after Series A. To some, data science has become the magic bullet for achieving scale or their next inflection point.

But what does a data scientist do? And does your company actually need one?


7 Simple Rules to Ensure Data Quality in Your Data Warehouse

When importing data into your data warehouse, you will almost certainly encounter data quality errors at many steps of the ETL pipeline.

An extraction step may fail due to a connection error to your source system. The transform step can fail due to conflicting record types. And even if your ETL succeeds, anomalies can emerge in your extracted records – null values throwing off stored procedures, or unconverted currencies inflating revenue.

How do you catch these errors proactively, and ensure data quality in your data warehouse?

Here at Bolt we have experience automating processes to ensure that the data in your warehouse stays trustworthy and clean. In this post we outline 7 simple rules you can use to ensure data quality in your own data warehouse. We used rules like these at Optimizely with great results.
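
To give a flavor of what an automated check can look like, here is a small sketch in plain Python that scans extracted records for two of the anomalies mentioned above: null values and unconverted currencies. The field names (`id`, `amount`, `currency`), the expected currency, and the function name are hypothetical, and these two checks are illustrations rather than the 7 rules themselves.

```python
def find_quality_issues(records, required_fields=("id", "amount", "currency"),
                        expected_currency="USD"):
    """Return a list of (record_index, description) for each problem found."""
    issues = []
    for index, record in enumerate(records):
        # Check 1: required fields must be present and non-null.
        for field in required_fields:
            if record.get(field) is None:
                issues.append((index, f"null {field}"))
        # Check 2: amounts should already be converted to the expected currency.
        currency = record.get("currency")
        if currency is not None and currency != expected_currency:
            issues.append((index, f"unconverted currency {currency}"))
    return issues


# Hypothetical extracted rows; the second has a null, the third was never converted.
records = [
    {"id": 1, "amount": 10.0, "currency": "USD"},
    {"id": 2, "amount": None, "currency": "USD"},
    {"id": 3, "amount": 900.0, "currency": "JPY"},
]

issues = find_quality_issues(records)
print(issues)
```

Running a check like this after each load, and failing the pipeline or alerting when the list is non-empty, is one way to catch such errors before they reach downstream reports.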


10 Free Resources for Customer Intelligence

Customer intelligence is essential for market research and sales prospecting.

Such analysis requires segmenting customers by their company’s properties. Common properties often include measures such as web traffic, app performance, technology adoption, ad spend, and company size.

But how do you identify which companies have these properties?

In this post, we review 10 (mostly) free resources you can use to identify companies by these properties, and to help you diligently scale your customer intelligence efforts.