Datamilk

Welcome to Datamilk

About us

Datamilk is an independent, Sydney-based, bespoke data mining consultancy. We specialise in all aspects of big data, data mining and predictive analytics.

Datamilk provides data mining services including programming, predictive modelling and business analytics. Our services span a range of areas including data cleaning, data management, regression models, logistic, ordinal and nominal models, time series and forecasting, and more advanced techniques. We also run both public and onsite training courses in data mining techniques.

We specialise in solutions for small to medium-sized businesses, which typically don't have the in-house resources to make the most of their data. We assist companies to identify the business issues they need to address, determine what data will be needed to solve the problem, select the correct data mining technique, translate the results into a meaningful business decision and incorporate data mining into the way they do business.

Who's doing the work?

About Ross Farrelly

Owner Ross Farrelly (M. App. Stats., BSc(Hons), M. App. Ethics., Grad. Dip. Ed.) has many years’ experience in successful consulting, training and programming for clients around Australia, New Zealand and South East Asia.

He is currently Chief Data Scientist at Teradata South Pacific.

He has successfully designed and implemented big data projects, comprising both Hadoop and Teradata Aster technologies, in large companies in Australia, Indonesia and Japan.

He is a certified advanced SAS programmer and also has extensive experience with R, SQL-MR, SQL and Java.

See here for Ross's LinkedIn profile.

What is Big Data?

A Brief Introduction to Big Data

My preferred definition of the much-hyped and much-overused term “Big Data” is:

Big Data Analytics refers to analytics on data that is not able to be performed on a standard relational data warehouse in a timeframe and cost that is acceptable for its business purpose.*

There are, generally speaking, three aspects to big data: new data sources, new analyses and the challenges of scale.

Data Sources
Big Data generally means accessing new, previously unused data sources and then executing new analyses on these data. Data in such formats as XML, JSON, raw weblogs or free text are often not included in a traditional data warehouse implementation. The ingestion, parsing and analysis of these data sources often come under the umbrella of big data.

Analyses
Analyses which go beyond traditional SQL aggregation, but which are done in-database on large data sets, are generally encountered in a big data setting. Path analyses, graph analyses, predictive analytics, clustering and regression, as well as entity extraction and parsing, are some commonly used examples.

Scale
Many of the above analyses can be executed on a small scale when the dataset fits in memory. Tools such as R and SAS are typically used in this context. However, when the analyses need to be executed on a larger scale, a new approach is needed. Currently, one approach is to translate the required analyses into a framework such as MapReduce (in the case of ordered data) and execute the analysis on a technology such as Hadoop or Teradata Aster. If the problem is one of graph analysis, it may be expressed using the Pregel framework.
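To make the idea of translating an analysis into a MapReduce-style framework more concrete, here is a minimal single-machine sketch in plain Python. It only illustrates the map, shuffle and reduce stages; it is not Hadoop or Teradata Aster code. The weblog format and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw weblog lines in the form "<user> <page>" (format assumed).
LOG_LINES = [
    "u1 /home",
    "u2 /home",
    "u1 /pricing",
    "u3 /home",
    "u2 /contact",
]


def map_phase(line):
    """Emit one (key, value) pair per log record: (page, 1)."""
    _user, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


page_hits = run_job(LOG_LINES)
print(page_hits)
```

On a real cluster the map and reduce functions would run in parallel across many nodes, with the framework handling the shuffle; the sketch only shows how an aggregation is expressed in that shape.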

The confluence of these three factors, scalable advanced analyses on semi-structured data sources, constitutes the phenomenon of big data.

Datamilk has extensive hands-on experience working on multiple big data projects in major companies in Australia, Japan and Indonesia. Contact us to discuss how big data can benefit your company.

* http://blogs.teradata.com/anz/how-to-start-a-successful-big-data-journey/

What is Data Science?

Data Science

All science is data science but not all use of data is scientific. Science without data is not science. However, the recently coined term “data science” refers to an emerging discipline which is an amalgam of various fields of endeavor including: statistics, data mining, data warehousing, parallel processing, computer programming and business consulting.

The basic idea of data science is to use large data sets and advanced analyses to solve significant business problems. A good data scientist has the skill to engage with the business users, consult with them to discover their needs, match those needs to the appropriate data and analyses and then solve the problem using skills such as data manipulation, parallel processing across a large cluster if needed, data mining and predictive analytics.

He or she can also take the vital final step of explaining the solution in plain English, in terms the business user can understand and endorse.

Contact Datamilk to discuss how data science can improve your business.

What is Datamining?

A Brief Introduction to Data Mining

Data mining is the process of extracting useful information which may be hidden in the data owned by your business - information which can help you make better business decisions.

For example, a retailer may wish to know which of his customers are most likely to take their business elsewhere so he can target them with an intervention to try to retain them. There are data mining techniques (known as cluster analysis) to identify these customers and to test which interventions are most likely to retain them. A dairy farmer may wish to know which cows to keep to improve his herd and which to sell. Using data mining techniques known as classification analysis, the calves can be classified as keep or sell using the historical milk production data.

A mining company may need to predict the gold yield they can expect from an ore body. Based on historical records, an equation (or model) can be developed to allow accurate predictions of this type. This is known as regression modelling.
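As a rough illustration of what developing an equation from historical records can look like, the following sketch fits a straight line by ordinary least squares. The single predictor (ore grade), the figures and the function names are all hypothetical; a real yield model would likely use more variables and more careful validation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted equation to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical records: ore grade (g/t) and gold recovered (oz).
ore_grade = [1.2, 1.8, 2.5, 3.1, 3.9]
gold_yield = [110.0, 171.0, 240.0, 295.0, 380.0]

model = fit_line(ore_grade, gold_yield)
print("Predicted yield at 2.0 g/t:", predict(model, 2.0))
```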

A marketing department might want to measure the effect of an advertising campaign. The customers are randomly divided into two groups: a treatment group which receives a promotional email and a control group which does not. The emails are sent out and the responses measured. By comparing the spend of the treatment and control groups, which due to the randomized way in which they were selected are as alike as possible in all respects, we can measure the effect of the campaign.
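The treatment and control comparison can be sketched in a few lines of Python. This is a simplified illustration only: the customer IDs and spend figures are invented, the function names are hypothetical, and a real analysis would also test whether the difference is statistically significant.

```python
import random
from statistics import mean


def assign_groups(customer_ids, seed=0):
    """Randomly split customers into equal-sized treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def campaign_effect(spend, treatment, control):
    """Difference in average spend between treatment and control."""
    treated = mean(spend[c] for c in treatment)
    untreated = mean(spend[c] for c in control)
    return treated - untreated


# Hypothetical customers; spend would be recorded after the emails go out.
customers = list(range(1, 11))
treatment, control = assign_groups(customers)

# Invented post-campaign spend: control customers spend 50, treated spend 56.
spend = {c: 50.0 for c in control}
spend.update({c: 56.0 for c in treatment})

print("Estimated uplift per customer:", campaign_effect(spend, treatment, control))
```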

The data mining process runs as follows. First the problem which needs to be solved must be clarified and then quantified. The data available is then collected, cleaned, checked and assessed to see if it contains the necessary information to solve the problem. The data is then divided into two sets: a training set and a test set. The model is built on the training set and tested for performance on the test set.

For example, suppose your task is to predict the number of magazines sold by your retail outlets each week. This is important to know accurately because it costs you if you send too many (the unsold magazines are returned) or if you send too few (you miss out on sales). You collect the data for the previous year showing the demographics of the stores and the number of magazines sold. You build the model on two thirds of the data. You then use the model (or equation) to predict how many magazines will be sold at each store in each week in the remaining one third of your data (which the model has not seen). Since you know how many magazines were actually sold, you can calculate how good your model is at making these predictions. If it is better than your current methods for deciding on how many magazines to send, you could start using the model to make your business more profitable, and you can also calculate how much the model will save you. As conditions change, the model will need to be maintained and improved to ensure that it remains as accurate as possible.
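The two-thirds and one-third procedure described above can be sketched as follows. The records, the single demographic variable and the straight-line model are assumptions chosen to keep the example short; they are not the model used in any actual engagement.

```python
import random


def split_two_thirds(records, seed=0):
    """Shuffle the records and return (training set, test set) split 2:1."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, points):
    """Average size of the prediction error on data the model has not seen."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Hypothetical records: (local population in thousands, magazines sold per week).
records = [(5, 52), (8, 79), (12, 125), (3, 33), (10, 98),
           (7, 71), (15, 149), (6, 63), (9, 92)]

train, test = split_two_thirds(records)
model = fit_line(train)
print("Mean absolute error on the held-out third:", mean_absolute_error(model, test))
```

The same error measure can be computed for the current ordering method on the same held-out records, which gives a like-for-like comparison before the model is put into use.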

These are just a few examples of data mining - there are many more. But the thing they all have in common is the use of data to improve business decisions and make your business more profitable.

Case Studies

Datamilk has many examples of consulting and model building including case studies such as:

  • predicting the number of offences likely to be committed by young drivers
  • predicting the future value of an unpublished manuscript
  • predicting the probability of certain types of customers making an insurance claim
  • predicting the yield from a gold processing plant

For more details on these and other case studies please contact us.

Sandbox

In addition to traditional data mining and advanced predictive analytics, I have expertise in a number of other technologies including:

  • Data visualisation using Google Maps and the Google Visualisation API. See here for a beta example of this.
  • Text analytics and sentiment mining using Python. See here for an example of a simple sentiment miner which searches a web page for a search phrase and calculates the sentiment in each sentence.
  • R Programming. See here for an example of an R program I wrote to automatically create a visual PDF data dictionary of a dataset.
  • Unsupervised Clustering. See here for the R code implementing K-means clustering and an animation demonstrating the results.
  • Animated Data Visualisation. See here for an animated leaderboard developed for a Kaggle competition.

For more details on these projects and how you can use similar technologies to benefit your business, please contact us.

Contact

To explore the possibilities which can be realised by working with Datamilk, contact Ross Farrelly on:

0433 449 800

rfarrelly@datamilk.com

LinkedIn profile

www.datamilk.com