Master Data Management


    Introduction

    Gartner describe Master Data Management (MDM) as a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets. Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.

    It is a detailed and complex definition, but there are some key points to note here:

    • Business & IT must work together, each with an understanding of the other’s requirements and limitations. The technology enables the enhanced management, but the business must utilise it accurately, consistently and completely.
    • Stewardship / accountability – the key word in Master Data Management is Management. The data must be defined, monitored and controlled, and there must be a person or people in place held accountable for its quality and maintenance. It is here that MDM falls into the realm of Data Governance.
    • Uniform identifiers of core entities – structuring the way the data is to be held is a key requirement when setting up an MDM system. With data coming from potentially multiple sources (countries, companies, suppliers etc.) there must be a structure placed upon it in order to manage it effectively. What will be managed, and how, must be clearly defined.

    What is Master Data?

    Master data is common data about customers, suppliers, partners, products, materials, orders, accounts and other critical “entities” that is commonly stored and replicated across IT systems. This information is highly valuable, core information that is used to support critical business processes across the enterprise. It is at the heart of every business transaction, application, report and decision. With the ability to source data from an expanding number of areas including the internet, direct customer input and machine monitoring and reporting, plus a high level of metadata, companies are in a position where the “Big Data” they acquire can spiral out of control if not managed properly (or at all!). The combination of MDM and emerging big data technologies provides a 360-degree view of customers and products.

    Meta Data

    Simply put, metadata is data that describes other data. “Meta” is a prefix that means “an underlying definition or description”. Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified and file size are all very basic document metadata. Having the ability to filter through that metadata makes it much easier for someone to locate a specific document. In addition to document files, metadata is used for images, videos, spreadsheets and web pages. Metadata can be created manually or by automated information processing. Manual creation tends to be more accurate, allowing the user to input any information they feel is relevant or needed to help describe the file.
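    As a tiny illustration, most systems already hand you basic file metadata for free. In R, for example, file.info() returns the size and timestamps of a file; the document path below is hypothetical.

        # Basic file metadata (size, modification and change times) for a
        # hypothetical document -- swap in a real path to try it.
        info <- file.info("quarterly_report.docx")
        info[, c("size", "mtime", "ctime")]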

    Core Principles

    The overall goal of MDM is the same as that of any other Business Intelligence process or tool: to help the user really understand the data. The first step is to get the data, in MDM’s case from multiple sources and in multiple formats. It must then be categorised and labelled in order to be analysed in a structured and accurate fashion. The data can then be turned into information, where facts can be drawn from the figures. Customer habits, manufacturing costs and competitor performance can be identified and quantified. This information can be turned into knowledge when it is analysed and applied to identify trends, upcoming opportunities or potential threats. The most important part of this process is the utilisation of the knowledge gained to make decisions that will benefit the business.

    What happens if not implemented

    In most organizations, operational information is duplicated and scattered across multiple systems and applications, which makes it difficult for decision-makers to achieve a unified view of operational intelligence. Disparate information also prevents customers from getting the accurate and timely information they need to make purchasing decisions. In fact, most transactional data is linked in some way to master data. So, missing data, low quality information and untrustworthy or inaccurate records have a big impact on revenue, productivity, costs, compliance, agility and decision-making. Therefore, managing this master-level information proactively as it flows through the organization is essential to improving business performance.

    What are the benefits?

    • All data is stored in a centralised, single source of truth. This means the data is accurate and trustworthy and can be used with confidence.
    • The quality of the information derived from the data is improved, as it is easier to analyse and provides a full view of all data at once.
    • A single source of data can be cheaper to manage and access (once a certain volume of data is reached; there is little advantage for a small company).
    • Improved business capabilities
    • Improved technical capabilities
    • Enhanced security is provided as data is stored in a single, controllable and secure location. Access can be controlled, managed and monitored to prevent unauthorised access. Even with a single source of data, access can be granted to parts of data on a principle of least privilege basis.

    Key factors in a successful MDM

    For an MDM system to be successful, a number of key factors need to be taken into consideration.

    • As stated earlier, an MDM is a business project, not an IT one. The business owners must be involved throughout the process, providing input and defining expected outputs. In a 2011 study on MDM carried out by PwC, 70% of those interviewed said that revised governance and good management provided the most benefit, versus only 27% who considered state-of-the-art IT to be the key success factor.
    • For it to be effective, all departments, locations, business units etc. must be equally committed. Any siloed information removes the core benefits of an MDM, as it prevents visibility of the whole picture. However, this should not stop the rollout of the MDM being done in phases. Easily accessed data or well-controlled departments should be focused on from the start to provide immediate effect and show the MDM to be beneficial to the potentially more difficult areas.
    • MDM must be considered as a cultural choice of a business. It is not a one-off project with one off results. It needs to be open to development and expansion as the business grows its product list, its locations or customers.
    • The rollout of an MDM needs to be driven by C-Level business owners. Sponsorship of a project is nice to have but for a successful MDM, it requires top management to “want” the information that it is capable of providing.

    Domino’s Pizza, which has 12,000 stores worldwide spread over 75 countries, has always tried to stay at the forefront of the technology market. However, in 2014 CIO Kevin Visconi decided the time was right to implement an MDM to better understand their customers. A tender resulted in Profisee’s Mastreo platform being implemented, with steps taken to gather and cleanse 550 million unique customer records. Analysing their buying habits allowed Domino’s to identify 100 million “golden” customers to target with focused marketing. With the MDM in place, Domino’s continues to expand its understanding of its customers’ habits and is looking at rolling out the MDM to manage its suppliers and products.

    Implementing MDM

    There are 5 core steps in the implementation of an MDM

    • Discovery – Documenting and modelling essential business data and processes for utilizing common data, identifying all data sources and defining metadata. In this step, start with the most important subject area and define it. Additionally, in this step an IT architect should design the MDM architecture based on the organization’s planned approach and goals for managing master data and in conjunction with the existing enterprise architecture.
    • Analysis – this involves identifying the main sources of the data, evaluating data flows and transformation rules, and defining the metadata and data quality requirements. In this step, it’s essential to have the participation of representatives from an established data governance program. This is the most challenging step, since it is iterative and requires participation from a variety of roles.
    • Construction of the MDM in line with the architecture you’ve identified
    • Implementation of the database, populating it with the master data, assigning administrative and access rights, defining change management processes and assessing the quality of the data.
    • Sustainment – continuing to rollout the MDM across the business, managing and maintaining it on an ongoing basis and controlling change management.

    Conclusion

    In the growing world of Big Data, Master Data Management needs to be a core component of any business’s long-term strategic plans (as it almost certainly is in their competitors’). It can be a big undertaking, but if the proper steps are followed it can provide the ability to obtain knowledge from data the business already owns. For it to be truly successful, it needs to be supported and driven from the top down and become part of the company’s ethos. Given the proper respect and consideration from both business owners and IT alike, it can provide a wealth of information that can transform how the company sees itself and the world around it.


    The 3 V’s The 4 V’s The 5 V’s of Big Data


    Definition & Description

    Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, or it moves too fast, or it exceeds current processing capacity. The term Big Data can be used to describe both the volume of data (usually in the petabytes, 2^50 bytes, or exabytes, 2^60 bytes) and the technology used to store, format and analyse the various sources that make up the data set.
    Historically, business data was generated by workers entering data into computer systems. With the evolution of the internet, users were then in a position to generate their own data through internet searches, Facebook and shopping on sites such as Amazon, which represented a massive scaling up of the amount of data available for analysis. With further advances in technology, mobile devices, IoT and the fact that everything can be connected via the internet, we are now in a position where machines are accumulating data. Buildings are full of machines gathering data: temperature, humidity, electricity usage. Google tracking services, over 1 billion Facebook users, satellite imagery, smart cars etc. are together generating almost 50,000 GB per second. This is an order of magnitude above previous data generation, and the concept of Big Data was brought in to attempt to process this information.

    The V’s of Big Data

    Many of the characteristics, and quite often the trouble, with big data can be categorised under the headings of Volume (BIG!), Variety (Unstructured), Velocity (Constant) and Veracity (Accuracy). Understanding the part each plays is vital in order to get real Value from Big Data.

    Volume

    Organizations collect data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. Your mobile phone alone tracks almost your every movement, including location, speed and direction. It knows if you have gone for a jog or are stuck in traffic, allowing it to feed back on your personal health or report to others on the traffic in the area. With such huge volumes of data being created every day, the standard storage and processing techniques of relational databases were becoming incapable of storing and analysing it all. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
    In the old days we brought the data from memory to the CPU to process it. Hadoop allows us to bring multiple processors to the data, with each CPU processing a small part of it (parallel processing). This means that processing power now scales (almost) in line with the growth of the data. Hadoop spreads and analyses huge data sets over its own distributed file system (HDFS), making data available to multiple computing nodes. It then utilises a framework called MapReduce which takes care of scheduling tasks, monitoring them and re-executing any failed tasks. The main objective of MapReduce is to split the data into separate blocks, each of which is processed in parallel with the others. The output of each individual piece of analysis is then consolidated to produce the overall output.
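    To make the idea concrete, here is a toy sketch in R (not Hadoop itself) of the map/shuffle/reduce pattern applied to a word count; the input lines are made up purely for illustration.

        # Toy illustration of the MapReduce idea: count word frequencies by
        # mapping lines to (word, 1) pairs, then grouping by key and summing.
        lines <- c("big data is big", "data moves fast")

        # Map: split each line into words and emit a count of 1 per word
        mapped <- unlist(lapply(lines, function(l) {
          words <- strsplit(l, " ")[[1]]
          setNames(rep(1, length(words)), words)
        }))

        # Shuffle + Reduce: group the emitted pairs by key and sum the counts
        word_counts <- tapply(mapped, names(mapped), sum)
        print(word_counts)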

    Variety

    Traditionally, data is stored in relational databases or data warehouses. These structured tools allow large volumes of information to be stored, accessed, queried and reported on. However, their structure is what restricts them from being truly useful when dealing with big data.
    One of the underlying principles of Big Data is that if you can get data you should keep it, as there may well be useful bits in it that you just have not realised are useful yet. However, data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions. Even what should be similar data, such as browser data, smart car data and financial transactions, varies from source to source and tends not to be interoperable.
    A common issue in processing Big Data is taking unstructured data and making it usable for analysis by machine or human. With a relational database, when getting a user to enter an address, you can insist on and validate fields such as Country, County and Town. However, if someone free-types “Dublin” into a search engine, Google needs to decide whether they mean Dublin, Ireland or one of the Dublins in the US, Canada or even Belarus.
    One of the key features of Hadoop is that it looks to process large volumes of unstructured data, manipulating and reformatting it in such a way that it can be successfully queried by traditional methods.

    Velocity

    The large volumes of data being produced are not all made up of large files and images. The majority of the data is made up of small numbers, words or sentences, streaming through the internet at immense speed. Twitter processes an average of 6,000 tweets per second, Facebook sees 4 million likes per minute and Google completes 3.5 billion searches per day! Outside of the internet, manufacturing businesses are tracking all facets of production, transport companies follow the location of their vehicles and sometimes individual packages, while shops watch our every purchase.
    For businesses to truly utilise Big Data, they need to be able to analyse it as soon as possible. Querying last month’s sales data might show you that a particular product was especially popular and that you ran out of stock. Buying large volumes this month might lead to it sitting on the shelf, as its popularity was limited. Many stores recently invested in Pokémon Go merchandise only to realise its popularity had faded in a matter of weeks.

    Veracity

    This relates to the quality and accuracy of the data. With data coming in such high volumes, from multiple sources and at such high speeds, it can be difficult to trust that rapid analysis and turnaround of the information is accurate. Traditional data management techniques, using structured relational databases and data warehouses, provide a consistent and therefore usually accurate solution. As the data is rarely live, it can be cleaned, verified and formatted. Data sets must be considered accurate, and algorithms robust and intelligent. This is especially important where decisions made on the results are automated.
    In 2010, the US stock market “Flash Crash” saw the Dow Jones drop by almost 10% in a matter of minutes. It was believed to have been caused by a number of algorithms which had the power to buy and sell stocks when certain criteria were reached. Their actions seemed to compound the criteria, which in turn led them to continue their buying and selling. In 2013, Google closed down its Google Flu Trends project, which analysed people’s searches for flu-like symptoms in an effort to identify potential flu outbreaks, after miscalculating expected numbers in a flu season by 140%. Issues such as Google’s own release of enhanced health-based add-ons, which caused more people to search for “flu” or similar keywords, threw off their own results.

    Conclusion

    When it comes to Big Data, the most important factor to consider is the Value. It is what organizations do with the data that matters. Big data can be analysed for insights that lead to better decisions and strategic business moves!
    American Express started looking for indicators that could predict loyalty and developed sophisticated predictive models to analyse historical transactions and 115 variables to forecast potential churn. The company believes it can now identify 24% of accounts that will close within the next four months.
    Uber is cutting the number of cars on the roads of London by a third through UberPool, which caters to users who are interested in lowering their carbon footprint and fuel costs. Uber’s business is built on Big Data, with user data on both drivers and passengers fed into algorithms to find suitable and cost-effective matches and set fare rates.


    Power BI


    Microsoft’s Power BI

    Power BI is a tool from Microsoft that allows users to source, visualize and model data and then publish and share the resulting reports. It allows you to pull data from multiple sources into a single report: from files such as Excel or CSV, from databases with direct SQL connections, and even from web sources including Facebook, Google Analytics, Zendesk and many more. The reports it generates include graphs and charts, tables, maps and interactive KPI visuals. It even allows the use of R scripts to analyse and present data. Reports are presented as tiles, and tiles can be resized and moved around easily to make attractive, user-friendly dashboards. Power BI is a relatively new product and the exciting thing is that Microsoft is releasing monthly updates with additional features and functions (https://powerbi.microsoft.com/en-us/blog/power-bi-desktop-march-feature-summary/).
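    As a small example of the R integration, here is the sort of script you could drop into an R script visual. It assumes the fields you drag into the visual arrive as a data frame called dataset (Power BI’s convention for R visuals), and the Genre and Budget columns here are hypothetical.

        # Sketch of an R script visual: average budget per genre as a bar chart.
        # 'dataset' is the data frame Power BI passes to an R visual; the Genre
        # and Budget columns are hypothetical.
        avg_budget <- aggregate(Budget ~ Genre, data = dataset, FUN = mean)
        barplot(avg_budget$Budget,
                names.arg = avg_budget$Genre,
                las = 2,    # rotate the genre labels for readability
                main = "Average budget by genre")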

    In order to test its functionality, I popped over to the large data analysis resource that is Kaggle. Kaggle was recently purchased by Google so they must be good (or evil……). Based on this I can only assume that Google intend to purchase the game Boggle, a jungle and maybe the rights to jingle bells (and everything else with a name like theirs). I downloaded a data-set of 5,000 movies with some basic headings such as Name, Genre, Critic Ratings, Budget and Release year. Quickly throwing year of release and budget into a bar chart showed that the data-set was only 2007-2011 (which I hadn’t actually checked previously) and that beyond a small drop in 2009, more and more money is being spent on movies. The 2009 drop can be explained by the global economic crash of 2007/2008 which would have affected movies coming out in 2009.

    Next I created a donut chart which split out the genres of the movies released in the data set. Without going into too much detail, we can clearly see that the top genres were Comedy, Action and Drama. However, where things get interesting is when you use multiple tools in tandem. From the dashboard I can easily resize the tools and align them across the page as required. Now, clicking on (and in turn selecting) the data in one of the tools affects the other tools also. Selecting the release year of 2011 in table 1 highlights the data in table 2 that refers to that year. You can clearly see in the donut that Thrillers and, to a greater degree, Romance movies made up a much higher percentage of the overall movies released that year.

    In turn, clicking on Comedy in table 2 and looking at the effect on table 1 shows that the amount of money invested in comedies decreased steadily between 2008 and 2011.

    One last example (just because I like how it looked) is the tree map. Taking genre in both tables, with average budget in table 1 and average user review in table 2, a five-second report shows that big-budget action and adventure movies get poor reviews, while low-budget dramas and even horrors score higher on average.

    All of the analysis in Power BI is quick and responsive (though the program does take a while to start up). The graphics are clear and crisp and I can really see this being used as a quick and effective dashboard. It can be linked to live data sources, so it can be set up for senior management and accessed by them as required with minimal housekeeping. One of the most powerful features of Power BI is the ability to publish dashboards to the web with the click of a button for access by multiple users. My report can be accessed at https://app.powerbi.com/groups/me/reports/fee7d94e-6a99-4d32-8bd3-1b686f7d21e5/ReportSection . All of the tools are interactive in the manner mentioned above and users can view tools in full screen as well as export data (for the whole table or just the sections highlighted). This is a tool that I can definitely see myself introducing to my workplace.


    Titanic (I haven’t seen the movie yet, please don’t spoil the ending)


    Titanic: Machine Learning from Disaster

    Kaggle (https://www.kaggle.com/) is a great resource for viewing cool data analysis carried out by like-minded enthusiasts (as well as posting some of your own) and for challenging users to competitions to test their knowledge and hone their skills. The datasets can be provided by Kaggle or, more often, by the users themselves, scraping and sourcing data from various locations. When doing a course on data analytics, it is very easy to get caught up looking through the various data sets, as the ones mentioned tend to err towards the more interesting and topically relevant (although there are plenty of sleep inducers!).

    Just a quick look at the datasets (found at https://www.kaggle.com/datasets) has lost me 20 minutes while writing this. Celebrity Deaths is currently trending big time after a truly horrific 2016. What should ring as more horrific is the Global Terrorism Database, especially when glancing at a global heat map showing Northern Ireland as one of the more densely affected areas based on number of events (covering 1970-2016). Video game sales with review scores and ESRB ratings has been bookmarked, but I think it is the 5000+ movie data set that is going to make up my next blog (or even just an evening of my time). Anyway…. Back on track….

    The competition page consists of several challenges with varying degrees of difficulty that can be tackled individually or as part of a group, using one or more analysis techniques or languages. Each has one or more tasks to complete, usually involving generating the best algorithm to provide an analytical solution not only to the dataset provided but potentially to other similar datasets. Challenges can even have prize money awarded for the best solution (usually where the best solution can and will be used by the setter of the competition to make or save money in the future). Some of the challenges are more for educational purposes and are kept open on the site for many years. The one I am looking to complete is “Titanic: Machine Learning from Disaster”, which has been up since 2012 and will remain there until at least 2020. This has allowed many users to post their solutions in detail, using various languages and various techniques. I am going to go through one posted by a site called DataCamp, who offer several free tutorials and courses. The site allows you to enter R code directly into the browser, as well as providing hints and tips on what you are doing right and wrong. The course is located at https://www.datacamp.com/community/open-courses/kaggle-tutorial-on-machine-learing-the-sinking-of-the-titanic#gs.yReMDSs

    When loading up the tutorial, I got the option to do a beginner’s course in R, but of course I’m far too advanced for that (eh, fingers crossed). The site gives you directions, pushes you in the right direction and awards XP for completing tasks accurately (and deducts XP for using hints).

    The first thing I learned was that 722 people actually survived, which at almost 1/3 of the total has made me think that just maybe they’ve been exaggerating how bad an outcome it was!! I thought only Rose survived (must admit to not having watched the film). Considering where it sank, they were quite lucky to be rescued at all…. After some tough lessons, I realised that my spelling was worse than my R. Who knew “Survived” had an “I” in it, or that R was case sensitive… I very quickly got into setting up new columns and setting their values based on data in other columns, something I’d do regularly in Excel and am realising is not as daunting in R as it first (and second) seemed.

    Using train$Child[train$Age < 18] <- 1 and train$Child[train$Age >= 18] <- 0, I defined everyone on the boat under 18 as a child and identified that ~54% of them survived. However, back in 1912 the definition of a child was not quite 18. Setting the age in the code above to 14 and using table(train$Child), I identified that of the 113 under 18, only 71 were under 14, at which point the survival rate increased to 59%. Below 9, this increased to 67%. Bearing in mind the survival rate of females in the same data set was >74%, it was more a case of “women and young children first!”.
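    For reference, here is a minimal sketch of that child flag and survival check, assuming the Kaggle training CSV (train.csv) is sitting in the working directory with its usual Age and Survived columns:

        # Load the Kaggle training set (path assumed) and flag children
        train <- read.csv("train.csv")
        train$Child <- NA
        train$Child[train$Age < 18] <- 1
        train$Child[train$Age >= 18] <- 0

        # Survival proportions for adults (0) vs children (1)
        prop.table(table(train$Child, train$Survived), margin = 1)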

    The second chapter of the tutorial moved on to Decision Trees. The decision tree function runs an algorithm that scans all the variables in the data set and identifies the best ones to split by, one level at a time. With a seemingly simple piece of code, R had completed some data analysis on my behalf. Loading and utilising some libraries had made the data look almost presentable. I must confess to having to take a long look back over the notes to fully understand what “I” had managed to do!

    Using the “predict” function, I was able to use the decision tree created above to actually make a go at the solution. In line with the format outlined on the Kaggle site, I was able to create my first submission (using write.csv(my_solution, file = “my_solution.csv”, row.names = FALSE)), ready to upload! The final part of the tutorial involved re-engineering the data so that new columns were added to influence the tree above. Utilising data such as family size (under the assumption that larger families may take longer to get together) and titles such as Mr., Dr., Rev. etc. (assuming certain ones would affect survival), I was able to enhance the quality of my submission to Kaggle!!
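    Putting those steps together, here is a rough sketch of the tree-plus-submission workflow. It assumes the Kaggle CSVs are in the working directory; rpart is the usual R package for this kind of tree, and the predictor list is a plausible one rather than the tutorial’s verbatim formula.

        # Load the Kaggle Titanic files (paths assumed)
        library(rpart)
        train <- read.csv("train.csv")
        test  <- read.csv("test.csv")

        # Grow a classification tree on some of the obvious predictors
        my_tree <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                         data = train, method = "class")

        # Predict survival for the test set and build the two-column submission
        my_prediction <- predict(my_tree, newdata = test, type = "class")
        my_solution <- data.frame(PassengerId = test$PassengerId,
                                  Survived = my_prediction)
        write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)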

    Not going to complain for my first attempt (but I think there just might be room for improvement :))!

    Polls (good god), what are they good for, absolutely nothing!


    The Science of Polls

    In my first year in DBS, during my “Maths & Statistics” class, I learned that polling a little over 1,000 people in a population of 100,000+ will give a result with a margin of error of only 3%. In my second year, during “Information Systems”, I learned that information produced in polls could be used as part of BI (Business Intelligence) and DSS (Decision Support Systems) in guiding decision-making processes. During my last year in college I learned that polls were not worth the millions spent on gathering the information, reporting it and the media coverage that circles around the results, after yet another failure by the polls in predicting a political outcome.

    The sample size of only 1,000 (especially when considering the US voting population of 230,000,000) seems extremely small. However, statistical analysis has shown, using standard deviation calculations, that above 1,000 the variance of a poll from the overall result is minor, accepting a 3% margin of error. As the acceptable error decreases, the sample size needs to increase, but 3% has become the accepted norm, as going below that becomes cost-prohibitive very quickly. A. C. Nielsen Jr., president of the A.C. Nielsen global marketing research firm, is quoted as saying “If you don’t believe in sampling, be sure—the next time you have a blood test…make them take it all!”

    <Insert Image of Blood Sucking Here>

    (I had planned an amusing image here of a vampire bleeding someone dry of their blood, but after viewing >1,000 images from Twilight, Vampire Diaries and True Blood, I had to give up. Instead here’s the formula for calculating the margin of error, simples…..)

    Margin of Error
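    As a quick sketch, the standard 95% margin of error for a simple random sample works out in R as follows (worst case p = 0.5, z = 1.96):

        # Margin of error for a simple random sample at 95% confidence
        margin_of_error <- function(n, p = 0.5, z = 1.96) {
          z * sqrt(p * (1 - p) / n)
        }

        margin_of_error(1000)    # ~0.031, i.e. roughly the 3% quoted above
        margin_of_error(10000)   # ~0.0098 -- ten times the sample for ~1%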

    Random Samples

    One of the core assumptions in carrying out a poll is that it uses a truly random sample of people. This means, for example, that the poll could not be taken solely within the hours of 9-5, as it would exclude the majority of the workforce. Equally, it could not be carried out in a small geographical location, as there may be a high or low average wage in the area. Other things that may need to be taken into consideration (depending on the poll type) are sex, age, race, colour and religion, amongst many others. However, even assuming that you got remotely close to a good balance of all of these factors, you are still relying on one thing: the honesty of people.

    When Polls Fail…..

    Over the last number of years, in a varying number of situations and countries, the polls have shown again and again that they cannot be truly trusted. Ireland this year (2016) had a general election with polls at one stage predicting a Fianna Fáil landslide. Later there was a socialist push and Sinn Féin were all but certainties to lead a coalition of AAA, PBP and the CBBs. In 2008 polls had Ireland voting yes to the Lisbon Treaty, which eventually lost, with many citing a lack of understanding as to its actual impact. One year later, once people were adequately educated, they came out in force giving a clear no to the polls…. the treaty was passed in 2009. In the 2015 UK election, pollsters had a 0-3 point gap between Conservatives and Labour in 81 of the 92 “scientific polls” (http://www.bbc.com/news/uk-politics-32751993). Cameron and his Conservative party romped home with almost 37% vs Labour’s 30%. The most stand-out poll failure (excluding Trump’s victory) in recent times is obviously Brexit. Polls were showing some tight margins, giving Remain the likely win. Voting on the day was coming to an end, and by 20:00 Farage and Johnson were going to bed with their tails between their legs. I awoke the following morning to interviews with comments such as “eh, yes, we may have said it but, eh, we didn’t actually think we’d eh…..”


    Let the Public Have Their Say….

    A TV show I watch from the US called “Last Week Tonight with John Oliver” brought me to a news story broadcast on CNN after the September national debate between Trump and Clinton. It references multiple speeches from Trump stating he had “won” the debate and that this was clear from the polls. However, when the CNN reporter questioned the source of Trump’s data, the polls he cited were thrown back as being “un-scientific”. They came from sources such as Facebook, Time, CBS New York and the Washington Times public votes. As they were not carried out within “scientific parameters” they were not considered legitimate. However, it could be argued that when people vote in these polls from the privacy of their own homes, not answering to, or looking in the eye of, an official pollster, they are potentially being more honest with themselves and more aligned with how they will act on election day itself.

    Eh, maybe the US media needs to change its sources: http://www.dailymail.co.uk/news/article-3809204/Most-snap-polls-Trump-winning-debate-landslide.html

    While the sample pool of those accessing these polls cannot be guaranteed to be a fair distribution of the population, the increase in those with access to the internet, the ability to access it through multiple devices and the large levels of subscription to services such as Facebook suggest that these online polls may be a more accurate indicator of the actual results than the “scientific” ones….

    (To avoid any confusion, I do not support Trump and have a genuine fear of what is to come….. However, I equally do not support the media and their control over people in their ability to report their version of the truth. Many have predicted the internet as the death of good journalism and as a whole I think it a valid point. However, I hope that the internet can be recognised as a source of the voice of the people and that this valuable resource can be channelled in new and creative ways!)

    References

    http://www.barrypopik.com/index.php/new_york_city/entry/if_you_dont_believe_in_random_sampling (checked 12/11/2016)

    Margin of Error Formula http://www.raosoft.com/samplesize.html (checked 29/11/2016)

    A Couple of worthwhile reads on the subject of polls.

    read http://www.pollingreport.com/ncpp.htm

    read http://www.redcresearch.ie/latest-polls/faq/

    2 R R not 2 R


    History of R

    Introduction to R

    The first step to the project was to complete an online R training module at the http://tryr.codeschool.com/ website. The module is broken into 8 sections that, with hands on use, quickly give a basic understanding of R and its core structure.

    1. Using R – Basic, mathematical, logical….. eh… what’s this section called files about?
    2. Vectors – Ok, they’re like arrays only with pirate dogs…
    3. Matrices – Ah, making some sense and producing some cool graphics
    4. Summary Statistics – averages & deviations (fun times)
    5. Factors – pch what?
    6. Data Frames – Now we’re getting somewhere
    7. Real-World Data – About Time!
    8. What’s Next – Check out my badge…

    Course Ratings….

    [Graph: interest, understanding and complexity ratings across the course]

    Here is a graph of how interesting I found the tutorial, how much I understood what was going on and how complex I felt it to be. With Interest in orange and Understanding in blue, there is a positive trend as the course went on. Complexity took a little dip but I’m sure a little practice will help there!!! As you can see, there is a direct correlation (and it is safe to say causation) between interest and understanding, and a direct inverse correlation with the complexity of the topics!

    Data Set

    In an effort to decide what to actually do the analysis on, I ventured onto the internet to see what types of datasets were readily available. It became very apparent very quickly that there were literally hundreds of repositories of data, not only available to dig through but presented in clearly usable formats such as simple CSV or SQL databases. There are many large government sites, census data, cia.gov, as well as international bodies such as UNICEF and WHO and private companies such as AWS, Google and Facebook. (Some hours later….) I took a step back and instead decided it better to choose a topic and search accordingly. As I type this, Liverpool FC are all but top of the league barring goal difference, and this has inspired some soccer-related statistics. Again, there are plenty of sites, some more guarded with their data than others, but in the end I was able to get what I wanted from a site called “football-data.co.uk”, which mainly deals in a little bit of gambling.

    Correlation

    Fair Treatment

    Sometimes you watch certain Premier League teams who, over the years, have gained themselves a bad reputation for surrounding the referee if they feel that life just isn’t going their way. Teams like this go out of their way to get opposition players booked. In an ideal world, the referee would not bow down to such pressure, but often the impact of the players or the pressure of a home game can be enough to sway many a decision. In order to analyse this, I looked at the number of cards awarded against teams in the 2015/2016 season and also the number of cards awarded to their opponents in each match. Note I did originally look at the red cards, but as they only made up <5% of the overall cards awarded and their distribution gave no real insight, I have removed them from the data-set. So, what did I find, I hear you ask?

    Position Vs Cards

    Firstly, there is definitely a correlation between the number of bookings and final league position, with the lower-placed teams maybe playing a little bit dirtier (or the bigger teams getting away with more???). League champions Leicester ended the season with a very professional 48 yellow cards. The dainty legs of Arsenal only earned a league minimum of 39. Spurs were up there with 72, but do bear in mind they got 9 cards against Chelsea at Stamford Bridge alone. Beyond that there is a bit of a trend towards more bookings for the lower teams, with bottom-placed Villa topping the table!
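    In R, that check boils down to a correlation between final position and total bookings. The sketch below uses an illustrative data frame: a few of the card totals are the ones quoted in this post and the rest are made up, as the real figures came from the football-data.co.uk files.

        # Illustrative season summary: one row per team, final Position and
        # total Yellows (only some values match the real 2015/16 numbers).
        season <- data.frame(
          Position = 1:20,
          Yellows  = c(48, 39, 72, 44, 50, 52, 54, 55, 56, 58,
                       58, 60, 61, 62, 62, 63, 63, 64, 71, 75)
        )

        # A positive correlation means lower-placed (higher-numbered) teams
        # tend to pick up more cards
        cor(season$Position, season$Yellows)
        plot(season$Position, season$Yellows,
             xlab = "Final league position", ylab = "Yellow cards")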

    Ugly Data

    Using R for more complex operations led to the realisation that, as powerful a tool as it is, it cannot pull information from random data without at least a little bit of foresight from the user.

    These 2 tables (fondly renamed the ugly sisters after hours of attempting to make something beautiful from them) are proof that planning is required before trying to present data. There is a bit of random beauty in them if you keep staring for 60 minutes but the glass slipper just does not fit.

    Coming Together

    A little bit of research later, and a little bit of planning, and finally the numbers are starting to come together!! A beautifully designed (notched) and coloured box plot shows us an average of 61 cards per season, with most teams falling between the lower and upper quartiles of 54 and 63. A range of 44 to 75 captures the rest of the teams, except Arsenal, who we identified earlier as a very polite team and who now show up as an outlier to the “normal” data with only 39 bookings.

    Looking at the number of teams grouped by the number of bookings received (in bands of 5, i.e. 40-45, 45-50, 50-55 etc.), we see what is normally expected: a peak close to the middle, dipping either side. Curiously, there is a long build-up, with a good number of teams showing low numbers before we hit the general averages of 55-65. There is a big drop after that, with no teams at the 65-70 level and only Aston Villa, West Brom and the team who lost their patience as they watched the league slip away from them in the last few games (Spurs!!) on over 70 cards.
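    The same illustrative season data frame from the snippet above can be pushed through a notched box plot and a banded histogram in a couple of lines:

        # Box plot of yellow cards per team (notched, as described above)
        boxplot(season$Yellows, notch = TRUE, col = "lightgreen",
                main = "Yellow cards per team, 2015/16",
                ylab = "Yellow cards")

        # Teams grouped into bands of 5 bookings (40-45, 45-50, ...)
        hist(season$Yellows, breaks = seq(35, 80, by = 5),
             main = "Teams by number of bookings", xlab = "Yellow cards",
             col = "lightblue")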

    Results

    I did come away with some interesting nuggets of information from the analysis.

    1. Chelsea, who have a reputation for surrounding the referee, received a total of 58 yellow cards (1.5 per match) over the season, which ranks them pretty much average. The interesting thing is that, over their 38 matches, their opponents received a total of 108 yellow cards, an average of almost 3 per match and almost double what Chelsea themselves received. Considering their poor form that season, it is hard to imagine that their silky skills were drawing vicious tackles, and (with the aid of further analysis) one could almost assign causation between their treatment of referees and the referees’ treatment of their opponents!
    2. The bottom club at the end of the season, Aston Villa, ended the campaign with a league-high 75 bookings, showing that their ability to kick players rather than the ball was not a match-winning tactic.
    3. Liverpool received a massive 15 more bookings in away games (38 in total) than in home games (only 23). This is the biggest difference in the league and would suggest that just maybe, the Anfield Roar is just enough to scare the referee from pulling out the card too many times!

    Please note that all of the above was done without the aid of the very powerful R package ggplot2, as I thought it would be unfair to show off at this early stage. Wait until you see my next blog!!!!!

    Fusion Tables


    Task

    For the project, I was tasked with using Google Fusion Tables to create an intensity distribution of county colours based on population density (basically colour in a map). With the power of the internet at my disposal and the flexibility to analyse the data as I saw fit (within reason), I set off to look at the population across the Republic of Ireland (ROI) and Northern Ireland (NI).

    Fusion Tables

    [Image: UK postcode zone map built in Excel]

    Here is a map of the UK that was created in Excel and consists of a separate shape for each postcode, all independently movable. I spent quite a number of hours colouring in each individual shape in order to present zonal proposals to customers. Probably would have been quicker with a fusion table…..

    When I first saw the fusion tables I was quite excited to see what could be done with them and after a little use was amazed just how easily they could be customised. I was also more than a little annoyed as they provided tools that I could have really used in a role in my job that I only moved away from last year. A large part of my job was analysing customer data (volumes, locations, frequencies etc.) and presenting it in various formats to people throughout the management chain. The ability to create professional looking mapping information would have been highly beneficial in the role and the ability to do it so fast would have been highly beneficial in getting me home on time. However, I will take this as a positive learning experience and have already spoken to the guy who now fills that role and arranged some time to sit with him and hopefully impart some knowledge.

    Sourcing Data

    There are literally tons of places to go to get data on the web, and it rarely takes much more than a search box and a little patience. The ROI counties were retrieved from Google’s own repository of mapping data, https://research.google.com/tables. The KML I found actually covered Southern & Northern Ireland as well as England, Scotland and Wales. The NI data was unfortunately for townlands and not just counties, so I had to source the county data elsewhere.

    Ireland

    My first Fusion Map!!!

    The NI counties were sourced from https://www.opendatani.gov.uk/dataset/osni-open-data-largescale-boundaries-county-boundaries#. The www.opendatani.gov.uk site has a large number of data sets on various government-related subjects and provides the data in a number of formats (KML, CSV, JSON, ArcGIS etc.) so they are available for use in multiple applications.

    The Southern & Northern Ireland KML data had to be merged. The first thing I noticed was that the Northern map was far more detailed, with far more co-ordinates used to make up each county. In fact, the NI file with 6 counties was larger than the ROI file with 26! After much deliberation on how to combine the files – exporting to CSV, copying and pasting from one KML to another – I had a brainwave to use a Fusion Table to fuse them together. I might just be getting the hang of this….. No, that didn’t work, so I went back to copying and pasting from KML.

    For the Population data, I headed over to the CSO website. We had been given a link with the relevant data but the format was a little messy. Digging around the site uncovered http://www.cso.ie/en/census/interactivetables/ where all manner of census data since the mid-1800s was easily (and neatly) available. The NI population data was found not on their census site (as they used townlands) but instead on a genealogy site.

    Using the Data

    Now, with all the data in place and a respectable looking map, it’s time to draw some insights. When using the feature map it is possible to change a number of the “feature styles” of the map, such as the markers (if using points on a map), the lines around the map, the side legend and of course the polygons that are being used in my fusion map. Using the “Fill Colours” and “Buckets” options, it is possible to logically colour each of the counties in the map. All colours are available, down to a six-character hexadecimal designation, but for simplicity I stuck with varying shades of green (sure what else for Ireland?). Splitting the population evenly over 7 groups leads to the realisation that the majority of the counties (22 ROI / 1 NI) have a population of <200,000 people, which leads to a very bland colour scheme. Updating the buckets with range caps of 65k, 85k, 125k, 157.5k, 210k and 1 million gives a clearer picture of the spread. Turning on the legend also allowed people accessing the data to understand the choices made, as, without this knowledge, the presenter of the data could easily skew it.
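    The same bucketing logic is easy to reproduce in R with cut(); the data frame below is purely illustrative (the real population figures came from the CSO and opendatani.gov.uk).

        # Illustrative county populations; the real data came from the census
        counties <- data.frame(
          County     = c("Dublin", "Cork", "Leitrim", "Antrim"),
          Population = c(1273069, 519032, 31798, 618000)
        )

        # The same range caps used for the Fusion Table buckets
        caps <- c(0, 65000, 85000, 125000, 157500, 210000, 1000000, Inf)
        counties$Bucket <- cut(counties$Population, breaks = caps,
                               labels = c("<65k", "65-85k", "85-125k",
                                          "125-157.5k", "157.5-210k",
                                          "210k-1M", ">1M"))
        counties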

    This beautiful map can be viewed at……..

    https://www.google.com/fusiontables/DataSource?docid=1xcwze1kjl3yV2f-dm_pOiklfQL4NISTHmqWOFZ5u

    Looking at the data you can see a clear spread of population to key areas: North (NI), South (Cork), East (Dublin) and West (Galway). All of these population centres are coastal, strengthened by access to the sea and trade routes in the past. Coastal areas in general seem to have higher populations, with a band running north to south from Sligo to Laois showing the less populated midlands.

    Other interesting Data

    In going through the census data, I came across populations for the last 150 years. I took a look at the 1841 census to see what percentage of the total population each of the counties made up (Southern Ireland only). Thinking there wouldn’t be much of a difference from the 2011 spread of population, I ran the numbers (in Excel) and was quite surprised at what I found. Since 1841, Ireland’s population has dropped almost 30%, from 6.5M to almost 4.6M. However, the population of Dublin in the same period has increased from 372k to 1.27M – roughly 3.4 times its 1841 figure – while the country as a whole dropped by 30%! In 1841, Dublin made up a mere 5.7% of the national population; it now makes up 27.7%. Counties bordering Dublin – Louth, Meath, Kildare and Wicklow – have all seen increases.

    So, where is the exodus coming from? Well, almost every other county in the country has seen its overall share of the national population drop by between 1 and 2 percentage points. Surprisingly, Mayo is the county that has seen the biggest drop in representation. In 1841, Mayo made up 6% of the population (that’s more than Dublin), while it has since seen a drop of over a quarter of a million people, leaving its 130k residents making up only 2.8% of the country. The table below shows the areas where shares have decreased in darkest blue and the areas of increase in reds.
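    For what it’s worth, the same share calculation is a one-liner in R. The figures below are just the rounded ones quoted above, for Dublin and Mayo only.

        # Share of the national population in 1841 vs 2011 (rounded figures
        # quoted in this post; Dublin and Mayo only, for illustration)
        pop <- data.frame(
          County  = c("Dublin", "Mayo"),
          Pop1841 = c(372000, 390000),
          Pop2011 = c(1270000, 130000)
        )
        national_1841 <- 6500000
        national_2011 <- 4600000

        pop$Share1841 <- round(100 * pop$Pop1841 / national_1841, 1)
        pop$Share2011 <- round(100 * pop$Pop2011 / national_2011, 1)
        pop$Change    <- pop$Share2011 - pop$Share1841
        pop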
