HPCC Systems is the Open Source Data Lake technology developed by LexisNexis Risk Solutions to solve Big Data problems for the LexisNexis business. The first version of the technology, based on distributed parallel processing, was developed in the year 2000.
HPCC Systems is designed to host and process petabytes of data and to integrate data from thousands of sources.
One important distinction between HPCC Systems and similar tools is that the innovation and quality of HPCC Systems have a direct impact on the success of LexisNexis. Therefore, HPCC Systems is rigorously tested and supported by experts in data analysis.
HPCC Systems is used and certified in production environments within LexisNexis.
The following are a few facts that you need to keep in mind:
- HPCC Systems is licensed under the Apache 2.0 License
- HPCC Systems is free to use for all purposes
- HPCC Systems is an enterprise-ready system
- Free online training and documentation
- AWS integration with open source CloudFormation templates
- Free support for startups and special projects
The HPCC Systems introduction is divided into the following two parts:
The goal of the architecture discussion is to introduce HPCC Systems from the viewpoint of a solutions developer and to answer questions such as "Why does HPCC Systems use a declarative approach to programming?" and "Why does HPCC Systems help data scientists who understand the domain?"
The goal of the ECL programming discussion is to introduce the data scientist to the minimum set of tools needed to solve a real-world problem. You will be amazed at how quickly you can get started and begin solving problems.
Understand how to code a program using a modular structure that enables you to create clean and reusable code.
Go to ECL Programming Structures
How can Spark programmers get started quickly without having to learn ECL from scratch?
You are introduced to an end-to-end coding example. Topics covered include importing, cleaning, enriching, and analyzing data, making predictions with machine learning, and exporting data.