
18 Apr 2018

STDashboard, a license-free way to create dashboards



The improvements in this version of STDashboard focus on the user interface for panels and dashboards, along with some performance enhancements and fixes for several old bugs. It works with Pentaho and can be embedded in web applications.

You can see it in action in this Pentaho Demo Online and as part of the LinceBI suite.

STDashboard doesn't require an annual license, you can manage unlimited users, and it is open source based.

STDashboard includes professional services (training, support and maintenance, documentation and bug resolution), so a high enterprise level of service is guaranteed.

Interested? Contact Stratebi or LinceBI.




See a Video Demo:


About UI improvements:

 - New set of predefined dashboard templates. We have designed a new way to manage dashboard panels that allows you to shape the dashboard in almost any combination of size, proportion and number of panels. For this reason we have created a set of different layouts for the most common cases.




- Embed in any web application. This sample shows STDashboard in LinceBI.




 - Self-managed panels. You can now add or remove panels easily in STDashboard using the button inside each panel header.



 - New layout management. An STDashboard layout is now composed of a list of panel containers, stacked vertically on the page. There are two types of container, horizontal and vertical, and each one stores a list of real panels (the ones where the graphs are drawn) in a horizontal or vertical flow. By combining these containers you can achieve almost any layout you can imagine (a minimal sketch of this structure appears after this list).





 - Resizable panels. We have added the ability to resize a panel horizontally or vertically, keeping the proportions of the graph inside it consistent with the horizontally adjacent panels, without producing a horizontal scroll on the page. That means that if you shrink a panel horizontally and there are other panels in the same row, the other panels also resize proportionally so that all the panels in the row fit the horizontal size of the window.

It is interesting to note that we implemented this functionality using the pure GWT API, to avoid external dependencies and ensure portability between browsers.

 - Draggable panels. Each panel in the dashboard can be dragged to any parent container. In the header of each panel there is a handle that allows dragging the panel to any panel container in the dashboard.



 - Responsive dashboard. The panels and graphs now resize dynamically when the window's dimensions change or when the user zooms the page; on most phones the dashboard is displayed proportionally while keeping the original layout.

 - Persistent layout state. When you save a dashboard to a file, we also store its visual state in that file. When you open the dashboard again, all the details of the visual interface are kept and you see the dashboard exactly as it was when saved, meaning panel sizes and locations are restored.
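As a rough illustration of the layout model described in the "New layout management" item above, here is a minimal Java sketch; the class and field names are hypothetical and not STDashboard's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the layout described above: a dashboard is a vertical stack of
// containers, and each container lays out its panels in a horizontal or vertical flow.
enum Flow { HORIZONTAL, VERTICAL }

class Panel {
    String title;          // shown in the panel header (with the drag handle and buttons)
    double relativeSize;   // share of its container, so sibling panels resize proportionally

    Panel(String title, double relativeSize) {
        this.title = title;
        this.relativeSize = relativeSize;
    }
}

class PanelContainer {
    Flow flow;                                  // horizontal or vertical flow of panels
    List<Panel> panels = new ArrayList<>();     // the real panels where graphs are drawn

    PanelContainer(Flow flow) {
        this.flow = flow;
    }
}

class DashboardLayout {
    List<PanelContainer> containers = new ArrayList<>();   // stacked vertically on the page
}
```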



About performance:

 - At some points in the application a specific query was causing performance problems. To know whether a member has children in a multilevel hierarchy, the previous code issued a query listing all the children of that member and checked whether the result size was greater than 0; our solution for this type of query was simply to check the level of the current member and answer that boolean question from it (see the sketch after this list).

 - Connection to cubes using the MondrianOlap4jDriver Java class. This improves connection performance and stability because the driver is designed for Mondrian connections; the previous code used a standard JDBC connection.
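A minimal sketch of both performance changes, using the olap4j API with Mondrian; the connection string, catalog path and the hasChildren helper are illustrative assumptions, not STDashboard's actual code:

```java
import java.sql.Connection;
import java.sql.DriverManager;

import org.olap4j.OlapConnection;
import org.olap4j.metadata.Member;

public class CubeAccess {
    public static void main(String[] args) throws Exception {
        // Use the Mondrian olap4j driver instead of a plain JDBC connection
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");

        // Hypothetical connection string: underlying JDBC data source plus Mondrian catalog
        Connection raw = DriverManager.getConnection(
                "jdbc:mondrian:Jdbc=jdbc:mysql://localhost/dw;"
                + "Catalog=/path/to/schema.xml;JdbcDrivers=com.mysql.jdbc.Driver;");
        OlapConnection conn = raw.unwrap(OlapConnection.class);
        // MDX queries would run through conn.createStatement().executeOlapQuery(...)
        conn.close();
    }

    // Cheap "has children" test: compare the member's depth with the deepest level of its
    // hierarchy, instead of listing and counting its children with an extra query.
    static boolean hasChildren(Member member) {
        int deepestLevel = member.getHierarchy().getLevels().size() - 1;
        return member.getDepth() < deepestLevel;
    }
}
```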


About new enhancements:

- Date configuration for filters. Date dimensions are special dimensions: almost every cube has at least one defined, and they are heavily used for range queries over the fact table. To allow dynamic filters in panels, we added a .properties file that lets the user define their date dimension and configure the way they want to use it in queries (a hypothetical example is sketched below).
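A small sketch of how such a configuration file might be read; the file name, keys and default values below are hypothetical, not STDashboard's actual format:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class DateDimensionConfig {
    public static void main(String[] args) throws IOException {
        // Hypothetical file contents, e.g.:
        //   date.dimension=[Time]
        //   date.hierarchy=[Time].[Year].[Quarter].[Month]
        //   date.format=yyyy-MM-dd
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("date-dimension.properties")) {
            props.load(in);
        }
        String dimension = props.getProperty("date.dimension", "[Time]");
        String format = props.getProperty("date.format", "yyyy-MM-dd");
        // The dashboard would use these values to build date-range filters for its queries
        System.out.println("Date dimension: " + dimension + ", format: " + format);
    }
}
```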



Added the Pentaho File Explorer, which allows users to navigate through the files stored in Pentaho (reports, documents, etc.) and embed it inside a panel in the dashboard.



14 Apr 2018

Free book: Front-End Developer Handbook 2018


Haven't you downloaded it yet? An essential book! Front-End Developer Handbook 2018

Download the PDF

Contents:

Introduction
What Is a Front-End Developer?
Recap of Front-end Dev in 2017
In 2018 expect...

Part I: The Front-End Practice
Front-End Jobs Titles
Common Web Tech Employed
Front-End Dev Skills
Front-End Devs Develop For...
Front-End on a Team
Generalist/Full-Stack Myth
Front-End interview questions
Front-End Job Boards
Front-End Salaries
How FDs Are Made

Part II: Learning Front-End Dev
Self Directed Learning
Learn Internet/Web
Learn Web Browsers
Learn DNS
Learn HTTP/Networks
Learn Web Hosting
Learn General Front-End Dev
Learn UI/Interaction Design
Learn HTML & CSS
Learn SEO
Learn JavaScript
Learn Web Animation
Learn DOM, BOM & jQuery
Learn Web Fonts, Icons, & Images
Learn Accessibility
Learn Web/Browser APIs
Learn JSON
Learn JS Templates
Learn Static Site Generators
Learn Computer Science via JS
Learn Front-End App Architecture
Learn Data API (i.e. JSON/REST) Design
Learn React
Learn State Management
Learn Progressive Web App
Learn JS API Design
Learn Web Dev Tools
Learn Command Line
Learn Node.js
Learn JS Modules
Learn JS Module loaders/bundlers
Learn Package Managers
Learn Version Control
Learn Build & Task Automation
Learn Site Performance Optimization
Learn Testing
Learn Headless Browsers
Learn Offline Dev
Learn Web/Browser/App Security
Learn Multi-Device Dev (e.g., RWD)
Directed Learning
Front-End Schools, Courses, & Bootcamps
Front-End Devs to Learn From
Newsletters, News, & Podcasts

Part III: Front-End Dev Tools
Doc/API Browsing Tools
SEO Tools
Prototyping & Wireframing Tools
Diagramming Tools
HTTP/Network Tools
Code Editing Tools
Browser Tools
HTML Tools
CSS Tools
DOM Tools
JavaScript Tools
Static Site Generators Tools
Accessibility Dev Tools
App Frameworks (Desktop, Mobile etc.) Tools
State Management Tools
Progressive Web App Tools
GUI Development/Build Tools
Templating/Data Binding Tools
UI Widget & Component Toolkits
Data Visualization (e.g., Charts) Tools
Graphics (e.g., SVG, canvas, webgl) Tools
Animation Tools
JSON Tools
Placeholder Images/Text Tools
Testing Tools
Front-end Data Storage Tools
Module/Package Loading Tools
Module/Package Repo. Tools
Hosting Tools
Project Management & Code Hosting
Collaboration & Communication Tools
CMS Hosted/API Tools
BAAS (for Front-End Devs) Tools
Offline Tools
Security Tools
Tasking (aka Build) Tools
Deployment Tools
Site/App Monitoring Tools
JS Error Monitoring Tools
Performance Tools
Tools for Finding Tools

13 Apr 2018

From Big Data to Fast Data



A very good article by Raul Estrada. Main points:

1. Data acquisition: pipeline for performance

In this step, data enters the system from diverse sources. The key focus of this stage is performance, as this step determines how much data the whole system can receive at any given point in time.


  • Technologies
    For this stage you should consider streaming APIs and messaging solutions like:
    • Apache Kafka - open-source stream processing platform
    • Akka Streams - open-source stream processing based on Akka
    • Amazon Kinesis - Amazon data stream processing solution
    • ActiveMQ - open-source message broker with a JMS client in Java
    • RabbitMQ - open-source message broker written in Erlang
    • JBoss AMQ - lightweight MOM developed by JBoss
    • Oracle Tuxedo - middleware message platform by Oracle
    • Sonic MQ - messaging system platform by Sonic
For handling many of these key principles of data acquisition, the winner is Apache Kafka: it is open source, focused on high throughput and low latency, and handles real-time data feeds.
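As a minimal sketch of this ingestion stage, the snippet below publishes one event to Kafka from Java; the broker address, topic name and message payload are hypothetical placeholders, not part of the original article:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventIngestion {
    public static void main(String[] args) {
        // Hypothetical broker address; in production this would point at the Kafka cluster
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one incoming event to a hypothetical "events" topic
            producer.send(new ProducerRecord<>("events", "sensor-42", "{\"temp\": 21.5}"));
        }
    }
}
```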


2. Data storage: flexible experimentation leads to solutions

There are a lot of points of view for designing this layer, but all should consider two perspectives: logical (i.e. the model) and physical data storage. The key focus for this stage is "experimentation" and flexibility.


  • Technologies
    For this stage consider distributed database storage solutions like:
    • Apache Cassandra - distributed NoSQL DBMS
    • Couchbase - NoSQL document-oriented database
    • Amazon DynamoDB - fully managed proprietary NoSQL database
    • Apache Hive - data warehouse built on Apache Hadoop
    • Redis - distributed in-memory key-value store
    • Riak - distributed NoSQL key-value data store
    • Neo4J - graph database management system
    • MariaDB - together with Galera, forms a replication cluster based on MySQL
    • MongoDB - cross-platform document-oriented database
    • MemSQL - distributed in-memory SQL RDBMS
For handling many of the key principles of data storage just explained, the most balanced option is Apache Cassandra. It is open source, distributed, NoSQL, and designed to handle large amounts of data across many commodity servers with no single point of failure.
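As a minimal sketch of this storage stage, here is a hedged example using the DataStax Java driver (3.x-style API); the contact point, keyspace and table are hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class EventStore {
    public static void main(String[] args) {
        // Assumes a Cassandra node on localhost; keyspace and table names are hypothetical
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS fastdata "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS fastdata.events "
                    + "(sensor_id text, ts timestamp, temp double, PRIMARY KEY (sensor_id, ts))");

            // Write one event and read it back by partition key
            session.execute("INSERT INTO fastdata.events (sensor_id, ts, temp) "
                    + "VALUES ('sensor-42', toTimestamp(now()), 21.5)");
            ResultSet rs = session.execute(
                    "SELECT ts, temp FROM fastdata.events WHERE sensor_id = 'sensor-42'");
            for (Row row : rs) {
                System.out.println(row.getTimestamp("ts") + " -> " + row.getDouble("temp"));
            }
        }
    }
}
```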


3. Data processing: combining tools and approaches

Years ago, there was discussion about whether big data systems should be (modern) stream processing or (traditional) batch processing. Today we know the correct answer for fast data is that most systems must be hybrid — both batch and stream at the same time. The type of processing is now defined by the process itself, not by the tool. The key focus of this stage is "combination."


  • Technologies
    For this stage, you should consider data processing solutions like:
    • Apache Spark - engine for large-scale data processing
    • Apache Flink - open-source stream processing framework
    • Apache Storm - open-source distributed realtime computation system
    • Apache Beam - open-source, unified model for batch and streaming data
    • Tensorflow - open-source library for machine intelligence
For managing many of the key principles of data processing just explained, the winner is a tie between Spark (micro-batching) and Flink (streaming).
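As a minimal sketch of this processing stage, here is a hedged Spark batch example in Java (local mode, toy data); it illustrates the general API rather than any specific pipeline from the article:

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Local mode for illustration; a real deployment would target a cluster master
        SparkSession spark = SparkSession.builder()
                .appName("FastDataWordCount")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Toy input data standing in for a real batch source
        JavaRDD<String> lines = sc.parallelize(
                Arrays.asList("big data", "fast data", "big fast data"));

        lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .collect()
             .forEach(t -> System.out.println(t._1() + ": " + t._2()));

        spark.stop();
    }
}
```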


4. Data visualization

Visualization communicates data or information by encoding it as visual objects in graphs, to clearly and efficiently get information to users. This stage is not easy; it’s both an art and a science.

Technologies

  • For this layer you should consider visualization solutions in these three categories:
Comparison of open source OLAP systems for Big Data

We have already talked a lot on this blog about our favorite open source OLAP solution for Big Data, which is Apache Kylin:





- x50 faster 'near real time' Big Data OLAP Analytics Architecture
- Use Case “Dashboard with Kylin (OLAP Hadoop) & Power BI”
- Cuadros de mando con Tableau y Apache Kylin (OLAP con Big Data)
- BI meet Big Data, a Happy Story
- 7 Ejemplos y Aplicaciones practicas de Big Data
- Analysis Big Data OLAP sobre Hadoop con Apache Kylin
- Real Time Analytics, concepts and tools
- Hadoop Hive y Pentaho: Business Intelligence con Big Data (Caso Practico)


Today we are going to tell you about other alternatives, thanks to Roman Leventov:

I want to compare ClickHouse, Druid and Pinot, the three open source data stores that run analytical queries over big volumes of data with interactive latencies.
ClickHouse, Druid and Pinot have fundamentally similar architectures, and occupy their own niche between general-purpose Big Data processing frameworks such as Impala, Presto and Spark, and columnar databases with proper support for unique primary keys, point updates and deletes, such as InfluxDB.
Due to their architectural similarity, ClickHouse, Druid and Pinot have approximately the same "optimization limit". But as of now, all three systems are immature and very far from that limit. Substantial efficiency improvements to any of these systems (when applied to a specific use case) are possible in a matter of a few engineer-months of work. I don't recommend comparing the performance of these systems at all; choose the one whose source code you are able to understand and modify, or the one in which you want to invest.
Among these three systems, ClickHouse stands a little apart from Druid and Pinot, while the latter two are almost identical: they are pretty much two independently developed implementations of exactly the same system.
ClickHouse more closely resembles "traditional" databases like PostgreSQL. A single-node installation of ClickHouse is even possible. On a small scale (less than 1 TB of memory, less than 100 CPU cores), ClickHouse is much more interesting than Druid or Pinot, if you still want to compare with them, because ClickHouse is simpler and has fewer moving parts and services. I would say that at this scale it competes with InfluxDB or Prometheus, rather than with Druid or Pinot.
Druid and Pinot more closely resemble other Big Data systems in the Hadoop ecosystem. They retain "self-driving" properties even at very large scale (more than 500 nodes), while ClickHouse requires a lot of attention from professional SREs. Druid and Pinot are also in a better position to optimize for the infrastructure costs of large clusters, and are better suited for cloud environments, than ClickHouse.
The only sustainable difference between Druid and Pinot is that Pinot depends on the Helix framework and is going to continue to depend on ZooKeeper, while Druid could move away from its dependency on ZooKeeper. On the other hand, Druid installations are going to continue to depend on the presence of some SQL database.
Currently Pinot is better optimized than Druid. (But please read again above: "I don't recommend comparing the performance of these systems at all", and the corresponding sections in the post.)

Do you know who coined the term 'Data Lake'?

What is a data lake?

A data lake is a repository designed to store large amounts of data in native form. This data can be structured, semi-structured or unstructured, and includes tables, text files, system logs, and more.
The term was coined by James Dixon, CTO of Pentaho, a business intelligence software company, and was meant to evoke a large reservoir into which vast amounts of data can be poured. Business users of all kinds can dip into the data lake and get the type of information they need for their application. The concept has gained in popularity with the explosion of machine data and the rapidly decreasing cost of storage.
There are key differences between data lakes and the data warehouses traditionally used for data analysis. First, data warehouses are designed for structured data. Related to this is the fact that data lakes do not impose a schema on the data when it is written, or ingested. Rather, the schema is applied when the data is read, or pulled, from the data lake, thus supporting multiple use cases on the same data. Lastly, data lakes have grown in popularity with the rise of data scientists, who tend to work in a more ad hoc, experimental fashion than the business analysts of yore.
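As a small, hedged illustration of the schema-on-read idea described above, the Spark sketch below reads raw JSON from a hypothetical lake path and infers the schema only at read time:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaOnRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SchemaOnRead")
                .master("local[*]")
                .getOrCreate();

        // Raw JSON files were dumped into the lake as-is; no schema was imposed at write time.
        // The schema is inferred only now, at read time, for this particular use case.
        Dataset<Row> events = spark.read().json("/data/lake/raw/events/*.json"); // hypothetical path
        events.printSchema();
        events.createOrReplaceTempView("events");
        spark.sql("SELECT count(*) AS n FROM events").show();

        spark.stop();
    }
}
```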


12 Apr 2018

Generatedata.com (create sample data for your pilot projects)


Do you need to customize the format of your sample or test data?

Well, that is the idea behind this program (Generatedata.com). It is a free, open source tool written in JavaScript, PHP and MySQL that lets you quickly generate large volumes of custom data in a variety of formats, for use in software testing, populating databases, and so on.


How to choose the best chart for each need?



This 'chart chooser' diagram is very useful, courtesy of stephanieevergreen.com


10 Apr 2018

Business Intelligence Dictionary: KPI


We continue with our Business Intelligence Dictionary, aimed at making it as simple as possible to learn these concepts. We have already started with MOLAP and ad hoc analysis.

Today it is the turn of KPIs (Key Performance Indicators), key business indicators:
We would say these are the indicators, calculations, ratios, metrics, etc. that allow us to measure the critical factors and aspects of a business. Some examples would be the monthly sales of the main lines of business, the cost of the main raw materials, the evolution of the payroll, the level of liquidity, etc.
These indicators must be observed and analyzed within the set of dimensions or axes that represent the business: time, products, cost centers, etc.

You can see a working example of an open source based Balanced Scorecard tool: STCard

Therefore, KPIs are not a technological term created by Business Intelligence; they are a concept tied to business management. However, the development of technology and of specialties such as Business Intelligence has made it possible to measure, monitor and visually represent them much more efficiently and quickly.
If we intend to manage our business efficiently, relying on Business Intelligence tools, but we do not use KPIs, it is as if we were driving along a mountain road at night without lights.

Normally, when defining KPIs, the SMART acronym is used to help in the selection process:

- Specific
- Measurable
- Achievable
- Realistic
- Timely


KPIs have become even more relevant, if that is possible, as the Balanced Scorecard methodology, created by the well-known professors Norton and Kaplan, has spread and become popular.
Introduced in 1992, the Balanced Scorecard (BSC) is a method for measuring a company's activities in terms of its vision and strategy. It gives managers an overall view of business performance.


It is a management tool that continuously shows whether a company and its employees are achieving the results targeted by the strategy.


In the visual representation of a Balanced Scorecard, it is very important to take into account aspects such as:

- Establishing the indicators (KPIs) by area or perspective
- Using traffic-light coding (yellow, red and green) to highlight trends and exceptions
- Giving detailed explanations of the expected behavior and target of each KPI
- Establishing the department and/or person responsible for each KPI (its definition, target and expected measurement, reference thresholds, etc.)
- Establishing the analysis period over which its value is measured and reviewed
- Defining the corrective actions or tasks derived from behavior outside the expected thresholds

9 Apr 2018

A day in the life of a Data Scientist


A very illustrative video about the day-to-day work of a Data Scientist

5 Apr 2018

15 data mistakes you should avoid



This infographic, which you can download from here, is very interesting: it shows and explains 15 typical mistakes that can lead us to make bad decisions when working with data.

A must-read!


Seen on Geckoboard