What is it?
Apache Spark is a general-purpose computing platform made up of several components, such as Spark Core, Spark SQL, Spark Streaming and Spark MLlib, the last of which is responsible for machine learning.
Apache Spark is an open source environment that processes large amounts of data in memory. As a result, it can be up to 100 times faster than technologies such as Hadoop MapReduce for some workloads. It is a distributed system, so it can be scaled easily as business needs grow.
Spark may be launched alongside Hadoop, on Mesos, in an entirely new environment or in the cloud. It connects to data sources such as HDFS, Cassandra, HBase, S3 and popular SQL databases: PostgreSQL, Oracle, MySQL. Spark can run in “standalone” mode or in a cluster, so there are many configuration options, and in most cases it can be adjusted to an existing IT environment.
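As an illustration of connecting to one of the SQL databases mentioned above, here is a minimal sketch that uses Spark's built-in JDBC data source from Python. The host name, database, table and credentials are hypothetical placeholders, and the helper names `jdbc_options` and `load_table` are our own, not part of Spark. The PostgreSQL JDBC driver would also need to be available on Spark's classpath.

```python
def jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source (PostgreSQL here)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def load_table(spark, options):
    """Return a DataFrame for the table described by `options`.

    `spark` is an existing SparkSession. This call only succeeds when the
    database is reachable and the JDBC driver jar is on the classpath.
    """
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
opts = jdbc_options(
    "db.example.internal", 5432, "sales", "public.orders",
    "report_user", "change-me",
)
print(opts["url"])
```

With a running session, `load_table(spark, opts)` would return a regular DataFrame that can be passed to Spark SQL or MLlib like any other.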
Spark MLlib is one of the components of the Apache Spark framework and benefits from all of its advantages. It makes it possible to apply machine learning to large data sets without scalability concerns. The library has dozens of built-in machine learning algorithms and utilities, which can be applied depending on the particular business case.
These include:
- Classification: logistic regression, naive Bayes
- Regression: generalized linear regression, isotonic regression
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixtures (GMMs)
- Topic modeling: latent Dirichlet allocation (LDA)
- Feature transformations: standardization, normalization, hashing
- Model evaluation and hyper-parameter tuning
- ML Pipeline construction
- ML persistence: saving and loading models and pipelines
- Survival analysis: accelerated failure time model
- Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
- Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA)
- Statistics: summary statistics, hypothesis testing
Another advantage is that the system is released under an open source licence and is one of the Apache Software Foundation’s top-level projects, developed with contributions from companies such as IBM, Facebook, Yahoo!, Intel, Cloudera, Hortonworks, Netflix and many others. A full list is available here. The Apache licence permits the use of Spark in commercial projects.
What is it used for?
Companies use Apache Spark MLlib to improve the quality of their operations. Its machine learning algorithms make it possible to discover new information about how the organization works.
Consequently, customer service, production, distribution or UX processes may be improved.
Examples include companies operating in the insurance, technology or finance sectors.
Several example uses are listed below:
Insurance:
Optimizing customer service by applying machine learning to sorting client queries by topic. Messages are directed to specialized staff and the client receives a to-the-point answer.
Insurance, finance:
Scoring model optimization for clients.
Finance:
Using predictive models to anticipate clients’ credit profiles for particular banking products.
Finance:
Real-time stock exchange data analysis which helps predict future stock exchange behaviour.
Public institutions:
Spending analysis by situation, time and category.
Health care:
Patient data analysis to expedite diagnostics.
Spark MLlib is also used to predict flight delays for airlines, to estimate real estate prices on various markets, to support marketing processes by analysing social media, and the like.
Our experience
BlueSoft successfully uses Apache Spark MLlib for clients in industries such as finance, telecoms and life science, and our expertise allows us to make full use of its capabilities.
The company has ample experience in business analysis, which helps clients identify the issues that can be optimized using machine learning. BlueSoft’s experienced staff deploy Apache Spark on time and with costs kept in check.
Spark is a platform which, if used properly, can greatly benefit an organization, yet extracting maximum value from the data at hand requires a degree of data science knowledge.
A well-assembled team and the Apache Spark platform can significantly optimize an organization’s operations and improve the quality of its products.
BlueSoft has successfully implemented many projects in this area. We will happily present our portfolio in person and answer further questions about the technology itself and the benefits its implementation can bring.