Submitted by MLRecipes t3_10wjenb in MachineLearning

The book has grown considerably since version 1.0. It started with synthetic data as one of the main components, while also diving into explainable AI, intuitive / interpretable machine learning, and generative AI. Now at 272 pages (up from 156 in the first version), the focus is clearly on synthetic data. Of course, I still discuss explainable and generative AI: these concepts are strongly related to data synthetization.

Agent-based modeling in action

However, many new chapters have been added, covering various aspects of synthetic data: in particular, working with more diversified real datasets, how to synthetize them, and how to generate high-quality random numbers with a very fast algorithm based on digits of irrational numbers. All chapters include visual illustrations and Python code. In addition to the newly added agent-based modeling, you will find material about:

  • GAN — generative adversarial networks applied using methods other than neural networks.
  • GMM — Gaussian mixture models and alternatives based on multivariate stochastic and lattice processes.
  • The Hellinger distance and other metrics to measure the quality of your synthetic data, and the limitations of these metrics.
  • The use of copulas, with detailed explanations of how they work, Python code, and an application to mimicking a real dataset.
  • Drawbacks associated with synthetic data, in particular a tendency to replicate algorithm bias that synthetization is supposed to eliminate (and how to avoid this).
  • A technique somewhat similar to ensemble methods / tree boosting but specific to data synthetization, to further enhance the value of synthetic data when blended with real data; the goal is to make predictions more robust and applicable to a wider range of observations truly different from those in your original training set.
  • Synthetizing nearest neighbor and collision graphs, locally random permutations, shapes, and an introduction to AI-art.
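
As a rough illustration of the Hellinger distance mentioned in the list above, here is a minimal sketch. It is not code from the book: the function names, the equal-width binning, and the default of 20 bins are all assumptions made for this example. It compares one numeric column of a real dataset with its synthetic counterpart, returning 0 for identical binned distributions and 1 for distributions with no overlap.

```python
import math
import random

def bin_proportions(values, lo, hi, n_bins):
    """Share of `values` falling in each of `n_bins` equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)  # keep v == hi inside the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))

def hellinger_from_samples(real, synth, n_bins=20):
    """Bin both samples on a shared grid, then compare the bin proportions."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    if hi == lo:
        return 0.0  # both samples are the same constant
    p = bin_proportions(real, lo, hi, n_bins)
    q = bin_proportions(synth, lo, hi, n_bins)
    return hellinger(p, q)

# Toy data (hypothetical): one faithful and one shifted "synthetic" sample.
rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
faithful = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(5000)]
print("faithful:", round(hellinger_from_samples(real, faithful), 3))
print("shifted: ", round(hellinger_from_samples(real, shifted), 3))
```

One limitation is visible even in this sketch: the result depends on the binning, and a column-by-column comparison says nothing about whether the dependencies between columns are preserved.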

Newly added applications cover numerous data types and datasets, including ocean tides in Dublin (synthetic time series), temperatures in the Chicago area (geospatial data), and the insurance dataset (tabular data). I also included some material from the course that I teach on the subject.
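
For readers unfamiliar with copulas, the idea of mimicking a tabular dataset can be sketched for two numeric columns as follows. This is a minimal Gaussian-copula sketch and not the book's implementation: all function names are hypothetical, and the choice of a Gaussian copula with empirical quantiles for the marginals is an assumption made for this example.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(column):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def empirical_quantile(sorted_column, u):
    """Value of the sorted real column at probability level u in [0, 1]."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]

def synthesize_pairs(x, y, n_rows, rng):
    """Draw synthetic (x, y) rows that keep each marginal and the rank dependence."""
    rho = correlation(normal_scores(x), normal_scores(y))
    xs, ys = sorted(x), sorted(y)
    rows = []
    for _ in range(n_rows):
        # Correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then back to the real marginals.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return rows

# Toy "real" data (hypothetical): a skewed column and a noisy linear function of it.
rng = random.Random(42)
real_x = [rng.lognormvariate(0.0, 0.5) for _ in range(2000)]
real_y = [2.0 * a + rng.gauss(0.0, 0.5) for a in real_x]
synthetic = synthesize_pairs(real_x, real_y, 2000, rng)
print("real correlation:     ", round(correlation(real_x, real_y), 3))
print("synthetic correlation:", round(correlation(*zip(*synthetic)), 3))
```

Because this sketch draws from empirical quantiles, every synthetic value already appears in the real column; it cannot generate values outside the observed range.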

For the time being, the book is available only in PDF format on my e-Store here, with numerous links, backlinks, an index, a glossary, a large bibliography, and navigation features to make it easy to browse. This book is a compact yet comprehensive resource on the topic, the first of its kind. The quality of the formatting and color illustrations is unusually high. I plan on adding new books in the future: the next one will be on chaotic dynamical systems with applications. The book on synthetic data has also been accepted by a major publisher, and a print version will be available. However, it may take a while before it gets released, and the PDF version has useful features that cannot be rendered well in print or on devices such as Kindle. Once published in the computer science series with the publisher in question, the PDF version may no longer be available. You can check out the content on my GitHub repository here, where the Python code, sample chapters, and datasets also reside.

31

Comments

thiru_2718 t1_j7o82dn wrote

Nice work! There are some intriguing sections here that I definitely want to take a look at.

Quick question, with regards to this quote in the preface: "For instance, regression techniques ... are presented as a single method, without using advanced linear algebra."

Are you referring to Generalized Linear Models? I don't see any references to GLMs, in my brief skim, but I can't think of how else regression can be presented as a single method.

Also, is there any place where we can get a preview of the "Shape Classification and Synthetization via Explainable AI" section?

6

MLRecipes OP t1_j7oec5y wrote

No. It does encompass GLM, but the technique also works when there is no response (you then need to put a constraint on the parameters), with truly nonlinear models (there are time series examples in the book), and for particular clustering cases. I like to call it unsupervised regression, but a particular case with an appropriate constraint on the parameters corresponds to classic regression. More about it here. As for shape classification, see here.

3

Parzival_007 t1_j7otlv7 wrote

Good work! I'll give it a read and share any feedback!

2

Iunaml t1_j7q38uc wrote

So that's an ad.

I don't like this "subtle" style of marketing. We're talking about a $63 book and yet the first sentence is puzzling.

2

JackBlemming t1_j7ql45e wrote

There's nothing wrong with relevant self promotion, especially if it's high quality material. Obviously bad/irrelevant stuff should be removed, but that's up to the mods' discretion.

I personally bookmarked this for later as it's very interesting to me.

3

Iunaml t1_j7tn4yr wrote

> There's nothing wrong with relevant self promotion, especially if it's high quality material.

Who is the judge?

Do I really care about the quality if it's a paid book that is not upfront about its price? What could that tell us about the author and the information contained inside the book?

1