Big Data in Times of Truthspeak and Ingsoc

Big Data and Small Data

We live in a world of “Big Data”. If you’ve ever shopped online and then started seeing “suggestions” for follow-on purchases, your personal data was run through a data science algorithm and, most likely, sold to other companies. That’s why you seem to be “followed around” and “coaxed” to buy or die.

The California Consumer Privacy Act – or CCPA

On January 1, 2020, the US will have its first “Data Protection” regulation, where you the consumer (in California) can opt out of having your data sold and bandied about. About two years ago I “fell into” this field, and have become an early expert on how to engineer data systems to handle such regulations. Europe has had GDPR for more than a year, so CCPA is roughly the US equivalent.

In or around 2010, a handful of new technologies were developed to make computing cheaper and storage of data dirt cheap. Companies like Yahoo, Google, Amazon and Facebook pioneered technologies with odd names like Hadoop, Cassandra, etc. Those technologies are now all long in the tooth, and other technologies are replacing them, but they sparked the age of “Big Data”.

The idea behind Big Data is that if you have enough data about a person or a group of people, you can run statistical calculations and find their behavior patterns, such as what they have purchased. Then you can try to “sell” to these people because they are somewhat “qualified” as likely buyers. A whole new job description emerged, “Data Scientists”: math experts who can also manipulate data. Personally, I think the whole genre is way overhyped and way oversold, mainly because the quality of the data they use is usually terrible and their models don’t take that into consideration, because they don’t even know what the quality is. But I digress . . .

The downside of big data is that the Russians can hack our systems and data with the help of Cambridge Analytica and dupe Facebook into trashing our democracy, so now you know how powerful this is. Then Twitter can do the Devil’s work and promote the devil as they do. Some say this also happened with Brexit, another trashed democracy. This is how you get morons like Trump and Johnson in power. A handful of smart people duping a lot of ignorant people, who drool over “Truthspeak” (fake news) and Ingsoc (Trump’s GOP).

Part of the problem with Big Data is that it’s not managed well. Over the last 10 years, companies have pack-ratted the data (your personal information) and left it all over the place, sometimes safe, most times not. I have worked for companies that had 60+ thousand database tables and 60+ million Amazon S3 files, and they had no idea what they had or where it was. Developers built systems, threw data everywhere, and left these companies. In their wake, no one had the guts to delete all that old data.

This week I figured out a way to track 66 million S3 files that had bad naming standards and make sense of them, using “metadata” (small data) that amounts to between 0.01% and 0.06% of the size of the data itself. Metadata is basically “data about data”. For example, a detailed list of all files and where they reside is metadata. If you know how to do forensics on small data, you can really get a leg up on big data.
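As a rough illustration of “small data” forensics, here is a minimal Python sketch that summarizes a file listing by folder prefix and by file extension. It is not the system described in this post. The function name `summarize_inventory`, the `depth` parameter, and the sample keys are all made up for illustration. It assumes you already have a list of (key, size) pairs, which in practice might come from an S3 listing or an S3 Inventory report.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_inventory(records, depth=2):
    """Summarize (key, size_bytes) records by path prefix and by extension.

    `depth` (a hypothetical knob) is how many leading folders form the prefix.
    """
    by_prefix = Counter()
    by_extension = Counter()
    total_bytes = 0
    for key, size in records:
        path = PurePosixPath(key)
        # Drop the file name, then keep only the first `depth` folders.
        prefix = "/".join(path.parts[:-1][:depth]) or "(root)"
        extension = path.suffix.lower() or "(none)"
        by_prefix[prefix] += 1
        by_extension[extension] += 1
        total_bytes += size
    return {
        "files": sum(by_prefix.values()),
        "bytes": total_bytes,
        "by_prefix": dict(by_prefix),
        "by_extension": dict(by_extension),
    }


# Made-up sample listing, standing in for millions of real keys.
sample = [
    ("exports/2019/10/loans.csv", 1200),
    ("exports/2019/11/loans.csv", 1500),
    ("logs/app/server.log.gz", 300),
    ("README", 10),
]
summary = summarize_inventory(sample)
print(summary["by_prefix"])
```

Even a crude rollup like this shows where files pile up and which formats dominate, without opening a single file.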

The system I built this summer (LendUp Data Discoverer) tracks database tables, and that was relatively easy because the data is “structured”. The problem with files is that they are unstructured and harder to define. Anyway, this week I cracked that nut, and it really is the “Holy Grail” of managing a sea of data files. It’s such a big deal that expensive systems that track metadata, like Collibra and Alation, leave it up to you to deal with file data.

I even figured out how to automatically annotate (describe) files based on several “digital fingerprints”, so I have now completed the huge task of annotating 60+ million files.
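The post doesn’t spell out what the “digital fingerprints” are, so the following Python sketch is only a guess at the general idea: derive a few descriptive tags from the file key alone. The tag names, the `PII_HINTS` list, and the date pattern are all assumptions made for illustration, not the actual rules used here.

```python
import re
from pathlib import PurePosixPath

# Assumed fingerprint: a date such as 2019-10-31, 2019/10/31 or 20191031 in the key.
DATE_RE = re.compile(
    r"(?<!\d)(20\d{2})[-_/]?(0[1-9]|1[0-2])[-_/]?(0[1-9]|[12]\d|3[01])(?!\d)"
)

# Hypothetical name fragments hinting at personal data. Plain substring
# matching is crude and will produce false positives.
PII_HINTS = ("customer", "user", "email", "ssn")


def annotate(key):
    """Return a dict of descriptive tags derived only from the file key."""
    path = PurePosixPath(key)
    name = path.name.lower()
    tags = {
        # The top-level folder often maps to a system or team.
        "area": path.parts[0] if len(path.parts) > 1 else "(root)",
        "format": path.suffix.lower().lstrip(".") or "unknown",
        "possible_pii": any(hint in name for hint in PII_HINTS),
    }
    match = DATE_RE.search(key)
    if match:
        tags["date"] = "-".join(match.groups())
    return tags


first = annotate("exports/2019/10/31/customer_emails.csv")
second = annotate("logs/app_20191105.log")
third = annotate("README")
print(first)
```

Rules like these run against the listing rather than the file contents, which is what makes annotating tens of millions of files feasible.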

You can’t comply with CCPA or GDPR until and unless you know how much data you have, where it is, and what it is. I can now do that, end to end. My goal was to be able to do this by Thanksgiving, so I am a month ahead of schedule.

What a great week.



