ICT Insight with Institute of ICT Professionals: Data formats to work with as professionals

0

By Kaunda ISMAIL

File format is a critical aspect of the data profession. While tools and roles often grab the spotlight, knowing how to manage and work with different file formats is very important.

Why Do We Need to Understand Data Formats?

Before jumping into the technicalities, let’s answer this simple question: why bother learning about data formats? Here’s why:

  • Interoperability: Imagine trying to send an appointment letter to your secretary who only has Excel on his computer. How will he be able to open it? That will be Chaos! Similarly, understanding data formats ensures smooth data sharing across tools, platforms, and people.
  • Efficiency: Choosing the right format optimizes storage and processing, saving time and resources.
  • Accuracy: Compatibility issues can lead to errors. Therefore, when you know the right file formats to work with, you minimize errors and that increases your accuracy.
  • Scalability: Whether working with small Excel files or terabytes of data in a cloud warehouse, knowing the formats helps you adapt seamlessly.

Main Categories of Data Formats

Data formats fall into three broad categories:

  1. Structured Data Formats

The simplest way to think about this is that: any data that you can open with MS Excel is structured. Structured data formats are highly organized/structured, with rows, columns, and clearly defined relationships. Examples include:

  • Excel (XLS, XLSX): A household name in data handling, Excel is more than a spreadsheet tool—it’s a data powerhouse. It supports advanced formatting, formulas, and visualizations.
  • Delimiter-Separated Values (DSV): This is mostly referred to as Comma Separated Value (CSV) data. This is data that is separated by comma (“,”). This same data can also be separated by any other thing, not just comma. It can be separated by Tab, Dash, Pipe, etc. How the file will be called is determined by what character separates the values. CSV for comma separated values, TSV for tab separated values, PSV for pipe separated values. In general, these files are called Delimiter Separated Values because comma, tab, dash, pipe, etc. are delimiters. More details can be found at facebook.com/kaundaAI/videos/483751051430981/
  • SQL Databases: Relational databases like MySQL, PostgreSQL, and SQL Server store data in structured tables. Therefore, they are also considered structured data.
  1. Semi-Structured Data Formats

Semi-structured formats are like a mix between being organized and flexible. They allow you to have some structure but also give you the freedom to handle data in different ways. They are great for data that doesn’t always follow the same pattern. Here are some examples:

  • JavaScript Object Notation (JSON): Think of JSON as a simple and easy way for computers to talk to each other, especially on websites. It’s like a list with items inside that are easy to read and write for both humans and machines. You’ll often see JSON when you interact with APIs.
  • eXtensible Markup Language (XML): XML is like a more detailed version of JSON. It’s used to store and share data, but it uses a lot of tags (like the ones you see in a book’s table of contents) to explain what each piece of data means. It’s powerful but a bit longer and harder to read compared to JSON.
  • YAML Ain’t Markup Language (YAML): YAML is another format that’s used for configuration files (think of it like settings for an app or software). It’s super easy for humans to read because it looks like plain English, making it perfect for writing down instructions or settings.
  • Parquet: Parquet is used for big, complex data that needs to be stored efficiently. It’s designed for situations where you’re dealing with a lot of data, like big data systems or when you’re working with tools like Apache Spark. It makes the data easier to manage and faster to process.
  1. Unstructured Data Formats

Unstructured formats are the wild cards of data. They don’t have any specific rules or structure to follow, which makes them more flexible, but also harder to work with sometimes. Here are a few examples:

  • Text Files: These are simple files like TXT, PDF, RTF, DOCX, and even Markdown. They can contain anything from a short note to a long book, but they don’t have any special organization to them. They’re like a blank sheet of paper—you can write anything, but it may not always be easy to find something specific.
  • Multimedia Files: These are the files that you probably deal with every day, like images (JPEG, PNG), videos (MP4, AVI), and audio files (MP3, WAV). They’re a bit different from text because they contain things you can see or hear, and the data is stored in a way that’s harder to read without special tools.
  • Log Files: Log files are created by systems or applications (think of them like the “diary” of a computer). They’re usually in plain text and are used to track what’s happening in a system—like errors or activity logs. They’re important when things go wrong, as they help troubleshoot and figure out what happened.
  1. Other Data Formats

Aside from the three main categories mentioned above, there are other data formats too.

Big Data Formats: these are used when you’re dealing with massive amounts of data that need to be stored and processed efficiently. Formats like AVRO, which stores data row-by-row, and ORC, which stores data in columns, are made for big data frameworks like Hadoop. Then there’s HDFS, which helps store huge datasets across many computers in a distributed system.

Binary Data Formats: These are used when you need to process data super quickly and in a smaller size. Formats like Protobuf (Protocol Buffers) and MessagePack are designed to be fast and efficient, making them great for high-performance data exchanges.

Geospatial Data Formats: These include formats like Shapefiles, GeoJSON, and KML. These are perfect for storing and sharing geographic data, whether you’re mapping out roads, places, or tracking movements on a map.

Real-Time Data Formats: If you’re working with streaming data, such as from sensors or IoT devices, formats like Kafka Messages and MQTT Packets are built to send and process data as it’s happening in real-time.

How to Choose the Right File Format

Choosing the right file format can be tricky because in the business settings, you will likely be working with different types.

The points listed below are just guides

  • Nature of the data: First, think about the nature of your data. If your data is in a neat, table-like structure with rows and columns (like sales data), go with CSV or Excel. If your data is organized in a tree-like structure (think of categories, subcategories, and items), then JSON or XML are your best bet. For things like images, videos, or audio, you’re looking at JPEG, MP4, or similar formats.
  • Use Case: Next, think about the use case. Are you analyzing the data? Stick with CSV or Excel—they’re easy to work with. If you’re sharing data between different tools or systems, JSON and XML are excellent for that, especially in APIs. For saving things for future reference (like contracts or reports), PDF is a solid option.
  • Tool Compatibility: Finally, make sure the format works with your tools. For instance, if you’re using Power BI for analysis, it works perfectly with Excel, CSV, or SQL databases. But if you need to work with XML or JSON, you might have to do some extra work to get them into Power BI.

Final Thoughts

Data formats are the foundation of everything we do as data professionals. While they might seem simple on the surface, understanding their differences and how to handle them can make a world of difference in your career. So, try all you can to understand them. It’s a skill that will pay off whether you’re creating dashboards, writing SQL queries, or working on machine learning models.

Author: Kaunda Ismail | Member, IIPGH

For comments, contact: [email protected], Facebook.com/KaundaAi