5 Ways Data Scientists Can Advance Their Careers – Spiceworks News and Insights

Data scientists and machine learning engineers join companies with the promise of cutting-edge ML models and technology. But often, they spend 80% of their time cleaning data or wrestling with missing values, outliers, a frequently changing schema, and massive load times. The gap between expectation and reality can be enormous. 

Although data scientists might initially be excited to tackle insights and advanced models, that enthusiasm quickly deflates amidst daily schema changes, tables that stop updating, and other surprises that silently break models and dashboards. 

While “data science” applies to a range of roles, from product analytics to putting statistical models in production, one thing is usually true: data scientists and ML engineers often sit at the tail end of the data pipeline. They’re data consumers, pulling it from data warehouses or S3 or other centralized sources. They analyze data to help make business decisions or use it as training inputs for machine learning models. 

In other words, they’re impacted by data quality issues but rarely empowered to travel upstream in the pipeline to fix them. So they write a ton of defensive data preprocessing into their work, or they move on to a new project.  
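That defensive layer often looks something like the following minimal pandas sketch. The column names, schema, and thresholds here are hypothetical, just stand-ins for whatever your pipeline actually delivers:

```python
import pandas as pd

def defensive_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Guard against common upstream surprises before analysis."""
    expected = {"user_id", "signup_date", "revenue"}
    missing = expected - set(df.columns)
    if missing:
        # schema drift: fail loudly instead of silently mis-modeling
        raise ValueError(f"schema changed upstream, missing columns: {missing}")
    df = df.dropna(subset=["user_id"])           # rows are unusable without a key
    df["revenue"] = df["revenue"].clip(lower=0)  # negative revenue is a known glitch
    # cap extreme outliers at the 99th percentile
    df["revenue"] = df["revenue"].clip(upper=df["revenue"].quantile(0.99))
    return df
```

Every consumer of the data ends up writing some version of this, which is exactly the duplicated effort the rest of this article is about avoiding.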

If this scenario sounds familiar, you don’t have to give up or complain that the data engineering upstream is forever broken. Make like a scientist and get experimental. You’re the last step in the pipe and putting models into production, which means you’re responsible for the outcome. While this might sound terrifying or unfair, it’s also a brilliant opportunity to shine and make a big difference in your team’s business impact.   

Here are five ways data scientists and ML engineers can get out of defense mode and ensure that data quality issues, even ones they didn’t create, don’t impact the teams that rely on data.

1. Increase trust through better data quality monitoring

Business executives hesitate to make decisions based on data alone. A KPMG report showed that 60% of companies didn’t feel very confident in their data and that 49% of leadership teams didn’t fully support the internal data and analytics strategy. 

Good data scientists and ML engineers can help by increasing data accuracy, then getting it into dashboards that key decision-makers rely on. In doing so, they’ll have a direct positive impact. But manually checking data for quality issues is error-prone and a huge drag on your velocity.

Using data quality testing (e.g., with dbt tests) and data observability helps ensure you find out about quality issues before your stakeholders do, winning their trust in you (and the data) over time.
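dbt tests themselves are declared in YAML and compiled to SQL, but as a rough illustration, here is what two of its most common generic tests, `not_null` and `unique`, amount to, sketched in Python against a hypothetical `orders` table:

```python
import pandas as pd

def not_null(df: pd.DataFrame, column: str) -> None:
    """Fail if any value in the column is missing (what dbt's not_null test checks)."""
    n_bad = df[column].isna().sum()
    assert n_bad == 0, f"{column}: {n_bad} null values"

def unique(df: pd.DataFrame, column: str) -> None:
    """Fail if the column contains duplicates (what dbt's unique test checks)."""
    n_dupes = df[column].duplicated().sum()
    assert n_dupes == 0, f"{column}: {n_dupes} duplicate values"

# Hypothetical table: a healthy orders extract passes both checks silently.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 5.00, 12.50]})
not_null(orders, "order_id")
unique(orders, "order_id")
```

The value of declaring these checks in a tool like dbt, rather than scattering them through notebooks, is that they run on every pipeline build and fail loudly before a stakeholder ever opens a dashboard.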

2. Make SLAs to prevent confusion and blaming

Data quality problems can easily lead to an annoying blame game between data science, data engineering, and software engineering. Who broke the data? And who knew? And who is going to fix it? 

But when bad data goes into the world, it’s everyone’s fault. Your stakeholders want the data to work so that the business can move forward with an accurate picture.  

Good data scientists and ML engineers build accountability for every step of the data pipeline with Service Level Agreements (SLAs). SLAs define data quality in quantifiable terms and assign responders who spring into action when problems arise. Done well, SLAs avoid the blame game entirely.
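As an illustration, "quantifiable terms" can be as simple as a freshness window and a completeness ratio encoded in an automated check. The table, thresholds, and metrics below are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA for an orders table: refreshed within the last 6 hours,
# and at least 95% of rows carry a non-null customer_id.
FRESHNESS_SLA = timedelta(hours=6)
COMPLETENESS_SLA = 0.95

def check_sla(last_updated: datetime, non_null_ratio: float) -> list[str]:
    """Return a list of SLA violations; an empty list means the table is healthy."""
    violations = []
    age = datetime.now(timezone.utc) - last_updated
    if age > FRESHNESS_SLA:
        violations.append(f"freshness: table is {age} old (SLA: {FRESHNESS_SLA})")
    if non_null_ratio < COMPLETENESS_SLA:
        violations.append(f"completeness: {non_null_ratio:.0%} < {COMPLETENESS_SLA:.0%}")
    return violations
```

When a check like this pages a named responder, the conversation shifts from "who broke it?" to "the freshness SLA fired; here's who's fixing it."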

3. Faster analysis through experiments

Trust is fragile, and it erodes quickly when stakeholders catch mistakes and start assigning blame. But what about when they don’t catch quality issues? Then models underperform or bad decisions get made. Either way, the business suffers. 

For example, what if a single entity is logged as both “Dallas-Fort Worth” and “DFW” in a database? When you test a new feature, everyone in “Dallas-Fort Worth” is shown variation A and everyone in “DFW” is shown variation B. No one catches the discrepancy. You can’t draw conclusions about users in the Dallas-Fort Worth area: your test has been thrown off, and the groups haven’t been properly randomized.  
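One hedge against this failure mode is to canonicalize entity names before any segmentation, and to bucket experiment users on user ID alone so spelling variants can never leak into the randomization. A minimal sketch, with a hypothetical alias map:

```python
import hashlib

# Hypothetical alias map: every observed spelling points to one canonical name.
CANONICAL = {
    "dfw": "Dallas-Fort Worth",
    "dallas fort-worth": "Dallas-Fort Worth",
    "dallas-fort worth": "Dallas-Fort Worth",
}

def canonicalize(region: str) -> str:
    """Collapse alternate spellings into a single entity before analysis."""
    return CANONICAL.get(region.strip().lower(), region)

def assign_variant(user_id: str) -> str:
    """Deterministic 50/50 bucketing keyed on user_id only, so region
    spelling cannot influence which variation a user sees."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

With this in place, “DFW” and “Dallas-Fort Worth” users land in the same analysis group, and assignment stays stable across repeated lookups.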

Clear the path for better experimentation and analysis through a foundation of higher quality data. By using your expertise to boost quality, your data will become more reliable, and your business teams can run meaningful tests. The team can focus on what to test next instead of doubting the results of the tests.

4. Become the point-person for data quality 

Confidence in the data starts with you; if you don’t have a handle on high-quality and reliable data, you’ll carry that burden into your interactions with the product and your colleagues. 

So stake your claim as the point-person for data quality and data ownership. You can have input into defining quality and delegating responsibility for fixing different issues. Remove friction between data science and engineering. 

If you can lead the charge to define and boost data quality, you’ll impact almost every other team within your organization. Your teammates will appreciate the work you do to reduce org-wide headaches.

5. Minimize data waste 

Incomplete or unreliable data can add up to terabytes of waste. That data lives in your warehouse, getting included in queries that incur compute costs. Low-quality data is a major drag on your infrastructure bill as it gets filtered out query after query. 

Identifying this wasted data is one way to immediately create value for your organization, especially in pipelines that see heavy traffic for product analytics and machine learning. Recollect, reprocess, or impute and clean existing values to reduce storage and compute costs. 

Keep track of the tables and data you clean up, along with the number of queries run on those tables. Then tell your team how many queries are no longer running on junk data and how many gigabytes of storage have been freed up for better things. 
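A lightweight way to report that impact is to snapshot table sizes and query counts before and after cleanup. The table names and numbers below are purely illustrative:

```python
# Hypothetical before/after snapshot of tables cleaned up in the warehouse:
# (table, gb_before, gb_after, weekly_queries)
cleaned_tables = [
    ("events_raw",     1200, 800, 340),
    ("user_snapshots",  450, 300,  90),
]

def cleanup_report(tables: list[tuple]) -> dict:
    """Summarize storage freed and how many queries now hit cleaner data."""
    gb_freed = sum(before - after for _, before, after, _ in tables)
    weekly_queries = sum(q for *_, q in tables)
    return {"gb_freed": gb_freed, "weekly_queries_on_clean_data": weekly_queries}
```

Even a two-line summary like this turns invisible janitorial work into a concrete, reportable win.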

All data professionals, seasoned veterans and newcomers alike, can become indispensable to their organizations by taking ownership of more reliable data. Tools, algorithms, and analytics techniques keep growing more sophisticated, but input data does not: it’s always unique and business-specific, and even the most sophisticated models can’t run well on erroneous data. Through the five steps above, data science can be a boon to your entire organization. Everyone wins when you improve the data your teams depend upon. 

Which techniques can help data scientists and ML engineers streamline the data management process? Tell us on Facebook, Twitter, and LinkedIn. We’d love to know!
