Data Engineering is the foundation for the new world of Big Data.
Reddit Posts and Comments
1 post • 19 mentions • top 18 shown below
25 points • world_is_a_throwAway
You don't actually have Data Engineering experience.
Based on the above information, you have experience as a Database Administrator.
(Sorry torch bearers, solely ETL does not qualify you as a DE)
It's definitely very common for DB admins to want to transition to a broader data engineer role (if you want career growth at all), so it's a known move to both recruiters and hiring managers. Yes, your skill set aligns with a lot of the necessary ones for DE.
However, until you are, for example, running your own distributed compute cluster (i.e. Spark), managing its compute resources, extracting data with said cluster, writing it through an ETL pipeline, and delivering the output, all in a scalable way, AND able to talk about its optimization potential and the pros and cons of decisions along the pipeline, you aren't working in data engineering. Also dude, seriously, learn Python and Scala ASAP. Maybe even Go?
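To make that concrete: the extract → transform → deliver shape described above can at least be sketched on a single node in plain Python (all names and data here are invented; in a real DE setup the extract and transform stages would run on a distributed engine like Spark):

```python
# A minimal, single-node sketch of the extract -> transform -> load shape.
# In a real pipeline, extract/transform would run on a Spark cluster;
# the function and field names here are illustrative only.

def extract(records):
    """Pull raw events from a source (stand-in for reading via a cluster)."""
    return [r for r in records if r is not None]

def transform(rows):
    """Normalize and clean rows (stand-in for distributed transforms)."""
    return [{"user": r["user"].lower(), "amount": round(r["amount"], 2)}
            for r in rows if r.get("user")]

def load(rows, sink):
    """Deliver output to a destination table/file (stand-in for a warehouse)."""
    sink.extend(rows)
    return len(rows)

raw = [{"user": "Alice", "amount": 3.14159}, None, {"user": "", "amount": 1.0}]
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
```

The point of the interview conversation is everything this sketch leaves out: where each stage runs, how it scales, and the trade-offs of each decision along the pipeline.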
With all of that being said, u/BeerMang is absolutely right here:
>"....recruiters are dying for DE talent. Apply anyway, be honest, and the right company will hire you based on your CAPACITY to pick up a new/exciting tech rather than your existing knowledge in it."
That's because personal data engineering projects are usually a bit more abstract and harder to define a use case for than most single-node applications. Here are some specific bullets:
- Apply everywhere, always. I can't tell you how many hiring processes I've been involved in that have brought up modifying the role for 'someone like me.' During my last job search I'm sure I put in well over 300 applications. I turned down 4 offers before finding a great fit and a company that was actually prepared to support a data engineer. --> Point is: you're obviously good at _xyz_data, so you should interview based on your capability to learn new_xyz_data
- Learn everything you can about distributed computing clusters. They are not the future, they are the now. You need the ability to dissect, manage, and distribute workloads and workload flows! You need this yesterday
- Get in the cloud. Pick a project. Hell, use AWS/Azure/Google's tutorials. They are all really good and a great spin-up into cloud tech.
- AWS certification appears to be the one most specific to data engineering, and it's asked for more than any other cert request I have seen. If you can get either A) the Solutions Architect or B) the Data Engineering cert on your own, you will be golden at so many interviews.
- Learn some networking: I hate to say this, but lots of companies want to treat the cloud and distributed computing like internal networks. So it's essential you have some understanding of network security and of how your pipeline can safely transport and contain data.
- MOOCs: Pluralsight, Udacity, Udemy, whatever-the-fuck-university-online-that-is-democratizing-education, etc.
- Udacity has an expensive but pretty darn good course, the data_engineering_nanodegree
Okay, that's all for now. Definitely start applying and interviewing. I learned much of what this field actually entails by showing up to interviews, bombing a lot of them, and crushing some of them. But with each one I got better and gathered more information on what the most common overlaps are across companies' Data Engineer reqs.
Get out there and get some! Good luck.
24 points • IndoSpike
Learning Data Engineering- New Course on Udacity, thoughts?
Udacity has started a new course on data engineering with big data technologies. As someone wanting to break into data engineering roles for big data, I was wondering what the community thinks of the syllabus and of learning on Udacity.
55 points • WannaBeGISGuru
Udacity Data Engineering Nanodegree Course Review
[Udacity's new Data Engineering Nanodegree](https://www.udacity.com/course/data-engineer-nanodegree--nd027) is one of the few data engineering courses out there right now. It is geared towards people who already have programming experience, specifically with Python and SQL. Udacity estimates that it would take someone 5 months to complete at 5 hours a week (\~108.6 hours of content) at the one-time price of $999 USD (this has changed since I started and is now $399 USD / month). The course is broken up into five sections: Data Modeling, Cloud Data Warehouses, Data Lakes with Spark, Data Pipelines with Airflow, and a capstone project. Each section has different instructors, each bringing a different teaching style in a way that keeps things refreshing while still leaving you wondering if it happened simply due to a lack of communication. The structure for each section consists of introducing concepts through lectures, reinforcing the material with demos and exercises (typically in a Jupyter Notebook), and concluding with 1-2 projects dealing with designing an ETL process using song data for an imaginary company called Sparkify.
# My background
I have about two years of professional experience wrangling data with Python and SQL and about a year and a half of web development experience. I have a bachelor's degree in engineering and took a few introductory computer science courses. A few months ago I completed [Dataquest's Data Engineering Path](https://www.dataquest.io/path/data-engineer/) and have taken a few [DataCamp](https://www.datacamp.com/) courses as well as [CS50](https://www.edx.org/course/cs50s-introduction-to-computer-science) and [CS50 Web](https://cs50.harvard.edu/web/2019/spring/). I enrolled in this course due to its focus on cloud technologies, which I have been learning through trial by fire at a data engineering job I started a few months ago, mostly using AWS, Postgres, Python, and Airflow.
# Individual Sections Review
## Data Modeling
This section introduces what data modeling is, why it's important, and the differences between relational and NoSQL databases. It covers important concepts such as ACID transactions, fact and dimension tables, and the difference between star and snowflake schemas. The section uses Postgres and Apache Cassandra and includes a project for each, where you design schemas and load song logs and song metadata into fact and dimension tables.
- Introduces most of the Postgres and Apache Cassandra commands a data engineer would probably ever use
- Provides a good explanation on when you'd want to use SQL vs. NoSQL
- Most lectures consisted of watching the lecturer read slides off her laptop
- This section's exercises seemed to have more bugs than the rest
- There were a few questionable practices in this section such as a try / except block around everything and always inserting rows individually instead of in bulk
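On the bulk-insert point: the fact/dimension pattern this section teaches, together with a bulk load via `executemany`, can be sketched with stdlib sqlite3 (the course uses Postgres, but the modeling idea is identical; table names and data here are invented):

```python
import sqlite3

# A tiny star-schema sketch: one dimension table, one fact table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_song (song_id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE fact_play "
            "(play_id INTEGER PRIMARY KEY, song_id INTEGER, duration REAL)")

# Bulk insert with executemany instead of one INSERT statement per row --
# the practice the review wishes the course had demonstrated.
songs = [(1, "Song A"), (2, "Song B")]
plays = [(1, 1, 200.5), (2, 1, 180.0), (3, 2, 240.0)]
cur.executemany("INSERT INTO dim_song VALUES (?, ?)", songs)
cur.executemany("INSERT INTO fact_play VALUES (?, ?, ?)", plays)

# A typical star-schema query: join the fact table to a dimension.
cur.execute("""
    SELECT d.title, COUNT(*) AS plays
    FROM fact_play f JOIN dim_song d ON f.song_id = d.song_id
    GROUP BY d.title ORDER BY d.title
""")
result = cur.fetchall()
```

In Postgres the same bulk load would typically go through `psycopg2.extras.execute_values` or `COPY` rather than row-by-row inserts.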
## Cloud Data Warehouses
This section builds on the previous section and explains the need for a data warehouse and what the benefits of hosting it in the cloud are. AWS basics such as IAM, creating an EC2 instance, and security groups are introduced, as well as a brief introduction to infrastructure as code using boto3. Other concepts such as OLAP cubes, rollup, drill-down, grouping sets, and columnar storage are discussed. The project consists of designing tables in Redshift and loading data from S3 to Redshift.
- Provides practical example exercises such as loading S3 files in bulk to tables in Redshift using the COPY command
- Makes creating a sandbox data warehouse environment much more approachable. Prior to this I always thought it would be too expensive and complicated to build one on my own and this section proved me wrong
- Tries to cover too much ground. Topics like infrastructure as code are glossed over and oversimplified
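For the rollup/grouping-sets concepts mentioned above, here is a small sketch using stdlib sqlite3 (many warehouses support `GROUP BY ROLLUP` natively; SQLite doesn't, so the equivalent result is built with `UNION ALL` — the data and column names are invented):

```python
import sqlite3

# ROLLUP(region, product) means three grouping sets:
# (region, product), (region), and the grand total ().
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EU", "a", 10.0), ("EU", "b", 20.0), ("US", "a", 5.0),
])

# Emulate GROUP BY ROLLUP(region, product) with UNION ALL,
# using NULL to mark the rolled-up levels.
cur.execute("""
    SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
    UNION ALL
    SELECT region, NULL,    SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL,   NULL,    SUM(amount) FROM sales
""")
rollup = cur.fetchall()
```

Drill-down is then just the reverse direction: moving from the grand-total row back to the per-region and per-product rows.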
## Data Lakes with Spark
Introduces what big data is and why big data tools like Hadoop and Spark are necessary. Provides a conceptual overview of how distributed systems like Hadoop and Spark work. Hands-on exercises consist of using PySpark to wrangle data. Explanations of why an organization may need a data lake instead of a data warehouse are provided. The project consists of ingesting raw S3 files, creating fact and dimension tables, partitioning them and writing them back to S3 all with PySpark.
- Provides an excellent explanation of how distributed file systems and cluster computing work
- Gives a good explanation on when to use PySpark data frames vs PySpark SQL and how to use them interchangeably
- This project involved filling in a lot more blanks than the rest of the projects and I found it to be particularly time-consuming. The number of files to ingest from S3 seemed too large to run in a reasonable amount of time
- I wish it would have included more information and exercises about using PySpark on a cluster of machines instead of on a single local one
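For readers unfamiliar with the partitioned output that project produces: Spark's `df.write.partitionBy(...)` writes Hive-style `key=value` directories, an idea that can be sketched with just the stdlib (paths and data below are invented):

```python
import json
import os
import tempfile
from collections import defaultdict

# What df.write.partitionBy("year", "month") produces, sketched by hand:
# rows grouped by key, written under key=value directories -- the same
# layout you'd see in the project's S3 output.
rows = [
    {"year": 2018, "month": 11, "song": "a"},
    {"year": 2018, "month": 11, "song": "b"},
    {"year": 2019, "month": 1,  "song": "c"},
]

out = tempfile.mkdtemp()
buckets = defaultdict(list)
for r in rows:
    buckets[(r["year"], r["month"])].append(r)

for (year, month), part in buckets.items():
    d = os.path.join(out, f"year={year}", f"month={month}")
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-0000.json"), "w") as f:
        for r in part:
            f.write(json.dumps(r) + "\n")

written = sorted(os.path.relpath(os.path.join(dp, fn), out)
                 for dp, _, fns in os.walk(out) for fn in fns)
```

Readers can then filter by `year=2019` by reading only that directory, which is why partitioning the fact tables before writing back to S3 matters.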
## Data Pipelines with Airflow
Data pipelines, DAGs, and Airflow concepts such as operators, sensors, and plugins are introduced. The final project involves using Airflow to load S3 files into partitioned Redshift tables and perform data quality checks afterward.
- The only tutorial I've found on how to use data quality checks with Airflow. I've started using this technique at work and it is a game changer
- Airflow is a bitch to deploy, and somehow they engineered a way for people to run it on Udacity's workspaces. Kudos to the engineers on that
- This section felt a bit shorter and was more focused around a specific technology than the other sections. Not necessarily a con but I would have liked to have the lectures be more generalized around the concepts of a data pipeline
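The data-quality-check pattern from that project, stripped of Airflow itself, is roughly: run a check query after each load and fail loudly if the result is off. A minimal stdlib sketch (check queries and table names are invented; in Airflow this logic would typically live in a custom operator's `execute` method):

```python
import sqlite3

def run_quality_checks(conn, checks):
    """Run (sql, predicate) pairs after a load; raise if any check fails."""
    failures = []
    for sql, expect in checks:
        got = conn.execute(sql).fetchone()[0]
        if not expect(got):
            failures.append((sql, got))
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")
    return len(checks)

# Pretend this table was just loaded by an upstream pipeline task.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])

passed = run_quality_checks(conn, [
    ("SELECT COUNT(*) FROM users", lambda n: n > 0),
    ("SELECT COUNT(*) FROM users WHERE user_id IS NULL", lambda n: n == 0),
])
```

The raised exception is what fails the downstream task in the DAG, so a bad load never silently reaches consumers.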
Overall, I really enjoyed this nanodegree and learned a lot of practical things from it that I have already started using at my job. I would estimate I spent about 40 hours completing it so I definitely felt short-changed in content and think it is incredibly overpriced for what it is. What I don't like is how Udacity markets their courses as a way for someone to make a career change with no real-world experience. I find that incredibly hard to believe and can't imagine a company hiring someone with no real world experience after completing this nanodegree. I found the content to mostly be of a very high quality and I think this is really the only intermediate-advanced data engineering course out there. If you have the cash and are interested in learning data engineering in the cloud I would highly recommend it.
2 points • jrich8573
Look into Udacity's Data Engineering Nanodegree: https://www.udacity.com/course/data-engineer-nanodegree--nd027. However, if you are looking to switch careers, you should play to your strengths; namely SQL, data modeling, and pushing and pulling data.
1 points • kilmongerrr
Udacity announced free 30-day access for people in the US and Europe. They have a Data Engineering track, which includes Spark, Python, and a few other big data courses as part of the program. Do check out Udacity's Data Engineer with Python.
1 points • Folasade_Adu
I'm doing this right now and like it. You can sign up for the free month and try to get as much done as you can.
3 points • sanchit089
Here is the link to get more details: https://www.udacity.com/course/data-engineer-nanodegree--nd027
They are currently at $1,195 for 5 months; they also offer a "pay as you go" option at $269 per month.
I would suggest going for the per month option.
2 points • xiBraHem
Here's the list of projects:
- Data Modeling with Postgres and Apache Cassandra
- Data Infrastructure on the Cloud (AWS)
- Big Data with Spark
- Data Pipelines with Airflow
and the last project is a capstone, in which you combine all that you have learned during the course. You will gather data from different data sources and perform ETL to create a clean database for analysis.
You can find more details in the link below:
2 points • godofwar5-2020
I just started taking this one:
It costs a lot, but because there were basically no other resources that seemed more comprehensive than this, I went for it. I've heard good things from current students also.
1 points • rvazquezglez
I paid $899.10 after a 10% discount for students who enrolled in the inaugural class on a 5-month term.
It seems the pricing model has changed and you can pay monthly ($399) or 5 months in advance ($1,795). If you dedicate full time to it you can cover it in 1 or 2 weeks.
Here's the pricing page: https://www.udacity.com/course/data-engineer-nanodegree--nd027
1 points • forbiscuit
Oh nice! Would you be interested in Data Engineering? Udacity's Data Engineer program is on point and gives a great foundation for expanding your tech skills - https://www.udacity.com/course/data-engineer-nanodegree--nd027
1 points • augustuscauchy1029
Since your company is paying for it, I recommend browsing through this: https://www.udacity.com/course/data-engineer-nanodegree--nd027
1 points • Tamiyo22
Here are some things I have found so far
(I would like to endorse DataCamp. It is not intensive, but it helped me get the ball rolling quickly with a few skills for my internship)
I hope these help!
1 points • DataNerd555
I don't know if you have the funds, but I would recommend checking out the Data Engineering Nanodegree at Udacity. If you apply during these times you can get a 40% discount because of COVID-19. This should give you the basics. https://www.udacity.com/course/data-engineer-nanodegree--nd027
I think you made a good call with data engineering; it's a really good role and frankly not enough people are applying for this kind of job. Too many people are trying for data science, which is getting very saturated, but data engineering still has plenty of gaps that need to be filled.
Your work experience as a bank teller and a VW sales representative isn't tech, which doesn't help, but during the interview you need to tell the employer that it showcases that you understand the working environment, how to communicate and collaborate with others, and all the rest of those soft skills.
As far as the degree goes, you don't really need more schooling. What you need is proven experience that you can get the job done. For that, you need to convince the employer. Most employers will evaluate you based on:
- experience (not an option in this case)
- interview challenges (totally possible to prepare) [go to LeetCode or find practice data engineering interview questions online]
- past projects (totally possible to prepare) [Udacity, Udemy, Coursera, online]
Your best bet is to break into a company that will teach you, pick up the skills and the title, and grow from there. I don't know how much you have gotten your feet wet with data engineering, but I would do a few projects and a certificate just to familiarize yourself and boost your portfolio (unless you know someone who would give you an internship or job right off the bat; if that's the case, make that experience the top priority):
- Udacity has a data engineering nanodegree
- GCP offers a certificate through Coursera
- Specializations on Coursera
Also, a good tip for breaking into your first job is applying to startups (because once you get that first data engineering job, it becomes a lot easier). Generally, startups have a much lower barrier to entry and frankly need data engineers too. They will take a chance on you more than big companies will. Once you're there, you'll be getting very hands-on and learning fast, and after a year or so you'll have the title and a proven record and can jump somewhere else.
Happy to talk more, and would love to hear your thoughts.
1 points • linkerzx
It is definitely not worth going through 4 years of school to get into Big Data. You should go for online courses, a bootcamp program, or potentially a master's degree rather than a full-on bachelor's course. So either:
1) Follow some online courses - this should at least get you started with some knowledge in the domain. EDX, Coursera, and Udacity all have decent offerings to get started. Before going full-on into Big Data I would suggest taking a few more courses on Python and SQL. I have recently written about how to learn data science from online resources, and I think going into Big Data requires a similar base knowledge of Python/SQL.
EDX has a few nice courses from BerkeleyX on Apache Spark (a framework used in Big Data):
- BerkeleyX: CS100.1x Introduction to Big Data with Apache Spark
- BerkeleyX: CS105x Introduction to Apache Spark
- BerkeleyX: CS120x Distributed Machine Learning with Apache Spark
- BerkeleyX: CS190.1x Scalable Machine Learning
Udacity has an introduction to data engineering - which is currently in free access for a month.
2) I am not sure where you are based, but data engineering bootcamps might be available. The most famous data engineering bootcamp is the Insight program in the US, which teaches the basics of data engineering in about 7 weeks. Chances are there might be a bootcamp close to where you live.
3) Aim for a postgraduate / master's degree in big data, analytics, or data science - at least that way you would limit the time spent learning from 4 years to a single year. Given that you are already pursuing a postgraduate course, I don't think this would be the preferable option compared to self-learning with some MOOCs.
1 points • zahnkil
A few graduate programs I am aware of: https://www.edx.org/masters/online-master-science-computer-science-utaustinx
No clue if these are any good and/or worth it: https://www.udacity.com/course/data-engineer-nanodegree--nd027