Categories
Kdnuggets RSS Feed

Data Extraction Software Octoparse 8 vs Octoparse 7: What’s New

 
 



Octoparse 8 was recently released. Get a better understanding of what the differences between OP 8 and 7 are by reading this overview.


Sponsored Post.

Our brand-new version, OP 8, came out just a few weeks ago. To help you understand the differences between OP 8 and OP 7, I have summarized all the updates in this article.


1. Faster Scraping Speed

 
We all know how valuable time is, and scraping speed is the core feature for delivering a high-quality extraction experience. In version 8.1, local run speed has improved significantly and is now 10 times faster than in the previous version. It took 21 minutes 41 seconds to extract 100 records of data in OP 7, but in OP 8 it takes only 2 minutes to get the same amount of data from the same web pages!


We also strive to provide a stable and smooth scraping experience. OP 8.1 resolved technical issues present in OP 7 such as stuttering, sudden exits, and crashes. Common UI interactions in OP 8.1, such as app launching, task configuration, and data export, complete within 0.5 seconds on average.

2. Robust Compatibility

 
1. Mac OS Compatible

The great news is that, in addition to Windows, OP 8.1 is now available for Mac OS as well! Mac users have been vocal in their demand for a Mac version. You spoke, we listened, and here it is!


2. Advanced Browser Compatibility

We switched the built-in browser engine from Firefox (used in OP 7) to Google Chrome (used in OP 8). You'll notice that websites that failed to load in OP 7 can now be loaded in OP 8.1 right away.

3. Hands-Free Workflow is now available

 
1. Auto-detect data fields

Another essential feature we upgraded is automation. Once you enter a webpage URL in OP 8, Octoparse detects the website and guesses data fields automatically. For users tired of writing XPath in OP 7, this brings huge convenience, as there is no need to build crawlers from scratch.

2. Switch detection results to find the best fit

If the default detection results don’t suit the users’ needs, they can switch the results to capture other layers of data. Octoparse 8 auto-detects multiple layers of web data, which takes hands-free data extraction to another level.


3. Trigger Nested Web Pages Extraction in one click

"Nested web page" refers to the detail page behind certain links on a list page. In OP 7, capturing nested web page data required creating pagination to click through each listing and then scraping the detail page content. With OP 8, this process is greatly simplified: users can trigger it simply by checking one option on the Tips Panel, and Octoparse will fetch the detail pages on its own.

Figure: Nested Web Page Extraction

4. A broader and clearer view of the sample data

In OP 7, you have to go back and forth between the browser and the control panel to view the data, whereas OP 8 has a data preview section that gives you a broader view of all the data columns before executing the task. What's more, when you point and click on specific data, Octoparse 8 highlights it in the built-in browser, which helps you easily check its location.

Figure: Data highlighted automatically

4. Refreshed Design for More Control

 
The overall design of OP 8 is very different from OP 7. In OP 7, the built-in browser sits below the workflow and the customization area. The OP 8 interface is much cleaner, with an upgraded menu bar and the workflow on the left-hand side of the browser.


1. Sidebar is smarter

The "Quick Filter" and "Recent Tasks" filters on the sidebar give you access to recent projects with a single click. Compared to OP 7, the new release lets you reach a project quickly without going back to the dashboard and searching through the entire list of scraping tasks. You can jump in quickly even when the sidebar is collapsed.


2. Manage silos of tasks at your fingertips

As your project grows to involve multiple scrapers in Octoparse, you are likely to feel overwhelmed by the number of tasks: some need attention and prioritization, while others are tucked away for later access. In OP 7, we provided multiple filter options to sort them, such as cloud status, local status, task type, schedule status, task group, and so on.

OP 8 takes filtering to a more fine-grained level. To view all tasks completed during a specific period of time, or tasks with a certain number of records extracted, simply define your filters at the top. You can also name and save your filters so you can reach your desired tasks immediately next time.

Figure: Octoparse 8 filters

5. Upcoming Features

 
We expect to have a few more features added in the next release:

  • XPath Tool
  • Data export to JSON format
  • Workflow auto-debug

6. Final Notes

 
You can install both versions on your device. OP 7 supports Windows XP, 7, 8, and 10. OP 8.1 supports all of the above except Windows XP, plus Mac OS (x64 systems only).

Before the full official release, we need to make sure the new version hits the quality bar over the following two weeks, and we can't do that without your feedback and suggestions. If you haven't gotten your hands on the 8.1 Beta yet, take the latest release for a spin. If you encounter any unknown issues, please reach out via bug reports or email us directly at support@octoparse.com.

Categories
Data Science Central RSS Feed

Machine Learning for Frontend Development: The Road to the Future

When creating a website, an essential objective for developers is to make it appealing. They go through several steps, including drawing concept sketches, building prototypes, and testing the website before pushing it live.

These steps don't happen overnight. Developers spend many months constructing a beautiful and responsive website. That said, advances in technology are making things more manageable for them.

Cutting-edge technologies like artificial intelligence (AI) and machine learning are accelerating frontend development, making the coding and testing of website layouts simpler, more agile, and more productive.

What is Deep Learning?

Deep learning comprises two significant parts: training and inference.

Training involves feeding a large dataset to a new neural network along with the correct answers, i.e., a large number of inputs with their outputs attached.

The neural network takes in all this information and can then make an inference when presented with a new image, based on the past dataset.

After training, the neural network is able to analyze the input and return an output. Deep learning models are primarily used for AI use cases such as computer vision and natural language processing.

Frontend developers can use similar models to design UI elements by training neural networks on related data.
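
To make the training-then-inference idea concrete, here is a minimal sketch in Python using scikit-learn on synthetic data. The dataset, labels, and network size are illustrative placeholders, not anything from the article; a real pix2code-style model would train on screenshots paired with code.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# "A large number of inputs with the outputs attached" (synthetic here)
X = np.random.rand(500, 20)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training: the network repeatedly sees inputs together with the correct answers
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# Inference: the trained network returns outputs for unseen inputs
print("Held-out accuracy:", model.score(X_test, y_test))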

Training Neural Networks

There are two types of data that frontend developers can use to train a neural network. The first is graphical user interface (GUI) screenshots, and the other is whiteboard sketches. The screenshots and sketches are paired with their corresponding code.

This training gives the neural network the ability to generate code from a GUI screenshot, an approach called pix2code; converting sketches to code is called sketch2code. After analyzing the screenshots and sketches, the neural network learns how to express these images in HTML or CSS. For designs containing text, the models also go through text-recognition stages.

Along with developing websites, developers spend an equal or greater amount of time detecting and fixing bugs. Automating the testing process with AI could produce a much-needed shift in the industry. AI can be applied to determine robust testing methods and quick bug fixes, thereby raising the standard of development.

AI can even be used to detect bugs that no compiler can catch. A Microsoft research lab is working on such a model: all you have to do is provide a short description of the problem, and the model will write a few lines of code and solve the problem by itself.

Real-world examples

Enterprises of every size have begun to use this technology to power their UX design.

Airbnb, the largest community-driven hospitality company in the world, is employing AI to create its prototypes. According to Benjamin Wilkins, design technology lead at Airbnb, the time required to test an idea should be zero. With this goal, the company uses a prototype that turns design sketches into functional code.

Originally, Airbnb trained the prototype with only a dozen hand-drawn sketches, machine learning algorithms, and some intermediary code for rendering elements from the design system into a browser.

On the other hand, Uizard, a startup in Copenhagen, has built its entire business on transforming images into HTML code. The startup has developed a machine learning algorithm that reads images and returns customizable, production-ready code linked to its platform.

Uizard is applying this concept to the development of mobile and web applications. Similarly, many companies are looking to hire frontend web developers and train them for emerging tech.

Endnotes

Change is happening, and technology is evolving at a rapid pace. As artificial intelligence becomes more prevalent and universal, these technologies are expected to find uses in every phase of our lives. There is no doubt that AI can make the lives of frontend developers simpler by automating monotonous tasks.

Whether AI will ever surpass humans and start developing self-improving applications remains one of the biggest questions to be answered in the years to come. This article aimed to give a brief idea of the impact of AI on frontend development, now and in the coming years.

Web developers and companies will also be able to estimate the time required to complete a task. In the long run, better use of time, resources, and design budgets can help organizations reduce expenses and increase their return on investment (ROI).

Within the next couple of years, experts predict that AI and machine learning will change how developers approach frontend development, speeding up prototyping and lowering the barriers to programming.

Categories
Data Science Central RSS Feed

Digital Transformed Business Models Go Beyond Win-Win

A recent article in BusinessWeek titled “Farmers Fight John Deere Over Who Gets to Fix an $800,000 Tractor” highlights a key challenge with which organizations must wrestle on their Digital Transformation journey: how to create win-win scenarios for one’s customers.

The article asks, “Who really owns a John Deere? Is it the farmer who pays $800,000 for it, or the company that won’t let the farmer fix it on their own?”

As a good old Iowa boy (Charles City, IA), I can understand why farmers are upset that John Deere forbids them from servicing their own tractors and that only authorized dealers can perform maintenance on these machines. As the article says:

”Anything a farmer does on a modern tractor, beginning with opening the cab door, generates messages captured by its main onboard computer, which uploads the signals to the cloud via a cellular transmitter located, in many Deere models, beneath the driver’s seat. These machines have been meticulously programmed and tested to minimize hazards and maximize productivity, Deere says, and it’s all too complicated for farmers to be getting involved in.”

Forcing farmers to have all tractor repairs done by a certified John Deere technician is not only more expensive, it also delays getting necessary repairs done in a timely manner. "Mean-time-to-repair" is a critical key performance indicator when dealing with crops, as unplanned delays in planting and harvesting can quickly turn into lost revenue and shrinking margins.

At stake here for John Deere is more than just control of the tractor's operations. John Deere is taking this drastic action in part to protect its maintenance and parts business. From the John Deere 2018 Annual Report, we can see that the parts business was a substantial part of John Deere's $37.4B in revenue (see Figure 1).

Figure 1

As the article states:

"Historically, the healthy profit margins of the parts and services units have helped smooth out earnings when demand for machines is down. For Deere and its dealerships, parts and services are three to six times more profitable than sales of original equipment, according to company filings. Farmers need to keep aging equipment running; that helped increase annual parts sales by 22%, to $6.7 billion, from 2013 to 2019, while Deere's total agricultural-equipment sales plunged 19%, to $23.7 billion." [1]

What are organizations to do to ensure a successful digital transformation? And what role do customers play in that digital transformation (hint: the most important role)?

As I discussed in "Digital Transformation is Messy, Dirty, and Ugly. Here's How to Embrace it!", there is a big difference between Digitalization and Digital Transformation. Digitalization enhances or replaces human-centric processes with digital technologies, such as using sensors to transmit meter readings to the cloud on a more granular basis instead of having human meter readers travel to the meter every couple of weeks (see Figure 2).

Figure 2

Unfortunately, Digitalization of existing business processes and products still puts most of the operational and business risk on the customer. Digital Transformation, on the other hand, leverages the economics of data and analytics to re-engineer the organization's business models and create new sources of mutual customer value based upon customer, product, and operational insights.

And that gets to the heart of my issue with the John Deere situation: all the benefits accrue to John Deere at the expense of its customers. John Deere gets all the product usage data, which it uses to engineer more profitable tractors and create farmer lock-in while protecting its parts business (through onerous contractual terms). And the farmers get less agility and flexibility to fix their own problems and reduce unplanned operational downtime, which has a direct financial impact on their business.

I’d call that a win-lose scenario for the farmers.

However, there is a better business model that benefits both John Deere and the farmers that leverages all the data and analytic investments that John Deere is making in their products while doing a better job of spreading the value and risk between the manufacturer (John Deere) and their customers (farmers). That’s the “Product-as-a-service” or Xaas business model.

Let me explain.

Digital Transformation builds business models that enable win-win customer scenarios: a win for the customer, and a win for the manufacturer. That's leading organizations to transition towards a "Capability-as-a-service" business model (called Xaas) in which customers pay only for the services (or time) of the product that they actually use. We are already seeing industrial companies move in this direction, including:

  • General Electric (GE) is transitioning away from selling traditional industrial products towards selling ‘as-a-service’ packages by wrapping data and digital solutions around its core offering.
  • Kaeser Kompressoren, which manufactures large air compressors, leverages product usage, performance, and condition data from its air compressors to sell air by the cubic meter… Air-as-a-service.

To ensure a viable and profitable Xaas business model, organizations must master the world of data and analytics to uncover superior insights into consumer usage and product performance patterns to determine optimal consumption pricing and customer Service Level Agreements (SLAs).

Yea baby, superior insights required!

Note: See “Xaas Business Model: Economics Meets Analytics” for more insights into the data and analytic requirements for a successful Xaas business model.


The win for the customer is that the manufacturer now shares in the customer's operational risks. If it's planting or harvesting season, then farmers are probably going to want a Service Level Agreement that guarantees 99.5% operational uptime (which allows only about 50 minutes of downtime per week). And the farmer would pay a higher premium to have that level of service during the most important and riskiest times of their operations.

But wait, there’s more to this story.

If you really want to increase the success of your Digital Transformation journey, then start thinking win-win-win. That is, what is the win for the manufacturer, what is the win for your customer, and what is the win for your customer’s customers. If your customer is a theme park operator, then what are the wins that your Xaas consumption model can provide to your Theme Park guests? If your customer is an Agricultural Solutions provider, then what are the wins that your Xaas consumption model can provide your Agricultural Solutions farming customers?

In John Deere's case, John Deere could adopt a Xaas win-win-win business model: a win for John Deere in providing new products and services to its customers, a win for its customers (dealerships) in providing new sources of revenue, and a win for the dealers' customers (the farmers) in providing uptime guarantees during the riskiest times of the season. This win-win-win Digital Transformation business model is built upon mutual value and risk sharing for all participants.

There are some design thinking techniques that I have just started to play with – such as Strategic Doing – that drive this win-win-win thinking. And when you are focused on leveraging customer, product and operational insights to re-engineer your business model to drive mutual value for your customers and your customer’s customers, then you are onto something very epic!

[1] "Farmers Fight John Deere Over Who Gets to Fix an $800,000 Tractor," BusinessWeek.

Categories
Data Science Central RSS Feed

Artificial Intelligence: Revolutionary Change for All Business Sectors

Today almost all modern technologies are in full swing, and one such technology is artificial intelligence (AI). This technology once looked like science fiction, but today it is much closer to reality. Most businesses and entrepreneurs are making efforts to integrate this advanced technology into their working environments.

According to market research, the global artificial intelligence software market will experience massive growth over the next few years: revenues of around ten billion U.S. dollars in 2018 are estimated to grow to an expected 126 billion by 2025. The overall AI software market includes a wide range of software and applications, including natural language processing, robotic process automation, machine learning, and much more.


Explore How Artificial Intelligence Is Transforming Business Sectors

Nowadays, we have seen drastic changes in the market, from driverless cars to virtual doctors. Modern technology such as artificial intelligence (AI) has transformed everything. It has changed the way people live, travel, work, and perform most of their tasks.

A recent PwC report suggests that AI could add around $15.7 trillion to the world's economy by 2030. British AI organizations and many other companies are among the world's most innovative, and these business sectors are predicted to contribute a great deal to the global economy, resulting in increased economic productivity and growth.


Artificial intelligence (AI) can be described as computerized systems that provide more convenience and efficiency to people. Experts research and study new devices that can take the necessary actions to increase the chance of success; this can be termed AI-based research.

Today's computers are getting smarter by the day, and the credit goes to artificial intelligence. This advanced technology has much to offer almost every business sector, and most industries are already leveraging the advantages it provides.

AI is advancing at a staggering pace across almost all business sectors. A Gartner report estimates that more than one-third of jobs will soon be replaced by robots and smart machines. Google also estimates that AI-based robots will attain human levels of intelligence by the end of 2029. In 2021, AI is estimated to generate more than $2.9 trillion in business value, and it is also forecast to recover more than 6.2 billion hours of worker productivity.

Most experts predict that modern technology such as artificial intelligence (AI) will augment existing jobs rather than displace humans from them. It is clear that artificial intelligence has a bright future: it already plays a vital role in people's lives, helping them improve their performance and carry out numerous tasks more efficiently and accurately than ever before. Below is a list of major industries that are leveraging the benefits of modern artificial intelligence technology.

Banking

Recent research shows that artificial intelligence has proved to be a boon for financial institutions, helping them save around $1 trillion in projected costs. It also suggests that this technology will add around $1.2 trillion in value to the financial industry by 2035. The financial industry is among the industries that have put artificial intelligence to work in their businesses.

Artificial intelligence allows financial institutions to make more effective decisions and offers capabilities such as natural language processing, semantic technologies, and machine learning. This advanced technology helps the finance sector meet customer demands. Today, most finance organizations are considering integrating this technology because it can forecast trends in the stock market and allows them to manage finance-related tasks more efficiently than ever before.

Technology Market

The technology market has gained strong momentum thanks to advanced artificial intelligence-based solutions. This type of advanced solution, often crafted by an AngularJS development company, enables technology businesses to solve complex problems more efficiently. Many companies in the technology sector are replacing vehicles and other devices with autonomous ones, as doing so helps them automate work and ultimately improve productivity.

Medicine and Health Care

Most business owners in the healthcare sector are making use of AI; after integrating AI-based solutions, they have seen positive outcomes in their respective fields. According to one report, AI solutions in the healthcare sector will generate around 6.7 billion U.S. dollars in revenue by the end of 2021, and this number will keep increasing with time.


AI-based equipment and software not only automate most tasks but also help doctors monitor multiple machines at the same time, saving a great deal of time and cost. Advanced solutions developed using artificial intelligence allow doctors to track patients' data, especially during critical periods. Sophisticated features like facial recognition and many others can be leveraged when the sector integrates this advanced technology into its working environment.

Government and Defense

The Defense Advanced Research Projects Agency (DARPA) has recently provided substantial funds for AI research, investing in the development of new technology for military purposes. Robots developed using artificial intelligence can travel into dangerous terrain far more easily than humans, who cannot perform the same tasks as reliably because there is always a risk of losing their lives.

AI-driven solutions also help perform remote surgeries and enable surveillance missions that are beyond human capabilities. Due to the increasing number of cyber-attacks, military forces and governments feel they need to integrate advanced AI-based solutions into their fields of work. These solutions can improve cybersecurity and help detect attacks at an early stage.

Retail Businesses

Most retail businesses these days use automated systems to provide the best customer service. They offer customers a dedicated AI-based platform where they can post their questions and receive accurate answers. Automatic texts are sent to customers when they raise a query or issue on the platform. These systems mimic human speech patterns and generate friendly responses for users.

Summing It Up

Technology has brought great change to service providers and consumers over the past few years. Continuous growth is being measured in the adoption of modern technologies such as artificial intelligence, blockchain, augmented reality, virtual reality, and more. Artificial intelligence (AI) has much to offer both companies and consumers: it enables businesses to make a smooth transition, helping them achieve major success in a shorter time than they imagined.

Businesses can take support from app development companies to craft advanced solutions using these technologies. Such companies can develop an AI-based solution that reduces the overall workload of your business staff. Get in touch with a professional app development company that can craft advanced solutions for your business at a pocket-friendly price and within your estimated timeline.

Categories
Data Science Central RSS Feed

Simulated Statistics is the New Black


Over the years I’ve often been asked by beginners where they should start in statistics, what they should do first, and which parts of statistics they should prioritise to get them to where they want to be (which is usually a higher paid job).

Now, as I'm almost completely self-taught, I don't really consider myself an authority on where one should get started, and I struggle to answer this question with any great conviction.

Sure, I have some thoughts about this subject, but they are coloured by my own experiences.

So I thought I’d reach out to some of our statistics friends to see what they can bring to the party.

Each of the statisticians in this post was asked the same question:

If you had to start statistics all over again, where would you start?

The answers were astounding — they turned out to be a roadmap of how to become a modern statistician from scratch.

In short, how to be a future statistician without ever needing a single lesson!

There is a schism in statistics, and that is between the frequentists and the Bayesians.

Let’s see what the statisticians have to say about this debate.

We start with Kirk Borne (Twitter: @KirkDBorne), astrophysicist and rocket scientist (well, rocket data scientist). Surprisingly, he tells me he's never had any interest in being an astronaut!

“I am not a statistician, nor have I ever had a single course in statistics, though I did teach it at a university. How’s that possible?”

Funnily enough, that was the same for me! So where did he get all his stats from?

“I learned basic statistics in undergraduate physics and then I learned more in graduate school and beyond while doing data analysis as an astrophysicist for many years. I then learned more stats when I started exploring data mining, statistical learning, and machine learning about 22 years ago. I have not stopped learning statistics ever since then”.

This is starting to sound eerily like my stats education. All you need to do is drop the ‘astro’ from astrophysics and they’re identical! So what does he think of starting stats all over again?

“I would have started with Bayesian inference instead of devoting all of my early years to simple descriptive data analysis. That would have led me to statistical learning and machine learning much earlier. And I would have learned to explore and exploit the wonders and powers of Bayesian networks much sooner”.

This is also what Frank Harrell, author and professor of biostatistics at Vanderbilt University School of Medicine in Nashville, thinks about hitting the reset button on statistics (Twitter: @f2harrell). He told me:

“I would start with Bayesian statistics and thoroughly learn that before learning anything about sampling distributions or hypothesis tests”.

And Lillian Pierson, CEO of Data-Mania (Twitter: @Strategy_Gal) also mentioned Bayesian statistics when I asked her where she would start:

“If I had to start statistics all over again, I’d start by tackling 3 basics: t-test, Bayesian probability & Pearson correlation”.

Personally, I haven’t done very much Bayesian stats, and it’s one of my biggest regrets in statistics. I can see the potential in doing things the Bayesian way, but as I’ve never had a teacher or a mentor I’ve never really found a way in.

Maybe one day I will — but until then I will continue to pass on the messages from the statisticians in here.

Repeat after me:

Learn Bayesian stats.

Learn Bayesian stats.

LEARN BAYESIAN STATS!

As I was reaching out and gathering quotations I got a rather cryptic response from Josh Wills (Twitter: @josh_wills), software engineer at Slack and founder of the Apache Crunch project (he also describes himself as an ‘ex-statistician’):

“Computation before calculus is the pithy answer”, he told me.

This intrigued me, so I asked him if he could elaborate a little, and here is his reply:

“So I think stats can be and is taught in three ways:

1. a set of recipes

2. from the perspective of calculus — mostly integrals and what not, and

3. computationally (like the bootstrap as a fundamental thing)”

“Most folks do the recipes approach, which doesn’t really help with understanding stuff but is what you do when you don’t know calculus”.

Ah, I understand the ‘set of recipes approach’, but I didn’t know anyone was still doing the calculus approach. He went further:

“I was a math major, so I did the calculus based approach, because that’s what you did back in the day. You mostly do some integrals with a head nod to computational techniques for distributions that are too hard to do via integrals. But the computational approach, even though it was discovered last, is actually the right and good way to teach stats”.

Whew, thank God for that — I thought he was saying that we should all learn the calculus approach!

“The computational approach can be made accessible to folks who don’t know calculus, and it’s actually most of what you use in the hard parts of real world statistics problems anyway. The calculus approach is historically interesting, but (and I feel heretical for saying this) it should be relegated to a later course on the history of statistical thought — not part of the intro sequence”.
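
To give a flavour of that computational approach, here is a minimal sketch of a bootstrap confidence interval in Python. The "observed" data is simulated and the sample size and number of resamples are illustrative choices of mine, not anything the interviewees prescribed.

import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=50)   # pretend this is your observed data

# Resample with replacement many times, recording the statistic each time
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The middle 95% of the simulated distribution is the confidence interval:
# no calculus, no printed tables
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}")
print(f"95% bootstrap CI = ({lower:.2f}, {upper:.2f})")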

It’s interesting to see the evolution of statistics in this light and shows just how far we’ve come — and in particular how much computers and computing power have developed over the past couple of decades.

It’s truly mind-blowing to think that when I was doing my PhD 20 years ago it was difficult getting hold of data, and when you did get some, you had to network computers together to get enough computing power. Now we’re all swimming in data and err, well, we still struggle to get enough computing power to do what we want — but it’s still way more than we used to have!

I also got a really interesting perspective from Cassie Kozyrkov, Head of Decision Intelligence at Google (Twitter: @quaesita), who told me that she’d:

“Probably enjoy making a bonfire out of printed statistical tables!”

Well, amen to that, but seriously though, where would you start again with stats?

“Simulation! If I had to start all over again, I’d want to start with a simulation-based approach to statistics”.

OK, I’m with you, but why specifically simulation?

“The ‘traditional’ approach taught in most STAT101 classes was developed in the days before computers and is unnecessarily reliant on restrictive assumptions that cram statistical questions into formats you can tackle analytically with common distributions and those nasty obsolete printed tables”.

Got you. So what exactly have you got against the printed tables?

“Well, I often wonder whether traditional courses do more harm than good, since I keep seeing their survivors making ‘Type III errors’ — correctly answering the wrong convenient questions. With simulation, you can go back to first principles and discover the real magic of statistics”.

Statistics has magic?

“Sure it does! My favorite part is that learning statistics with simulation forces you to confront the role that your assumptions play. After all, in statistics, your assumptions are at least as important as your data, if not more so”.

And when it came to offering his advice, Gregory Piatetsky, founder of KDnuggets (Twitter: @kdnuggets), suggested that:

“I would start with Leo Breiman’s paper on Two Cultures, plus I would study Bayesian inferencing”.

If you haven't read that paper (which is open access), Leo Breiman lays out the case for algorithmic modelling, where the data mechanism is treated as a black box and modelled algorithmically rather than with a prescribed statistical model.

This is what Cassie was getting at — statistical models rarely fit real-world data, and we are left to either try to shoe-horn the data into the model (getting the right answer to the wrong question) or switch it up and do something completely different — simulations!

This is an excerpt of my original post, which is quite long — too long to post here in its entirety (there are more than 30 world-class contributors!).

If you’re enjoying reading, you might be interested to hear what Dez Blanchfield had to say about domain experts, or what Michael Friendly and Alberto Cairo said about the past, present and future of data visualisation.

There’s also a free book to download detailing all the comments made by the contributors, including what Jacqueline Nolis and Kristen Kehrer had to say about starting their careers over.

And don’t get me started with the epic suggestions that Natalie Dean and Jen Stirrup had about Information Flow and Detective Work.

Awesome — you really don’t want to miss them!

Read more here

Categories
Data Science Central RSS Feed

Analytics Translators: Fact or Fiction?

It's been two years since McKinsey invented the term analytics translator, called it the 'new must-have role', and predicted we'd need around 5 million of them.

FIVE MILLION

That’s a bit more than the entire population of Los Angeles, but just slightly less than all of Norway.

For the past ten years, we’ve struggled with the ambiguous title ‘data scientist’, then ‘citizen data scientist’. Now it’s ‘analytics translator’.

Although I've seen many 'data scientists' change their LinkedIn titles to 'analytics translator', the problem remains that no one knows what 'analytics translator' really means. McKinsey seems to have slipped the term into a Harvard Business Review article, and it has somehow taken root. What's more, people seem genuinely excited by it.

“we’ve struggled with the ambiguous ‘data scientist’, then ‘citizen data scientist’. Now it’s ‘analytics translator'”

A Bit of Background

When the University of Amsterdam asked me to begin training professionals in how to become ‘analytics translators’, I had to start formalizing the skill set required to be an ‘analytics translator’.

Since McKinsey was responsible for inventing the term, I wanted to understand their thinking.

After a bit of digging, I found they were recycling content from a 2016 paper they’d published, which discussed a shortage of ‘business analysts’.

So it seems that during the 14 months between the publication of these two McKinsey articles, the term 'business analyst' evolved into 'analytics translator'. They must have decided it was time to introduce a new job title, never mind that we still haven't clearly defined what a 'data scientist' is (whereas half the planet is, was, or will soon be a 'data scientist').

"the term 'business analyst' had evolved into 'analytics translator'"

What Is An Analytics Translator?

Quite simply, then, an analytics translator is someone who can understand business requirements and ‘data science’ possibilities. In this sense, ‘analytics translator’ is a skill set and not necessarily a role or a job title.

This skill set is actually extremely important. In a recent survey by O'Reilly, 47% of respondents indicated it as one of the biggest challenges holding back the adoption of AI/ML.

” 47% of respondents indicated this as one of the biggest

challenges holding back adoption of AI / ML “

Some companies may have roles dedicated to gathering requirements, and in that case the role itself could rightly be titled 'analytics translator'. It is not, however, an equivalent label for 'data scientist', 'machine learning engineer', 'statistician', or 'product owner'. At best, the term 'analytics translator' could substitute for the traditionally nebulous title 'business analyst'.

Even if your company doesn't use this job title, the skills related to the analytics translator concept are crucial, and you should make an effort to ensure that a large number of your staff are 'fluent in analytics translation'.

What Skills Does An Analytic Translator Need?

Facilitating the execution of data science projects within a business context generally requires an understanding of business goals and processes; a high-level understanding of analytic vocabulary, techniques, technologies, and processes; and the ability to communicate cross-functionally. I’ve grouped these into 3 categories and 9 subjects. OK, maybe 11.

“Analytic translators should have specific skill sets”

Foundational Technical Understanding

  1. Basics of classical statistics (regression, exploratory data analysis, hypothesis testing, correlation, etc)
  2. Overview of common machine learning techniques (deep learning, SVMs, decision trees, adaptive boosting, clustering algorithms, etc)
  3. Overview of technologies commonly used (programming languages, database concepts, deployment tools (docker, cloud), etc)
  4. Understanding the life cycle of model building, training, deployment, and maintenance

Foundations for Collaboration and Communication

  1. Understanding of frameworks and tooling used by data science teams (scrum, kanban, Jira, git, etc)
  2. Stakeholder management: setting expectations, building trust, change management
  3. Techniques of communicating analytic results (as per Stephen Few, Cole Knaflic, etc)

Foundational Business Understanding

  1. Understanding the goals and priorities of the diverse horizontal and vertical elements of organizations in which data science is only one team or department
  2. Choosing the analytics projects most likely to deliver business value in the current economic and corporate environment

Regarding skills 1-4, a large number of data science teams are run by non-technical managers (4). These managers may not need to understand the technical details of team projects, but it is critical that they understand the difference between a high-risk, high-effort technique and a low-risk technique. They should understand the background terminology of project reports so they can focus on the elements that are unique to their business case. They should be able to see through any BS thrown at them.

Regarding skills 5-7, it’s important to understand that much of the common tooling and methodologies (such as scrum) in use today were developed over the past 20 years by and for software developers, rather than data scientists. Data science teams must adopt and adapt the elements which will make them most effective in their unique tasks. Likewise, visual design and communication skills require special considerations when the subject matter is quantitative and especially when the audience is non-technical. I’ve seen very few data scientists communicate well to non-technical audiences without special training.

I sometimes add two additional items to this list of skills as part of soft-skills training for data scientists; these are:

  1. Working in a multi-cultural environment
  2. Dealing with office politics

Why Are Analytics Translator Skills So Important?

Analytics translator skills are crucial for two roles in particular: data science product owner and data science team manager. Without analytics translator skills, neither of these two roles will be able to bring to bear the full potential of data science within a business context. Analytics translator skills are also extremely valuable for data scientists (individuals specialized in advanced analytics), both to steer them in producing business value and in helping them work more effectively with business counterparts.

How Do We Fill Analytics Translator Roles?

I'd agree with the McKinsey article that the best solution is to train existing staff rather than hire into an analytics translator role. Existing staff will already have deep, proprietary knowledge of your business and will have already formed relationships with key stakeholders. To illustrate, consider an AI healthcare company here in Amsterdam that employs licensed medical doctors as product owners for its data science teams. In such cases it's especially clear that adding analytics translator skills to the hiring requirements is not reasonable, and that providing training is the appropriate solution.

“the best solution is to train existing staff,

rather than hire into the role.”

Also, analytics translator skills are sufficiently general that most staff with a healthy dose of curiosity can learn them fairly quickly, and my experience is that a large number of experienced professionals are indeed eager to learn, given the opportunity. Many companies are setting up internal analytics translator trainings. For those without such trainings, part of the skill set can be learned through online courses, but there are relatively few open enrollment programs that teach the combination of business, communication and stakeholder management skills required. I do teach some of these skills at the University of Amsterdam’s Business school, and I often give in-house trainings for analytics translator skills. During these trainings I also have the privilege of hearing participants’ own thoughts on what an analytics translator could or should be, and how the concept differs from other roles.

Would love to hear the opinions of others here. I welcome your comments below.

  1. "You Don't Have to Be a Data Scientist to Fill This Must-Have Analytics Role." Nicolaus Henke, Jordan Levine, and Paul McInerney. Harvard Business Review, February 2018.
  2. "The Age of Analytics: Competing in a Data-Driven World." McKinsey & Company, 2016.
  3. "AI Adoption in the Enterprise: How Companies Are Planning and Prioritizing AI Projects in Practice." O'Reilly, 2019.
  4. "The State of Machine Learning Adoption in the Enterprise." O'Reilly, 2018.

This article originally appeared as “Analytics Translator: Just another Buzz Word?” on my data science blog.

Categories
RSS Feed Techcrunch

Max Q: Huge week ahead for SpaceX and Virgin Orbit

This week could be the biggest week to date for private spaceflight, with landmark launch attempts coming from both Virgin Orbit and SpaceX.

Virgin Orbit is looking to join the elite club of private launch companies that have actually made it to space, with a full flight test of its combined Cosmic Girl and LauncherOne system. Meanwhile, SpaceX is looking to launch its Crew Dragon spacecraft with people on board – achieving a number of milestones, including returning U.S. crew launch capabilities, and human-rating its Falcon 9 rocket.

Virgin Orbit was supposed to launch its first full demonstration flight on Sunday, but a sensor bug that showed up during pre-launch checkouts means that it's now pushing things back to at least Monday to check that out.

Extra precaution is hardly surprising since this milestone mission could help the company become an operational satellite launch provider – one of only a small handful of private companies that can make that claim.

SpaceX passed its first crucial flight readiness review (FRR) on Friday for its first-ever crewed astronaut launch, setting it up for a full rehearsal of the mission on Saturday leading up to the actual launch. Now it's set for another FRR with partner NASA on Monday, and then the launch should take place on Wednesday – weather and checkouts permitting. This will definitely be one to watch.


Mitsubishi Heavy Industries flew its last mission with its H-II series rocket, and the space transfer vehicle it carries to deliver supplies to the International Space Station. The company is readying a successor to this highly successful and consistent rocket, the H3, which is set to make its launch debut sometime in 2022 if all goes to plan.

While SpaceX is aiming to make history with NASA and two of its astronauts, the person in charge of the agency’s human spaceflight endeavors made a surprising and abrupt exit from the agency last week. Doug Loverro resigned from his position, reportedly over some kind of inappropriate activity he engaged in with a prospective agency business partner ahead of the contract awards for NASA’s commercial human lander program.

Xilinx specializes in building processors that are designed to withstand the rigors of use in space, which include heavy radiation exposure, extreme temperatures and plenty more. The company just debuted a new FPGA for space-based applications that is the first 20nm-based processor for space, and the first with dedicated machine-learning capabilities built in for edge computing that truly redefines the term.

Space has enjoyed a period of being relatively uncontested when it comes to international squabbles – mostly because it’s hard and expensive to reach, and the benefits of doing so weren’t exactly clear 30 to 40 years ago when most of those rules were set up. NASA’s new rules include a lot of the old ones, but also set up some modernizations that are sure to begin a lot of debate and discussion in the space policy community.


In a testing procedure, the X-37B Orbital Test Vehicle taxis on the flightline March 30, 2010, at the Astrotech facility in Titusville, Fla. (Courtesy photo)

The United Launch Alliance launched the X-37B last week on behalf of the U.S. Space Force – marking the first time the mysterious experimental uncrewed space plane has launched for that newly formed agency. The X-37B has flown plenty of times before, of course – but previously it did so under the authority of the U.S. Air Force, since the Space Force hadn't yet been formed.

Categories
Kdnuggets RSS Feed

Top Stories, May 18-24: The Best NLP with Deep Learning Course is Free

 
 



Also: Automated Machine Learning: The Free eBook; Sparse Matrix Representation in Python; Build and deploy your first machine learning web app; Complex logic at breakneck speed: Try Julia for data science


Most Popular Last Week

  1. The Best NLP with Deep Learning Course is Free, by Matthew Mayo
  2. Build and deploy your first machine learning web app, by Tirthajyoti Sarkar
  3. Automated Machine Learning: The Free eBook
  4. Complex logic at breakneck speed: Try Julia for data science
  5. Easy Text-to-Speech with Python
  6. An easy guide to choose the right Machine Learning algorithm
  7. What they do not tell you about machine learning

Most Shared Last Week

  1. The Best NLP with Deep Learning Course is Free, by Matthew Mayo – May 22, 2020.
  2. Automated Machine Learning: The Free eBook, by Matthew Mayo – May 18, 2020.
  3. Sparse Matrix Representation in Python – May 19, 2020.
  4. Build and deploy your first machine learning web app – May 22, 2020.
  5. 13 must-read papers from AI experts – May 20, 2020.
  6. An easy guide to choose the right Machine Learning algorithm – May 21, 2020.
  7. Appropriately Handling Missing Values for Statistical Modelling and Prediction – May 22, 2020.

Most Popular Past 30 Days

  1. Five Cool Python Libraries for Data Science (new)
  2. The Super Duper NLP Repo: 100 Ready-to-Run Colab Notebooks (up)
  3. Natural Language Processing Recipes: Best Practices and Examples (new)
  4. The Best NLP with Deep Learning Course is Free (new)
  5. 24 Best (and Free) Books To Understand Machine Learning (down)
  6. Free High-Quality Machine Learning & Data Science Books & Courses: Quarantine Edition (down)
  7. Deep Learning: The Free eBook (new)

Most Shared Past 30 Days

  1. The Best NLP with Deep Learning Course is Free – May 22, 2020.
  2. Automated Machine Learning: The Free eBook – May 18, 2020.
  3. Beginners Learning Path for Machine Learning – May 05, 2020.
  4. AI and Machine Learning for Healthcare – May 14, 2020.
  5. Natural Language Processing Recipes: Best Practices and Examples – May 01, 2020.
  6. Deep Learning: The Free eBook – May 04, 2020.
  7. Start Your Machine Learning Career in Quarantine – May 11, 2020.
Categories
Kdnuggets RSS Feed

Python For Everybody: The Free eBook

 
 



Get back to fundamentals with this free eBook, Python For Everybody, approaching the learning of programming from a data analysis perspective.


It’s a new week, which means it’s also time to profile and share a new free eBook. This week we get back to basics with Python For Everybody, written by Charles R. Severance, a book intended to develop or strengthen your foundational Python programming skills.

Figure

Python For Everybody was written as an accompanying text for Python for Everybody Specialization on Coursera, Python for Everybody (2 courses) on edX, and Python for Everybody (2 courses) on FutureLearn, all of which were also created by the book’s author.

This book is particularly suited to individuals looking to learn Python in the context of data science and data analytics, according to the author:

The goal of this book is to provide an Informatics-oriented introduction to programming. The primary difference between a computer science approach and the Informatics approach taken in this book is a greater focus on using Python to solve data analysis problems common in the world of Informatics.

First off, you should know that Python for Everybody uses Python 3, though an older version of the book using Python 2 is still available should you, for some reason, want to learn Python 2 (you should definitely not want to do so, however). It is also code-centric, not spending much time on programming theory but rather jumping right to implementation.

Simply put, Python For Everybody teaches you what you need to know about Python to get writing practically useful code right now, particularly from a data analysis perspective.

The book's table of contents is as follows:

  1. Why should you learn to write programs?
  2. Variables, expressions, and statements
  3. Conditional execution
  4. Functions
  5. Iteration
  6. Strings
  7. Files
  8. Lists
  9. Dictionaries
  10. Tuples
  11. Regular expressions
  12. Networked programs
  13. Using Web Services
  14. Object-oriented programming
  15. Using Databases and SQL
  16. Visualizing data

A review of the Kindle version of this book on Amazon states the following:

I have not found a better beginner Python book out there. Plus, now that I am a professional Python programmer, I find myself constantly referring to this book to clarify certain points and reinforce understanding of basic principles. (I may be a professional, but I’m still a rookie.) I’ve purchased a few fat, expensive Python programming books with animals on the cover, but these tend to collect dust. In short, I have a new-found appreciation for this book and how much work went in to writing it. Thanks, Professor Severance!

And it’s not the only positive review; 448 ratings of the book with an average of 4.6 out of 5 should tell you that many others have also found Python for Everybody useful. The consensus seems to be that the book quickly covers concepts, does so in an easily understandable manner, and jumps right into the corresponding code.

Aside from English, the book is also available in Spanish, Italian, Portuguese, and Chinese. You can find further information and links to these editions on the book’s website.

Download the PDF here. You can optionally read the book as a series of interactive Jupyter notebooks here. If you like the book and want to support the author, paperback and electronic (Kindle) copies can be purchased on Amazon.

If you are new to data science and are looking to get a grip on one of the field’s most dominant programming languages, freely-available Python for Everybody is a book that should be at the top of your list.

 

Categories
Kdnuggets RSS Feed

10 Useful Machine Learning Practices For Python Developers

 
 



While you may be a data scientist, you are still a developer at the core. This means your code should be skillfully written. Follow these 10 tips to make sure you quickly deliver bug-free machine learning solutions.


By Pratik Bhavsar, Remote NLP engineer.

Sometimes, as data scientists, we forget what we are paid for. We are primarily developers, then researchers, and then maybe mathematicians. Our first responsibility is to quickly develop solutions that are bug-free.

Just because we can make models doesn’t mean we are gods. It doesn’t give us the freedom to write crap code.

Since I started, I have made tremendous mistakes, and I thought I would share what I see as the most essential skills for ML engineering. In my opinion, this is also the skill set most lacking in the industry right now.

I call them software-illiterate data scientists because a lot of them are non-CS Coursera baptized engineers. And, I myself have been that. 😅

If it came to hiring between a great data scientist and a great ML engineer, I would hire the latter.

Let’s get started.

1. Learn to write abstract classes

Once you start writing abstract classes, you will realize how much clarity they can bring to your codebase. They enforce the same methods and method names. If many people are working on the same project and everyone starts inventing different methods, it creates unproductive chaos.
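
A minimal sketch using Python's built-in abc module; the class and method names here are illustrative, not from the original post.

from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Every model in the project must expose the same two methods."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...

class MeanBaseline(BaseModel):
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

# BaseModel() itself raises TypeError, so nobody can skip the contract.
model = MeanBaseline().fit([[1], [2], [3]], [10, 20, 30])
print(model.predict([[4], [5]]))   # [20.0, 20.0]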

2. Fix your seed at the top

Reproducibility of experiments is very important, and uncontrolled randomness is our enemy, so pin the seed down. Otherwise, you get different train/test splits and different weight initialisations in the neural network, which leads to inconsistent results.
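
As an illustration, here is a minimal sketch of pinning the seeds at the top of a script; the PyTorch lines only apply if PyTorch is part of your stack.

import os
import random
import numpy as np

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy (also drives scikit-learn defaults)

try:
    import torch
    torch.manual_seed(SEED)              # CPU RNG
    torch.cuda.manual_seed_all(SEED)     # GPU RNGs, if CUDA is available
except ImportError:
    pass                                 # PyTorch not installed; nothing to do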

3. Get started with a few rows

If your data is too big and you are working on a later part of the code, like cleaning or modeling, use the nrows parameter to avoid loading the full dataset every time. Use this when you only want to test the code, not actually run the whole thing.

This is very useful when your local PC configuration cannot handle the data size but you still like developing locally in Jupyter, VS Code, or Atom.

df_train = pd.read_csv("train.csv", nrows=1000)  # assumes pandas is imported as pd

4. Anticipate failures (the sign of a mature developer)

Always check for NA values in the data, because they will cause you problems later. Even if your current data doesn't have any, that doesn't mean they won't appear in future retraining loops. So keep the checks anyway. 😆

print(len(df))            # rows before cleaning
print(df.isna().sum())    # missing values per column
df = df.dropna()          # dropna returns a new DataFrame, so reassign it
print(len(df))            # rows after cleaning

5. Show the progress of processing

When you are working with big data, it definitely feels good to know how much time the processing is going to take and where you are in the overall job. A small sketch of both options follows below.

Option 1 — tqdm

Option 2 — fastprogress
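
A minimal sketch of both options, assuming tqdm and fastprogress are pip-installed; the loop body is a stand-in for your real processing.

from tqdm import tqdm
from fastprogress.fastprogress import progress_bar

rows = range(1_000_000)

# Option 1 - tqdm: wrap any iterable to get a live progress bar
total = 0
for i in tqdm(rows, desc="processing"):
    total += i

# Option 2 - fastprogress: same idea, different look
total = 0
for i in progress_bar(rows):
    total += i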


6. Pandas can be slow

If you have worked with pandas, you know how slow it can get at times, especially with groupby. Rather than racking our brains to find clever speedups, we can just use modin by changing one line of code.

import modin.pandas as pd

7. Time the functions

Not all functions are created equal.

Even if the whole code works, it doesn't mean you wrote great code. Some soft bugs can actually make your code slower, and it's necessary to find them. Use a timing decorator to log how long each function takes; a sketch of one follows below.
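
The decorator in the original post was shared as a gist that isn't reproduced here; this is a minimal sketch of that kind of timing decorator.

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def timed(func):
    """Log how long the wrapped function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logging.info("%s took %.3f seconds", func.__name__, time.perf_counter() - start)
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(10_000_000)   # logs something like: slow_sum took 0.31 seconds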

8. Don’t burn money on cloud

Nobody likes an engineer who wastes cloud resources.

Some of our experiments can run for hours. It's difficult to keep track of them and shut down the cloud instance when they're done. I have made this mistake myself and have also seen people leave instances running for days.

This happens when we work on Fridays, leave something running, and only realise it on Monday. 😆

Just call a shutdown function at the end of execution, and your ass will never be on fire again!

But also wrap the main code in try and call that method in the except block as well, so that if an error happens, the server is not left running. Yes, I have dealt with these cases too. 😅 A sketch follows below.
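
The original helper isn't reproduced in this excerpt, so here is a hedged sketch of one way to do it on AWS EC2. It assumes boto3 and requests are installed and that the instance has permission to stop itself; run_experiment is a placeholder for your own job, not anything from the original post.

import boto3
import requests

def run_experiment():
    """Placeholder for your actual long-running training job."""
    print("training...")

def stop_this_instance(region="us-east-1"):
    """Stop the EC2 instance this code is running on (illustrative sketch)."""
    # The EC2 instance metadata service exposes the current instance ID
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text
    boto3.client("ec2", region_name=region).stop_instances(InstanceIds=[instance_id])

try:
    run_experiment()
finally:
    stop_this_instance()    # runs whether the job finishes cleanly or crashes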

Let’s be a bit responsible and not generate CO2. 😅

9. Create and save reports

After a particular point in modeling, all great insights come only from error and metric analysis. Make sure to create and save well-formatted reports for yourself and your manager.

Anyway, management loves reports, right? 😆
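
As one hedged example of what such a report could look like for a classifier, here is a small sketch using scikit-learn and pandas; the labels, predictions, and file names are placeholders.

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # placeholder ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # placeholder model predictions

# Per-class precision/recall/F1 as a DataFrame, persisted for later review
report = pd.DataFrame(classification_report(y_true, y_pred, output_dict=True)).T
report.to_csv("metrics_report.csv")

# Keep the confusion matrix alongside it
pd.DataFrame(confusion_matrix(y_true, y_pred)).to_csv("confusion_matrix.csv")
print(report)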

10. Write great APIs

All that ends bad is bad.

You can do great data cleaning and modeling, but you can still create huge chaos at the end. My experience tells me that many people are not clear about how to write good APIs, documentation, and server setups.

Below is a good methodology for deploying classical ML and DL models under a not-too-high load (around 1,000 requests per minute).

Meet the combo — Fastapi + uvicorn

  • Fastest — Write the API in FastAPI because it's the fastest for I/O-bound work as per this, and the reason is explained here.
  • Documentation — Writing the API in FastAPI gives us free documentation and test endpoints at http:url/docs, auto-generated and updated by FastAPI as we change the code.
  • Workers — Deploy the API using uvicorn.

Run these commands to deploy using 4 workers. Optimise the number of workers by load testing.

pip install fastapi uvicorn
uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000
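
For reference, here is a minimal sketch of the main.py module that the command above assumes; the endpoint names, request schema, and fake scoring logic are placeholders, not the post's actual API. The auto-generated docs mentioned above are then served at /docs on whatever host and port you pass to uvicorn.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Model API")

class PredictRequest(BaseModel):
    text: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    # Replace this with a real model call; the score here is faked
    score = min(len(req.text) / 100, 1.0)
    return {"input": req.text, "score": score}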


Original. Reposted with permission.
